
This page contains the schedule, slides from the lectures, lecture notes, reading lists,
assignments, and web links.
I urge you to download the DjVu viewer
and view the DjVu version of the documents below. They display faster,
are of higher quality, and generally have smaller file sizes than the PS and PDF versions.
01/26: Multilayer Learning 
Paper
02/01: Target Prop Algorithms 
02/08: Unsupervised Feature Learning 
 Marc'Aurelio Ranzato: Symmetric Product of Experts.
02/15: Unsupervised Learning 
 A short review of statistical physics concepts: energy, entropy,
free energy, Gibbs distribution (Yann). (A minimal summary of these quantities follows the reading list below.)
 Helmholtz Machines: this
page. Either (Hinton and Zemel, NIPS 1994),
(Zemel and Hinton, Neural Computation 1995), (Hinton, Dayan, Frey, and
Neal, Science 1995), or (Dayan, Hinton, Neal, Zemel, Neural Computation
1995), or some combination thereof (Alyssa, Piotr, Marina).
 Hinton:
Training Products of Experts by Minimizing Contrastive Divergence.
Neural Computation, 2002 (Philip, Marco, Marc'Aurelio).
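For reference, here is the promised summary of the statistical physics quantities, under the standard conventions (states x, energy E(x), inverse temperature \beta = 1/T, Boltzmann constant set to 1):

    P(x) = \frac{e^{-\beta E(x)}}{Z}, \qquad Z = \sum_x e^{-\beta E(x)}

    F = -\frac{1}{\beta} \log Z = \langle E \rangle_P - \frac{1}{\beta} H(P), \qquad H(P) = -\sum_x P(x) \log P(x)

The Gibbs distribution is the P that minimizes the free energy F, and the partition function Z is the same normalization constant that reappears in the notes on undirected models below.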
 Bayesian belief nets
 directed graphical models
 graphical models with loops are generally intractable
 conditional probability tables can be inverted with Bayes' rule:
the directions of the arrows don't matter in principle
(they do not express causality, just dependency).
 undirected graphical models: the likelihood is a product of potential functions
 Markov random fields: graphical models with local interactions
 undirected graphical models with potential functions
must be normalized explicitly. The partition function problem.
 factor graphs: each potential function is explicitly represented
(a slightly more general representation of graphical models).
 logarithmic representation: the factors are additive energy
functions. The likelihood is proportional to exp(-energy).
 energy-based models: factor graphs without normalization (no
partition function). Can be used when no explicit probabilities
are required: only the relative values of the energies matter.
 representing common models as factor graphs:
for example, an HMM is a "comb". (A small numerical sketch of these ideas follows these notes.)
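To make these notes concrete, here is a small numerical sketch (a toy example of my own, not code from the lecture) of a pairwise undirected model over three binary variables: it evaluates the product of potential functions, computes the partition function by brute-force enumeration, and checks the equivalent log-domain (energy) representation.

    import itertools
    import math

    # Toy chain-structured MRF x1 - x2 - x3 over binary variables,
    # with one potential function per edge.

    def edge_potential(a, b):
        # Ising-like coupling: favors equal neighboring values.
        return 2.0 if a == b else 1.0

    def unnormalized_p(x):
        # Undirected model: the likelihood is a product of potentials.
        p = 1.0
        for a, b in zip(x, x[1:]):
            p *= edge_potential(a, b)
        return p

    def energy(x):
        # Logarithmic representation: additive energy terms per factor,
        # with unnormalized_p(x) == exp(-energy(x)).
        return -sum(math.log(edge_potential(a, b)) for a, b in zip(x, x[1:]))

    # The partition function problem: Z is a sum over all 2^3 states,
    # which is exactly what becomes intractable for large or loopy models.
    states = list(itertools.product([0, 1], repeat=3))
    Z = sum(unnormalized_p(x) for x in states)

    for x in states:
        assert abs(unnormalized_p(x) - math.exp(-energy(x))) < 1e-12
        print(x, "P =", unnormalized_p(x) / Z)

An energy-based model would stop before computing Z and work with the relative energies alone, e.g. by picking the minimum-energy configuration.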
03/08: Independent Component Analysis, Source Separation 
 Bell AJ, Sejnowski TJ (1995) "An
information-maximization approach to blind separation and blind
deconvolution," Neural Computation, 7:1129-1159.
A shorter/earlier version of Bell and Sejnowski is also available
(Crispy, Jie, George). (A toy sketch of the Infomax update follows this reading list.)
 Zibulevsky & Pearlmutter: Blind Source Separation by Sparse
Decomposition in a Signal Dictionary. Neural Computation, 13(4):863-882, 2001.
[DjVu]
[PDF]
(Jeremy, Sumit, Koray).
 Hinton, G. E., Welling, M., Teh, Y. W., and Osindero, S.
A New View of ICA.
Proceedings of ICA-2001, San Diego (Raia, Yury, Jihun).
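As a rough illustration of the Infomax approach in the Bell and Sejnowski paper, here is a toy sketch (my own code, not the authors'): it uses the natural-gradient form of the update with a logistic nonlinearity, and the mixing setup, learning rate, and iteration count are arbitrary choices for this example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy blind source separation: two super-Gaussian (Laplacian) sources
    # mixed by an unknown matrix A; we learn an unmixing matrix W.
    n, T = 2, 10000
    S = rng.laplace(size=(n, T))
    A = rng.standard_normal((n, n))
    X = A @ S
    X -= X.mean(axis=1, keepdims=True)   # center the mixtures

    W = np.eye(n)
    lr = 0.05   # may need tuning

    for epoch in range(500):
        U = W @ X                        # candidate source estimates u = W x
        Y = 1.0 / (1.0 + np.exp(-U))     # logistic nonlinearity g(u)
        # Natural-gradient form of the Infomax update:
        # dW = (I + (1 - 2 g(u)) u^T) W, averaged over the batch.
        W += lr * (np.eye(n) + (1.0 - 2.0 * Y) @ U.T / T) @ W

    # If separation worked, W @ A is close to a scaled permutation matrix.
    print(np.round(W @ A, 2))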
Links, additional info
Graph Transformer Networks. Sequence labeling with energy-based factor
graphs (see Gradient-Based Learning Applied to Document Recognition,
parts 4-7, page 16 on).
(Yann)
Papers
 John Lafferty, Andrew McCallum, and Fernando Pereira.
Conditional
random fields: Probabilistic models for segmenting and labeling
sequence data. Proceedings of ICML-01, 2001 (Alyssa, Piotr,
Marina).
 Michael Collins.
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms.
EMNLP 2002. (Matt, Ayse).
 B. Taskar, C. Guestrin and D. Koller.
Max-Margin
Markov Networks. Neural Information Processing Systems Conference
(NIPS-03), 2003 (Philip, Marco, Marc'Aurelio).
03/29: Dynamic Graphical Models 
NO CLASS (Snowbird workshop)
04/12: Reinforcement Learning 
Each group will study and explain one class of RL algorithms,
with an application, as listed below. Much of the required
information can be found in Sutton and Barto's book
Reinforcement
Learning: An Introduction; a number of other sources of
introductory information are listed below. A minimal sketch of the core update rules follows the list.
 [Crispy, Jie, George]:
Q-Learning, the
Actor-Critic
architecture, and
R-Learning.
 [Jeremy, Sumit, Koray]:
TD(0),
TD(Lambda),
and the
TD-Gammon
application. See the
original
TD paper and
one of the
original TD-Gammon papers by Gerry Tesauro.
 [Raia, Yury, Jihun]:
the SARSA
and
SARSA(Lambda)
algorithms, with the
Acrobot
application.
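For orientation, here is a minimal sketch of the core tabular update rules from Sutton and Barto (my own toy code; the environment loop that would supply states, actions, and rewards is omitted):

    import random
    from collections import defaultdict

    alpha, gamma, eps = 0.1, 0.99, 0.1   # step size, discount, exploration rate
    Q = defaultdict(float)               # action values Q[(state, action)]
    V = defaultdict(float)               # state values V[state] for TD(0)

    def eps_greedy(state, actions):
        # Simple behavior policy for either control algorithm below.
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def q_learning_step(s, a, r, s2, actions):
        # Q-Learning is off-policy: bootstrap from the best next action.
        target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    def sarsa_step(s, a, r, s2, a2):
        # SARSA is on-policy: bootstrap from the action actually taken.
        target = r + gamma * Q[(s2, a2)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    def td0_step(s, r, s2):
        # TD(0) prediction: the same bootstrapped error on state values.
        V[s] += alpha * (r + gamma * V[s2] - V[s])

TD(Lambda) and SARSA(Lambda) extend these one-step updates with eligibility traces, spreading each error over recently visited states.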
Background Reading Material

