VLG Group
Group Meetings
Y. LeCun's website
CS at Courant
Courant Institute

Sparse Coding for Feature Learning

  • Time Period: 2008 - present.
  • Participants: Karol Gregor, Koray Kavukcuoglu, Arthur Szlam, Rob Fergus, Yann LeCun (Courant Institute/CBLL).
  • Sponsors: ONR, NSF.
  • Description: We are developing unsupervised (and semi-supervised) learning algorithms to learn visual features. The methods are based on the idea that good feature representation should be high dimensional (so as to facilitate the separability of the categories), should contain enough information to reconstruct the input near regions of high data density, and should not contain enough information to reconstruct inputs in regions of low data density. This lead us to use sparsity criteria on the feature vectors. Many of the sparse coding methods we have developed include a feed-forward predictor (a so-called encoder) that can quickly produce an approximation of the optimal sparse representation of the input. This allows us to use the learned feature extractor in real-time object recognition systems. Variant of sparse coding are proposed, including one that uses group sparsity to produce locally invariant features, two methods that separate the "what" from the "where" using temporal constancy criteria, and two methods for convolutional sparse coding, where the dictionary elements are convolution kernels applied to images.

Examples of filters learned from natural image patches using the Predictive Sparse Decomposition algorithm (PSD).

Sparse Coding

The objective of sparse coding is to reconstruct an input vector (e.g. an image patch) as a linear combination of a small number of vectors picked from a large dictionnary. This is common known as basis pursuit.

In the most common form, the feature vector Z that represents an input vector Y (the optimal code for Y), is the one that minimizes the following L2-L1 energy function:

The process can be described by a factor graph:

where red rectangles represent factors, circles represent variables, and blue modules represent deterministic functional relationships.

As shown by Olshausen and Field in 1997, the basis functions can be learned by minimizing the free energy, averaged over a training set of input samples with respect to the dictionary matrix Wd whose columns are the dictionary elements. The norm of the dictionary elements is constrained to be less than 1. The minimization can be performed through stochastic gradient descent.

Predictive Sparse Decomposition (PSD)

One problem with traditional sparse coding is that inference is somewhat slow. Give an input vector, finding the corresponding code vector requires an L2/L1 optimization. Having to do this for every patch in an image would preclude the use of sparse coding for high-speed image recognition.

Predictive Sparse Decomposition (PSD) alleviates this problem by training a simple, feed-forward encoder module to predict an approximation to the optimal sparse code. The factor graph is shown below.

This could be seen as a kind of auto-encoder. The energy function has an additional term, which measure the discrepancy between the predicted code and actual code. The encoder can take several function forms. In the simplest instance, it is simply a linear transform, followed by a sigmoid non-linearity, and a trainable diagonal gain matrix.

The animation below shows the filters being learned by the PSD learning procedure, when trained on 12x12 pixel patches from natural images.

Learned Iterative Shrinkage and Thresholding Algorithm (LISTA)

The simple encoder described above generally doesn't predict sparse codes. In this work (published at ICML 2010), we propose to build the encoder architecture so as to follow that of the so-called ISTA or FISTA algorithm (Fast Iterative Shrinkage and Thresholding Algorithm), which has been shown to converge to the optimal sparse code solution. We build the encoder as a feed forward network with the same architecture as an time-unrollled version version of the ISTA/FISTA algorithm (trucated to a few iterations). The block diagram of ISTA/FISTA and the corresponding encoder is shown below. With ISTA and FISTA, the two matrices We and S have prescribed values. We is the transpose of the dicionary matrix, and S is the identity minus some constant times the gram matrix of the dictionary elements. In our method, called LISTA (Learned ISTA), the matrices are trained so as to produce the best approximation of the optimal sparse code with the given number of iterations.

The approximations are shown experimentally to be considerably better than simply truncating the FISTA algorithm to the same number of iterations. A learned version of the Coordinate Descent algorithm (which is faster than FISTA) is also presented.

Convolutional PSD

A problem with sparse coding, when applied to images, is that it is trained on individual patches, yet it is often used convolutionally (applied to sliding windows over the whole image). Because the training is done on indivudual patches, the dictionary elements must cover the entire patch. Hence we the same Gabor filter replicated and multiple shifted locations over the patch. This is very wasteful when features are to be extracted over entire images using a sliding window approach: the code ends up being extremely redundant.

Convolutional Sparse Coding consists in viewing the dictionary elements as convolution kernels applied to a large image region, as shown in the diagram below

With this method, the dictionary elements are considerably less redundant. Since there are shifted versions of each dictionary element because of the convolution operation, the algorithm doesn't need to learn multiple shifted replicas of each filter to cover the patches. This results in considerably more diverse filters, such as center-surround, cross and corner detectors, Gabors at multiple frequencies, etc....

Our colleagues Zeiler, Taylor, Krishnan, and Fergus described such a method, called deconvolutional networks in a CVPR 2010 publication. Our method is a convolutional version of PSD, which includes a fast encoder. The fast encoder uses the soft-shrinkage function as its non-linearity.

Sparse features learned with Convolutional PSD yield much better results on Caltech-101 than sparse features learned with conventional sparse coding (or PSD) trained at the patch level:

Convolutional Matching Pursuit

Matching Pursuit is a "greedy" form of sparse coding. We have proposed a convolutional form of Matching Pursuit that produces extremely parsimonious representations of images.

The picture below (left) shows examples of filters learned with Convolutional MP. On the right, we see an image of a face (pre-processed with high-pass filtering and contrast normalization), together with its reconstruction with Convolutional MP using only 40 non-zero coefficients.

Learning Invariant Features with Group Sparsity

Group sparsity consists in grouping components of the sparse code into groups. The sparsity criterion applies to groups. It tends to minimize the number of groups that contain non-zero code components, but if one group is active, it doesn't restrict the number of non-zero components within the group. This is acheived using an L2 norm of the code components in each group as the sparsity criterion.

During learning, the group sparsity criterion tends to produce filters within a group that fire together. Hence, filters within a group tend to be very similar (e.g. Gabor edges at similar orientations).

In the chart below, the code components were arranged in a 2D toroidal topology, and the groups were overlaping sqquares (as shown). The filters arrange themselves in a smoothly-varying topographic map.

Learning Locally-Connected Filters with Group Sparsity

Sparse coding was originally proposed as a model of simple cells in the visual cortex. In the present project, the filters are arranged over an input image. EAch filter has a local receptive field on the image. A layer of complex-cell-like pooling combines the outputs of local groups of units. A sparsity criterion is applied to these groups. The result is a topographic map with pinwheel patterns similar to those observed in the visual cortex.


arXiv Arthur Szlam, Koray Kavukcuoglu, Yann LeCun: Convolutional Matching Pursuit and Dictionary Training, October 2010. arXiv:1010.0422
arXiv Karol Gregor, Yann LeCun: Emergence of Complex-Like Cells in a Temporal Product Network with Local Receptive Fields, June 2010. arXiv:1006.0448
148. Karol Gregor and Yann LeCun: Learning Fast Approximations of Sparse Coding, Proc. International Conference on Machine learning (ICML'10), 2010, \cite{gregor-icml-10}. 178KBDjVu

146. Y-Lan Boureau, Francis Bach, Yann LeCun and Jean Ponce: Learning Mid-Level Features for Recognition, Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'10), IEEE, 2010, \cite{boureau-cvpr-10}. 178KBDjVu

144. Yann LeCun, Koray Kavukvuoglu and Clément Farabet: Convolutional Networks and Applications in Vision, Proc. International Symposium on Circuits and Systems (ISCAS'10), IEEE, 2010, \cite{lecun-iscas-10}. 189KBDjVu

140. Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun: What is the Best Multi-Stage Architecture for Object Recognition?, Proc. International Conference on Computer Vision (ICCV'09), IEEE, 2009, \cite{jarrett-iccv-09}. 303KBDjVu

137. Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus and Yann LeCun: Learning Invariant Features through Topographic Filter Maps, Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'09), IEEE, 2009, \cite{koray-cvpr-09}. 334KBDjVu

9. Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun: Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition, Computational and Biological Learning Lab, Courant Institute, NYU, Tech Report CBLL-TR-2008-12-01, 2008, \cite{koray-psd-08}. 99KBDjVu

128. Marc'Aurelio Ranzato, Y-Lan Boureau and Yann LeCun: Sparse feature learning for deep belief networks, Advances in Neural Information Processing Systems (NIPS 2007), 2007, \cite{ranzato-nips-07}. 129KBDjVu

127. Marc'Aurelio Ranzato and Yann LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images, Proc. International Conference on Document Analysis and Recognition (ICDAR), 2007, \cite{ranzato-icdar-07}. 139KBDjVu

125. Yann LeCun, Sumit Chopra, Marc'Aurelio Ranzato and Fu-Jie Huang: Energy-Based Models in Document Recognition and Computer Vision, Proc. International Conference on Document Analysis and Recognition (ICDAR), (keynote address), 2007, \cite{lecun-icdar-keynote-07}. 110KBDjVu

119. Marc'Aurelio Ranzato, Fu-Jie Huang, Y-Lan Boureau and Yann LeCun: Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, Proc. Computer Vision and Pattern Recognition Conference (CVPR'07), IEEE Press, 2007, \cite{ranzato-cvpr-07}. 187KBDjVu

118. Marc'Aurelio Ranzato, Y-Lan Boureau, Sumit Chopra and Yann LeCun: A Unified Energy-Based Framework for Unsupervised Learning, Proc. Conference on AI and Statistics (AI-Stats), 2007, \cite{ranzato-unsup-07}. 257KBDjVu

115. Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra and Yann LeCun: Efficient Learning of Sparse Representations with an Energy-Based Model, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems (NIPS 2006), MIT Press, 2006, \cite{ranzato-06}. 152KBDjVu