
RESEARCH PROJECTS




This page is my research statement; below you can find more details about the projects I have worked on:



OVERVIEW

I am interested in designing general algorithms that can be applied to a variety of tasks while being computationally efficient enough to be used in real-time applications. Labeled data is often very expensive to produce, rare, and noisy. In this scenario, the paucity of labeled data prevents standard supervised learning algorithms from giving good estimates of the parameters of the model, causing the system to generalize poorly to previously unseen test data. Most of my work has focused on the design of Unsupervised Learning algorithms. These algorithms aim at modeling the distribution of the input data and discovering underlying hidden structure, without relying on “labels” specifying the ultimate task a user might be interested in. Unsupervised learning algorithms can leverage large amounts of unlabeled data to produce representations of the input that are better suited to subsequent supervised modules.

In particular, my research has focused on devising unsupervised algorithms that learn sparse representations, that is, representations whose components are zero most of the time. Such representations are useful because (1) they can be very high-dimensional, and features are more likely to become linearly separable in a high-dimensional space, (2) they are more robust to noise, (3) they often produce a part-based representation that is easier to interpret, and (4) they can be learned very efficiently.

The long-term goal of my research is to learn representations that are efficient, hierarchical, and invariant to irrelevant transformations. To this end I have worked on Deep Networks, which are composed of a sequence of non-linear transformations, and I have trained them using many different Unsupervised and Semi-Supervised algorithms. For instance, consider images of objects in natural scenes. Objects can appear under different illuminations, view angles, scales, and occlusions. Moreover, there are many categories of objects (e.g. people, cougars, skyscrapers, etc.) and a lot of variability within each class. All these sources of variation yield an exponential number of combinations, corresponding to all the possible configurations of objects in natural images. Representing objects in images with just one layer of processing would be very inefficient from a computational point of view. Instead, we would like to represent objects in natural images with a factorial and concise representation that is invariant to irrelevant transformations, and hence more directly correlated to the “real causes” generating the image. Deep Networks are a useful tool because (1) they can re-use intermediate concepts to build more abstract top-level representations of the input, assuming that some “parts” can be shared across different objects, (2) they can be trained efficiently, and (3) they allow fast inference.

These methods have been applied to a variety of datasets and tasks, such as generic visual object recognition, handwriting recognition, image denoising and compression, text retrieval and classification, and autonomous robot navigation.

I have also been interested in developing the Energy-Based framework for both Supervised and Unsupervised learning. Energy-Based Models are akin to un-normalized factor graphs, and they are useful because they provide a single framework to describe both probabilistic models and more discriminative methods, such as maximum-margin methods and neural networks. The theory of Energy-Based Models gives guidelines for building learning algorithms that scale to large datasets and solve the task efficiently during inference.


SPARSE CODING

Sparse coding algorithms learn representations of the input data that have only a few components significantly different from zero, i.e. that are sparse. A sparse representation can also be overcomplete, when its dimensionality is higher than the dimensionality of the input.

For instance, Gabor wavelets produce a sparse representation of natural images. A sparse coding algorithm not only produces a similar sparse representation, but also adapts the analysis and synthesis transformations to the data (for instance, the filter banks used in the decomposition).

Sparse representations are useful because (1) features are more likely to become linearly separable in high-dimensional spaces, (2) the sparsity constraint can be used to make training more efficient (see the section about Energy-Based Models below), (3) sparse overcomplete representations are more robust to noise, and (4) they have been shown to be biologically plausible.

In our work we are interested in learning a non-linear transformation between the input and its sparse representation. This is a feed-forward mapping that produces the representation very efficiently (e.g. through a learned filter bank followed by a non-linearity), allowing the system to be used in real-time applications. Unlike other sparse coding algorithms proposed in the literature, no iterative process is necessary to compute the representation after training. Thanks to the non-linearities, we can also construct deep networks by stacking these adaptive transformations on top of each other to produce hierarchical and more abstract representations.

In order to assess the quality of the representation during training, we require that the input can be reconstructed from the representation, as a way to make sure that most of the relevant information has been preserved in the mapping. This feed-back path, mapping the representation back into input space, is used during training only, and it can simply consist of a set of linear filters. Training the parameters of this model (the feed-forward and feed-back mappings) is achieved by minimizing a loss function with terms such as the reconstruction error of the input and a sparsity penalty on the representation.
The figure on the right shows an example of the sparse features that are learned by applying the algorithm to natural images. Features are localized oriented edge detectors at different scales. Training these filters from random initial conditions requires a few thousand parameter updates, as shown in the animation on the left.
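
To make this concrete, here is a minimal NumPy sketch of such a training loop; the tanh encoder, linear decoder, L1 sparsity penalty, and plain stochastic gradient descent are illustrative choices, not necessarily the exact ones used in the papers below:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_code = 64, 128                      # e.g. 8x8 patches, overcomplete code
    W_e = rng.normal(0.0, 0.1, (n_code, n_in))  # feed-forward (encoder) filter bank
    W_d = rng.normal(0.0, 0.1, (n_in, n_code))  # feed-back (decoder) linear filters
    lam, lr = 0.1, 0.01                         # sparsity weight and learning rate

    def encode(x):
        # fast feed-forward mapping: filter bank followed by a non-linearity
        return np.tanh(W_e @ x)

    for step in range(10000):
        x = rng.normal(size=n_in)               # stand-in for a real image patch
        z = encode(x)
        err = W_d @ z - x                       # reconstruction error term
        # loss = ||W_d z - x||^2 + lam * |z|_1 ; backprop through both paths
        dz = 2.0 * (W_d.T @ err) + lam * np.sign(z)
        W_d -= lr * 2.0 * np.outer(err, z)
        W_e -= lr * np.outer(dz * (1.0 - z**2), x)

After training, only encode() is needed at test time, which is what makes the representation cheap enough for real-time use.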


PAPERS
- Efficient Learning of Sparse Overcomplete Representations with an Energy-Based Model (NIPS 2006)
- Sparse Feature Learning for Deep Belief Networks (NIPS 2007)


LEARNING INVARIANT REPRESENTATIONS

A sparse coding algorithm can be modified to learn representations that are not only sparse, but also invariant to some transformations. Invariance can be learned by adding terms to the loss function minimized during training, exploiting prior knowledge about the transformation, or exploiting general properties such as temporal coherence across consecutive video frames or spatial stationarity in static images; a sketch of such a loss term is given below. Another possibility is to change the architecture of the system to produce the desired invariance; the example after the sketch follows this second route.
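
As an illustration of the first route, here is a minimal sketch of a temporal-coherence penalty; this particular squared-difference form is one common choice and is an assumption here, not necessarily the exact term used in the papers below:

    import numpy as np

    def temporal_coherence_penalty(Z):
        # Z holds the code of consecutive video frames, one row per frame;
        # penalizing the change of the code from frame to frame encourages
        # the representation to be invariant to whatever varies between frames
        return np.sum((Z[1:] - Z[:-1]) ** 2)

    Z = np.random.default_rng(0).normal(size=(16, 128))  # codes of a 16-frame clip
    print(temporal_coherence_penalty(Z))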

Consider the problem of detecting the orientation of bars placed at random locations in a binary image. The feed-forward path has a (learned) filter bank detecting bars at each possible orientation. By taking the largest value in the output map produced by each filter, we obtain a representation that is invariant to shifts. In the figure, the first filter detects vertical bars. No matter where a vertical bar is located, the largest value of the convolution of the input with this filter will always be the same (hence, the invariance to shifts). Since during training the filters are learned by reconstructing the input from the representation, we have to be able to recover the information about the bar location in order to generate good reconstructions. Hence, we store the positions of the largest values in another set of variables, called the "transformation parameters". The feed-back path can then reconstruct the input by using not only the sparse and shift-invariant representation, but also the transformation parameters that specify where the features were found in the image.
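
The following NumPy sketch illustrates this max-pooling mechanism on a toy 8x8 image; the vertical-bar filter is hand-set here rather than learned, and the reconstruction rule is a simplified stand-in for the actual feed-back path:

    import numpy as np

    def correlate2d_valid(img, f):
        # plain 'valid' cross-correlation, written out to avoid extra dependencies
        H, W = img.shape
        h, w = f.shape
        out = np.zeros((H - h + 1, W - w + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i + h, j:j + w] * f)
        return out

    f = np.zeros((3, 3))
    f[:, 1] = 1.0                          # hand-set vertical-bar detector

    img = np.zeros((8, 8))
    img[2:5, 6] = 1.0                      # a vertical bar at an arbitrary location

    m = correlate2d_valid(img, f)
    code = m.max()                         # shift-invariant feature value
    pos = np.unravel_index(m.argmax(), m.shape)  # "transformation parameters"

    # feed-back path: reconstruct by placing the filter at the stored position
    rec = np.zeros_like(img)
    rec[pos[0]:pos[0] + 3, pos[1]:pos[1] + 3] = (code / np.sum(f * f)) * f
    assert np.allclose(rec, img)           # the bar is recovered exactly

The code value is the same wherever the bar appears, while pos alone carries the location information needed for reconstruction.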

PAPERS
- Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition (CVPR 2007)


LEARNING DEEP HIERARCHIES OF FEATURES

The unsupervised learning algorithms described above can be used as building blocks to train deep networks in a layer-wise fashion. Once a layer has been trained, its output is used as input to train the layer above. Since these algorithms are based on general principles, like reconstruction to preserve information and sparsity to regularize the representation, they are “replicable”, i.e. they can be used to train each layer of a hierarchical model in sequence.
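
A minimal sketch of this greedy layer-wise procedure, reusing the kind of sparse auto-encoder layer sketched in the sparse coding section; the data, layer sizes, and hyper-parameters are illustrative placeholders:

    import numpy as np

    def train_layer(X, n_code, steps=5000, lr=0.01, lam=0.1, seed=0):
        # fit one sparse auto-encoder layer on the rows of X, return the encoder
        rng = np.random.default_rng(seed)
        n_in = X.shape[1]
        W_e = rng.normal(0.0, 0.1, (n_code, n_in))
        W_d = rng.normal(0.0, 0.1, (n_in, n_code))
        for _ in range(steps):
            x = X[rng.integers(len(X))]
            z = np.tanh(W_e @ x)
            err = W_d @ z - x
            dz = 2.0 * (W_d.T @ err) + lam * np.sign(z)
            W_d -= lr * 2.0 * np.outer(err, z)
            W_e -= lr * np.outer(dz * (1.0 - z**2), x)
        return W_e

    X = np.random.default_rng(1).normal(size=(500, 64))  # stand-in for real data
    encoders = []
    for n_code in (256, 64, 10):       # layer sizes here are purely illustrative
        W_e = train_layer(X, n_code)
        encoders.append(W_e)
        X = np.tanh(X @ W_e.T)         # this layer's output trains the next layer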

Using deep models has an important computational advantage: a deep model can re-use intermediate representations to learn more abstract representations that, hopefully, are more easily correlated to the underlying causes generating the data.

For instance, by applying a sparse coding algorithm to handwritten digits we are able to learn feature detectors that resemble the “strokes” used to form the digits (the figure shows a small selection of these filters). These detectors reveal low-level and fairly localized correlations in the input data.

By using the representation produced by these detectors as input to another sparse coding stage that has only 10 units in its representation, we are actually able to learn the highly non-linear mapping between input pixels and labels in an unsupervised way. The figure below shows what each unit in this second-level representation encodes, by mapping it back into input space (through the feed-back paths of the second and first stages). The second-stage detectors are able to put together the “strokes” detected by the first stage, discovering frequent “suspicious” patterns. Such a highly non-linear mapping could not be discovered using only a single layer of processing.


PAPERS
- Efficient Learning of Sparse Overcomplete Representations with an Energy-Based Model (NIPS 2006)
- Sparse Feature Learning for Deep Belief Networks (NIPS 2007)
- Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition (CVPR 2007)
- Semi-supervised Learning of Compact Document Representations with Deep Networks (ICML 2008)


OBJECT RECOGNITION

Deep architectures trained in an unsupervised way to produce sparse and locally shift-invariant representations have been applied to a variety of datasets. By using the representation as input to a linear supervised classifier, we have built a system for real-time object recognition.
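
Schematically, the recognition pipeline looks like the following sketch; the features are random stand-ins for the network's output, and scikit-learn's LinearSVC is just one convenient choice of linear classifier, not necessarily the one used in our experiments:

    import numpy as np
    from sklearn.svm import LinearSVC   # any linear classifier would do here

    rng = np.random.default_rng(0)
    # stand-ins for the sparse, shift-invariant features produced by the network
    Z_train = rng.normal(size=(1000, 128))
    y_train = rng.integers(0, 10, size=1000)
    Z_test = rng.normal(size=(200, 128))
    y_test = rng.integers(0, 10, size=200)

    clf = LinearSVC().fit(Z_train, y_train)  # only this stage uses the labels
    print("test error rate:", 1.0 - clf.score(Z_test, y_test))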

We currently hold the performance record for recognizing the handwritten digits of the MNIST dataset, both using the original training dataset (error rate equal to 0.6%) and the training dataset augmented with elastically distorted digits (error rate equal to 0.4%); in the latter case, the performance is the same as that achieved by Patrice Simard (ICDAR 2003).

We have also applied these techniques to recognize object categories in the Caltech 101 dataset, which has 101 natural object categories and up to 30 training instances per class. We reported an average accuracy of 54%; the accuracy on a few example categories is given in the figure below.



PAPERS
- Efficient Learning of Sparse Overcomplete Representations with an Energy-Based Model (NIPS 2006)
- Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition (CVPR 2007)


IMAGE DENOISING AND COMPRESSION

Sparse coding algorithms can also be used to denoise images. Assuming we know a model of the corruption process (e.g. additive Gaussian noise), we can train the forward path of the model to map noisy inputs to a sparse representation, and force the feed-back path to reconstruct the original un-corrupted input from that representation. After training, the model denoises an image by first mapping the input into a sparse representation, and then mapping the sparse representation back into input space through the feed-back path. Since a sparse representation concentrates the energy in a few coefficients, adding noise to the input changes the representation very little in terms of the relative magnitude between active and approximately zero coefficients, yielding good denoising results.
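
A minimal sketch of this denoising training scheme, using the same illustrative encoder/decoder as in the sparse coding sketch above; the patches, noise level, and hyper-parameters are placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_code = 64, 256
    W_e = rng.normal(0.0, 0.1, (n_code, n_in))
    W_d = rng.normal(0.0, 0.1, (n_in, n_code))
    sigma, lam, lr = 0.5, 0.1, 0.005      # noise level, sparsity weight, step size

    for step in range(20000):
        clean = rng.normal(size=n_in)     # stand-in for a clean image patch
        noisy = clean + sigma * rng.normal(size=n_in)  # known corruption model
        z = np.tanh(W_e @ noisy)          # forward path sees the NOISY input
        err = W_d @ z - clean             # feed-back must match the CLEAN input
        dz = 2.0 * (W_d.T @ err) + lam * np.sign(z)
        W_d -= lr * 2.0 * np.outer(err, z)
        W_e -= lr * np.outer(dz * (1.0 - z**2), noisy)

    def denoise(x_noisy):
        return W_d @ np.tanh(W_e @ x_noisy)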

Our experiments reported state-of-the-art performance in denoising natural images corrupted by additive Gaussian noise at medium and high noise levels. The figure below shows an example: the original image is on the left, the corrupted image in the middle, and the denoised image on the right.

We also ran experiments on text document image compression. After learning sparse representations of document images, a page is compressed by encoding the representation and the location of the patches that contain characters. Very promising results were obtained; an example of document image reconstruction is given below (compression ratio of 95, with 1.35% of pixels flipped).


PAPERS
- A Unified Energy-Based Framework for Unsupervised Learning (AISTATS 2007)
- A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images (ICDAR 2007)


TEXT RETRIEVAL AND CLASSIFICATION

We devised a Semi-Supervised algorithm to learn text document representations, and we used these representations for classification and retrieval tasks. Nowadays the most popular document representation is the bag-of-words, which stores the number of times each word of the dictionary appears in a given document. This is conceivably sub-optimal because it treats words as independent of each other, disregarding synonyms and polysemous words.
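
For concreteness, here is a tiny example of building a bag-of-words vector over a toy four-word dictionary:

    from collections import Counter

    vocabulary = ["energy", "learning", "model", "sparse"]   # toy dictionary
    document = "sparse coding is a learning method and sparse codes help the model"

    counts = Counter(document.split())
    bow = [counts[word] for word in vocabulary]
    print(bow)   # -> [0, 1, 1, 2]: word order is lost, words treated independently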

In this work, we developed an algorithm to train a deep network that extracts representations of documents. Training proceeds by minimizing an objective function that combines the Unsupervised principle of good reconstruction of the input from the representation with the Supervised information (e.g. the topic) that might accompany the document. The figure on the left shows that a representation learned by combining both objectives outperforms the re-weighted input bag-of-words representation (tf-idf) as well as the representation produced by the same algorithm trained in a totally unsupervised way (20 Newsgroups dataset). Also, the performance does not degrade significantly if we compress the representation as we go from the first to the last layer: in this experiment the first layer produces a 200-dimensional representation, while the top layer produces a representation with just 20 components.

If the last layer of the deep network has only two components, we can map the data onto the plane for visualization. The figure on the right shows the learned mapping for the documents of the Ohsumed dataset (medical abstracts). The 30,000-dimensional input space is mapped onto the plane in such a way that documents belonging to the same class cluster together, and similar topics are placed next to each other.
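
A minimal sketch of one SGD step on such a combined objective; the tanh encoder, linear decoder, softmax classifier on top of the code, and the weighting alpha are illustrative assumptions, not the exact architecture of the paper below:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_code, n_class = 2000, 200, 20   # word counts -> code -> topic
    W_e = rng.normal(0.0, 0.01, (n_code, n_in))
    W_d = rng.normal(0.0, 0.01, (n_in, n_code))
    W_c = rng.normal(0.0, 0.01, (n_class, n_code))
    alpha, lr = 1.0, 0.01                   # weight of the supervised term

    def update(x, label=None):
        """One SGD step; 'label' is None for unlabeled documents."""
        global W_e, W_d, W_c
        z = np.tanh(W_e @ x)
        err = W_d @ z - x                   # unsupervised term: reconstruction
        dz = 2.0 * (W_d.T @ err)
        W_d -= lr * 2.0 * np.outer(err, z)
        if label is not None:               # supervised term, only when available
            s = W_c @ z
            p = np.exp(s - s.max())
            p /= p.sum()
            p[label] -= 1.0                 # gradient of the cross-entropy loss
            dz += alpha * (W_c.T @ p)
            W_c -= lr * alpha * np.outer(p, z)
        W_e -= lr * np.outer(dz * (1.0 - z**2), x)

    update(rng.poisson(0.01, n_in).astype(float))           # unlabeled document
    update(rng.poisson(0.01, n_in).astype(float), label=3)  # labeled document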



PAPERS
- Semi-supervised Learning of Compact Document Representations with Deep Networks (ICML 2008)


ENERGY-BASED MODELS

The Energy-Based Model is a framework for devising efficient learning algorithms. In the supervised case, an energy is associated to each pair of input sample and target value. Inference consists of minimizing the energy with respect to the target, and learning consists of defining a loss functional, which is a function of the energy, whose minimization yields the desired behavior, i.e. the energy is lowest for the target that corresponds to the true label. Probabilistic models are an instance of Energy-Based Models: they correspond to models trained with a particular choice of loss functional, the negative log-likelihood, which is often intractable to compute. Energy-Based Models are useful in high-dimensional spaces, where the normalization requirement imposed by probabilistic models becomes computationally intractable. In some sense, an Energy-Based Model can be interpreted as a “local probabilistic model”, because it only has to assign the correct shape to the distribution of the data around regions of interest, like those populated by training samples. There are conditions on the energy and the loss functional that guarantee learning and correct inference, and there are extensions to the case in which models have latent variables. The figure below shows the energy function of two models trained with different loss functionals. Despite the different shapes of the energies, the two models achieve the same goal, i.e. they give lower energy to the training samples (the blue dots).
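
To make the supervised case concrete, here is a toy sketch with a linear energy per candidate label, minimum-energy inference, and one step of a margin loss; the linear energy and the update rule are illustrative choices, not the framework's only instantiation:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_class = 10, 4
    W = rng.normal(0.0, 0.1, (n_class, n_in))

    def energies(x):
        # one energy per candidate label; here E(x, y) = -W[y] . x
        return -(W @ x)

    def infer(x):
        # inference: pick the answer with the lowest energy
        return int(np.argmin(energies(x)))

    def margin_update(x, y, margin=1.0, lr=0.1):
        # one step of a margin loss: make E(x, y) lower than the energy of
        # the most offending incorrect answer by at least 'margin'
        global W
        e = energies(x)
        e_masked = np.where(np.arange(n_class) == y, np.inf, e)
        y_bad = int(np.argmin(e_masked))
        if e[y] + margin > e[y_bad]:
            W[y] += lr * x        # push the correct answer's energy down
            W[y_bad] -= lr * x    # pull the offending answer's energy up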


In Unsupervised Learning, labels are not provided. Therefore, the energy is a function of the input data, and possibly of some latent variables. The goal of learning is to assign lower energy to samples that are similar to the training samples, and higher energy to the other points in input space (or at least to those points that are in a neighborhood of the training samples). This problem can be interpreted as an “energy carving” problem, because learning consists of adjusting the parameters of the model to carve the energy surface in such a way that training samples always lie in the valleys of the surface (i.e. at local minima). In general, it is not sufficient to adjust the parameters so that only the energy of the training samples is decreased, because the energy surface might then become flat, corresponding to a model that cannot discriminate between points that are similar to the training samples and points that are dissimilar.

Some unsupervised learning algorithms learn good energy surfaces by using a loss functional with two terms: the first decreases the energy of the training samples, while the second increases the energy of a properly chosen point in input space. However, it might be difficult and computationally expensive to identify a good candidate point at which to pull up the energy, especially in high-dimensional spaces. Other methods achieve the same goal of energy carving by replacing this explicit pull-up term in the loss with constraints on the representation, such as sparsity. By making sure that the internal representation of the model can carry only a limited amount of information, the energy surface takes the right shape without using a contrastive term in the loss functional. This can lead to the design of far more efficient unsupervised learning algorithms.
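
Here is a toy sketch of the explicit two-term strategy on points sampled from a spiral, echoing the figure below: the energy of an auto-encoder (its reconstruction error) is pulled down at training points and pushed up at nearby contrastive points. The model, the choice of contrastive point, and the numerical gradient are all illustrative simplifications:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(0.0, 0.1, (2, 20))     # a 2-20-2 auto-encoder, as in the figure

    def energy(x):
        # energy = reconstruction error of the point through the auto-encoder
        return np.sum((W @ np.tanh(W.T @ x) - x) ** 2)

    def grad_W(x, eps=1e-5):
        # numerical gradient keeps the sketch short (only 40 parameters)
        g = np.zeros_like(W)
        for idx in np.ndindex(*W.shape):
            W[idx] += eps; e_plus = energy(x)
            W[idx] -= 2 * eps; e_minus = energy(x)
            W[idx] += eps
            g[idx] = (e_plus - e_minus) / (2 * eps)
        return g

    lr = 0.05
    for step in range(2000):
        t = rng.uniform(0.0, 3 * np.pi)                # a training point on a spiral
        x_plus = 0.1 * t * np.array([np.cos(t), np.sin(t)])
        x_minus = x_plus + 0.3 * rng.normal(size=2)    # contrastive point nearby
        W -= lr * grad_W(x_plus)                       # pull the energy down at data
        W += lr * grad_W(x_minus)                      # push it up off the manifold

The sparsity-based alternative would drop the x_minus updates entirely and instead limit the information the code can carry, as in the sparse coding sketches above.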

For instance, the figure below shows the energy surface (in this case the energy is the reconstruction error) produced by different algorithms after training on a dataset of points sampled from a spiral (the points in magenta). We can see the energy surface produced by A) PCA, B) K-Means, C) a 2-1-2 auto-encoder neural net, D) a 2-20-2 auto-encoder neural net trained by maximum likelihood, E) a 2-20-2 auto-encoder neural net trained with a maximum margin loss, and F) a 2-20-2 network trained with a sparse coding algorithm. All these algorithms avoid flat energy surfaces, and the apparently very different shapes of these energies can be understood under the principle described above of explicit or implicit energy pull-up.


PAPERS
- A Tutorial on Energy-Based Learning
- A Unified Energy-Based Framework for Unsupervised Learning (AISTATS 2007)


AUTONOMOUS ROBOT NAVIGATION

The above-mentioned unsupervised learning algorithms have also been used to extract visual features for the DARPA project LAGR (Learning Applied to Ground Robots). The features have been used to perform long-range stereo, helping the robot navigate in unknown outdoor environments.




AUTOMATIC RECOGNITION OF BIOLOGICAL PARTICLES

While I was at Caltech, I worked on a system for automatic recognition of biological particles in microscopic images, such as pollen grains and cells in human specimens. The system is composed of (1) a detector, and (2) a classifier assigning each particle to its category. The figure shows the tool that was used to collect the pollen data. After acquiring an image from the microscope, an expert had to detect and label the pollen grains that were subsequently used to train the algorithm for automatic recognition. In this image there are two Chinese elm pollen grains; the other particles are just debris.


PAPERS
- Automatic Recognition of Biological Particles in Microscopic Images
