CSC2515, Fall 2004: Some project ideas
--------------------------------------

Regression & Classification Algorithms

-- Compare a Gaussian prior on the weights with adding noise to the training data. For example, compare using a quadratic ridge penalty on the weights in logistic regression to another way of regularizing it, namely adding an "epsilon ghost" of each training case: a new case with a very small weight, the same inputs, and the OPPOSITE output. (A rough sketch of this comparison appears at the end of this page.)

-- Compare the performance of the lasso, leaps-and-bounds variable subset selection (Furnival-Wilson), and ordinary ridge regression (with the ridge penalty chosen by cross-validation) on several regression benchmarks.

-- Come up with a way to do variable selection when there are too many dimensions to use leaps-and-bounds efficiently. Try out your new method and compare it to the lasso.

-- Investigate a "kernel lasso".

-- Apply the lasso trick (an absolute-value penalty on the weights) to classification in some new way.

Unsupervised Learning

-- Implement mixtures of fully observed trees and compare them to a single fully observed tree on modeling some data.

-- Investigate a mixture-of-factor-analyzers-type model in which there are a total of M factors, each one a vector of length equal to the dimension of the observations. However, for each training case add a "lasso-type" regularization penalty on the coefficients used to model that training case. This will force the model to set many latent variables to zero when explaining any given data case. Derive the inference rule for this model, and possibly (if you are keen) the EM learning algorithm. Apply it to at least one simple example (e.g. handwritten digits). This is an interesting example of sparsifying factor analysis, which I don't think anyone has looked at before.

-- Implement "unobserved naive Bayes", which assumes that, conditioned on a discrete, unobserved cluster variable, all the observed variables are conditionally independent. Apply it to some clustering problems. (A small EM sketch for binary data appears at the end of this page.)

Optimization

-- Quadratic programming: Implement either the multiplicative updates described in F. Sha, L. K. Saul, and D. D. Lee (2002), "Multiplicative updates for nonnegative quadratic programming in support vector machines," Technical Report MS-CIS-02-19, University of Pennsylvania, or the LARS algorithm described in Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2003), "Least angle regression," Annals of Statistics, and compare them to at least one other fast way of solving the QP problem for support vector machines and the lasso (e.g. SMO, or "Simple SVM"). If you want to do more theoretical work, you could also derive the extension of the updates to the case where the data is not linearly separable in the feature space. (A toy sketch of the multiplicative update appears at the end of this page.)

-- Implement a "limited-memory BFGS" optimizer, and compare it to gradient descent with momentum and to conjugate gradients on two fronts: (a) the speed at which it minimizes the training error (it should crush gradient descent on speed, and might be a bit faster than CG); (b) the generalization quality of the local minima that it finds (here you may discover that it overfits wildly and loses badly to gradient descent, and perhaps mildly to CG). (A small comparison script appears at the end of this page.)

-- Investigate the "split-and-merge" heuristic for improving the local minimum found by K-means, and compare it to the furthest-first heuristic and to random initialization. (A furthest-first initialization sketch appears at the end of this page.)

Applications

-- Train a logistic regression network to distinguish spam from non-spam email. There is some example spam data on the web in various places, including the web site for the "Elements of Statistical Learning" book.
-- Do vector quantization on a few adjacent columns of spectrograms made from clean speech of various speakers. Train one VQ per person, and use these to do speaker identification by seeing which VQ best reconstructs a test spectrogram. See me if you want some multi-person speech data. (A toy sketch appears at the end of this page.)

-- Train a simple HMM on character sequences from text. Train one HMM per language and use them to do language identification.

-- Use a "profile HMM" to perform multiple alignment on some discrete sequences.
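
Code sketches

The sketches below are rough, illustrative starting points only, not reference implementations. They are written in Python with NumPy/SciPy, and every data set, constant and parameter setting in them is a made-up placeholder.

For the first regression idea (quadratic ridge penalty versus "epsilon ghost" cases in logistic regression), here is a minimal sketch on synthetic data; the same fitting code could be pointed at the spam data mentioned under Applications. The learning rate, penalty strength and ghost weight are arbitrary choices, not recommendations.

    # Sketch: logistic regression with (a) a quadratic ridge penalty on the
    # weights versus (b) "epsilon ghost" regularization, where every training
    # case gets a low-weight copy with the same inputs and the OPPOSITE label.
    # Synthetic data stands in for real features (e.g. spam word counts).
    import numpy as np

    rng = np.random.RandomState(0)

    def make_fake_data(n=400, d=20):
        w_true = rng.randn(d)
        X = rng.randn(n, d)
        y = (X @ w_true + 0.5 * rng.randn(n) > 0).astype(float)
        return X, y

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

    def fit_logreg(X, y, case_weights=None, ridge=0.0, lr=0.1, iters=2000):
        """Weighted maximum likelihood with an optional ridge penalty,
        fit by plain batch gradient descent."""
        n, d = X.shape
        if case_weights is None:
            case_weights = np.ones(n)
        w = np.zeros(d)
        for _ in range(iters):
            p = sigmoid(X @ w)
            grad = X.T @ (case_weights * (p - y)) / case_weights.sum() + ridge * w
            w -= lr * grad
        return w

    X, y = make_fake_data()

    # (a) ridge-regularized logistic regression
    w_ridge = fit_logreg(X, y, ridge=0.1)

    # (b) "epsilon ghost" regularization: append each case again with its
    #     label flipped and a small case weight epsilon.
    eps = 0.05
    X_aug = np.vstack([X, X])
    y_aug = np.concatenate([y, 1.0 - y])
    case_w = np.concatenate([np.ones(len(y)), eps * np.ones(len(y))])
    w_ghost = fit_logreg(X_aug, y_aug, case_weights=case_w)

    for name, w in [("ridge", w_ridge), ("epsilon ghost", w_ghost)]:
        acc = np.mean((sigmoid(X @ w) > 0.5) == y)
        print(name, "norm =", round(float(np.linalg.norm(w)), 3),
              "train acc =", round(float(acc), 3))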
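
For the "unobserved naive Bayes" clustering idea, a minimal EM sketch for binary data, treating the model as a mixture of products of Bernoullis: the cluster variable is hidden, and given it the observed bits are independent. The number of clusters, smoothing constant and toy data are arbitrary.

    # Sketch: EM for "unobserved naive Bayes" on binary data, i.e. a mixture
    # model with a hidden cluster variable where, given the cluster, each
    # observed bit is an independent Bernoulli.  Toy data only.
    import numpy as np

    rng = np.random.RandomState(1)

    def em_unobserved_naive_bayes(X, K=3, iters=50, smooth=1e-3):
        n, d = X.shape
        pi = np.full(K, 1.0 / K)                   # mixing proportions
        theta = rng.uniform(0.25, 0.75, (K, d))    # P(x_j = 1 | cluster k)
        for _ in range(iters):
            # E-step: responsibilities r[i, k] = P(cluster k | x_i)
            log_p = (X @ np.log(theta).T
                     + (1 - X) @ np.log(1 - theta).T
                     + np.log(pi))
            log_p -= log_p.max(axis=1, keepdims=True)
            r = np.exp(log_p)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate mixing proportions and Bernoulli parameters
            Nk = r.sum(axis=0)
            pi = Nk / n
            theta = (r.T @ X + smooth) / (Nk[:, None] + 2 * smooth)
        return pi, theta, r

    # Toy binary data with two obvious clusters.
    X = np.vstack([rng.rand(100, 10) < 0.8,
                   rng.rand(100, 10) < 0.2]).astype(float)
    pi, theta, r = em_unobserved_naive_bayes(X, K=2)
    print("mixing proportions:", np.round(pi, 2))
    print("cluster of first and last case:", r[0].argmax(), r[-1].argmax())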
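
For the quadratic-programming idea, a toy sketch of the multiplicative update of Sha, Saul and Lee for nonnegative quadratic programs of the form minimize (1/2) v'Av + b'v subject to v >= 0; the SVM dual without a bias term can be written this way. This is only my reading of the update rule, so check it against the paper before relying on it, and note that the equality constraint that comes with a bias term is not handled here.

    # Sketch of a multiplicative update for nonnegative quadratic programming
    # (after Sha, Saul & Lee, 2002):
    #     minimize  0.5 * v' A v + b' v   subject to  v >= 0.
    # My reading of the update; verify the exact form against the paper.
    import numpy as np

    def nqp_multiplicative(A, b, iters=500, eps=1e-12):
        Ap = np.clip(A, 0, None)       # positive elements of A
        Am = np.clip(-A, 0, None)      # magnitudes of the negative elements
        v = np.ones(len(b))            # any strictly positive starting point
        for _ in range(iters):
            a = Ap @ v + eps           # eps guards against division by zero
            c = Am @ v
            # Each v_i is rescaled by a factor that equals 1 exactly when the
            # gradient (A v + b)_i is zero, and shrinks or grows it otherwise.
            v = v * (-b + np.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)
        return v

    # Tiny check with a known answer: A = I, b = (-1, 2)  =>  v* = (1, 0)
    A = np.eye(2)
    b = np.array([-1.0, 2.0])
    print(nqp_multiplicative(A, b))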
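
For the limited-memory BFGS idea, a small comparison script that uses SciPy's built-in L-BFGS and conjugate-gradient routines as stand-ins for your own implementations, against hand-rolled gradient descent with momentum, on a toy objective. The objective here is convex, so this only exercises the speed comparison; for the generalization half of the project you would swap in a non-convex model such as a small neural network. The learning rate and momentum constant are arbitrary.

    # Sketch: comparing L-BFGS, nonlinear conjugate gradients, and gradient
    # descent with momentum on the same training objective (a ridge-penalized
    # logistic regression loss on toy data).  SciPy's built-in optimizers
    # stand in for a hand-written limited-memory BFGS.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.RandomState(0)
    X = rng.randn(500, 50)
    y = (X @ rng.randn(50) > 0).astype(float)
    lam = 1e-2

    def loss_and_grad(w):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        loss += 0.5 * lam * w @ w
        grad = X.T @ (p - y) / len(y) + lam * w
        return loss, grad

    w0 = np.zeros(X.shape[1])

    # Built-in L-BFGS and CG; res.nfev counts function/gradient evaluations.
    for method in ["L-BFGS-B", "CG"]:
        res = minimize(loss_and_grad, w0, jac=True, method=method,
                       options={"maxiter": 200})
        print(method, "loss =", round(float(res.fun), 4), "evals =", res.nfev)

    # Plain gradient descent with momentum, for a comparable budget.
    w, velocity = w0.copy(), np.zeros_like(w0)
    lr, mom = 0.5, 0.9
    for _ in range(200):
        _, grad = loss_and_grad(w)
        velocity = mom * velocity - lr * grad
        w += velocity
    print("momentum loss =", round(float(loss_and_grad(w)[0]), 4), "evals = 200")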
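
For the K-means idea, a sketch of the furthest-first baseline only: the first centre is a random data point, and each subsequent centre is the data point furthest from the centres chosen so far. The split-and-merge heuristic itself is left to you, and the toy clusters here are made up.

    # Sketch: "furthest-first" initialization for K-means.  The first centre
    # is a random data point; each subsequent centre is the data point
    # furthest (in squared distance) from all centres chosen so far.
    import numpy as np

    def furthest_first_init(X, k, rng):
        centres = [X[rng.randint(len(X))]]
        for _ in range(k - 1):
            # distance of every point to its nearest already-chosen centre
            d = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
            centres.append(X[np.argmax(d)])
        return np.array(centres)

    rng = np.random.RandomState(0)
    # Three well-separated toy clusters; the chosen centres should land in
    # three different clusters.
    X = np.vstack([rng.randn(100, 2) + [5 * i, 0] for i in range(3)])
    print(furthest_first_init(X, 3, rng))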
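
For the vector-quantization speaker-identification idea, a toy sketch that uses SciPy's k-means as the codebook trainer. Random vectors stand in for stacked spectrogram columns, the speaker names are invented, and the codebook size is arbitrary.

    # Sketch: vector-quantization speaker identification.  One small codebook
    # (k-means centroids) is trained per speaker on spectrogram slices; a test
    # utterance is assigned to the speaker whose codebook reconstructs it with
    # the lowest error.  Random vectors stand in for real spectrogram columns.
    import numpy as np
    from scipy.cluster.vq import kmeans, vq

    rng = np.random.RandomState(0)

    def train_codebook(frames, k=16):
        """frames: (num_frames, dim) array of stacked spectrogram columns."""
        codebook, _ = kmeans(frames, k)
        return codebook

    def reconstruction_error(frames, codebook):
        _, dists = vq(frames, codebook)   # distance to the nearest codeword
        return np.mean(dists ** 2)

    # Toy "speakers", each with a different mean spectral shape.
    means = {name: rng.randn(64) for name in ["speaker_a", "speaker_b"]}
    train = {n: m + 0.3 * rng.randn(500, 64) for n, m in means.items()}
    test_frames = means["speaker_b"] + 0.3 * rng.randn(200, 64)

    codebooks = {n: train_codebook(f) for n, f in train.items()}
    errors = {n: reconstruction_error(test_frames, cb)
              for n, cb in codebooks.items()}
    print(errors)
    print("identified speaker:", min(errors, key=errors.get))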