Homework #1: Linear Models

The purpose of this assignment is to implement and understand the basic
learning algorithms for linear classifiers.

Your assignment should be sent by email to the TA.
- Send your email in plain text (no MS Word, no HTML, no PostScript, no PDF).
- Late submissions will be penalized.
- You must implement the code by yourself, but you are encouraged to
  discuss your results with other students.
- Include your source code as attachment(s).

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The following learning algorithms must be implemented:
- the perceptron algorithm
- direct solution of linear regression
- on-line (stochastic) linear regression
- on-line (stochastic) logistic regression
- regularized loss functions (ridge regression)

Two datasets are provided with which to test the algorithms. Both are
two-class classification problems from the UC Irvine machine learning
dataset repository (http://www.ics.uci.edu/~mlearn/MLRepository.html):

- pima-indians-diabetes: 768 samples, 8 variables. Prediction of diabetes
  in adult women of Pima Indian heritage (see pima-indians-diabetes.names
  for details).
- spambase: 4601 samples, 57 variables. Email spam filtering. Each sample
  is the description of an email message (presence of certain words,
  frequency of uppercase characters...) and the variable to predict is
  whether the message is spam or legitimate (see spambase.readme for
  details).

The data is available in two formats:
1 - the original UCI format (comma-separated values) in the files
    pima-indians-diabetes.data and spambase.data
2 - a Lush-readable format (a list of lists) in the files
    pima-indians-diabetes.ldat and spambase.ldat

Lush code to read and format the data, together with a skeleton of the
learning code, is provided. Using this code is not required, but it will
considerably shorten the amount of time you will have to spend on this
assignment. If you do not want to use Lush, you can use C/C++, Java,
Python, or Matlab. If you want to use another language, ask the TA for
permission. Please note that you will need some sort of library function
for solving linear systems.

If you choose to use Lush, you simply need to edit the file
"linear-classifier.lsh" and implement the "energy" and "learn-sample"
methods of the classes perceptron, logistic-reg, and linear-reg, and the
"mse-solve" method of linear-reg. Then edit and modify the script
homework-01.lsh to run the learning experiments.

1 - Implement:
    a - the perceptron learning algorithm (perceptron class)
        methods to implement: energy, loss, learn-sample
    b - linear regression with quadratic loss (linear-reg class)
        methods to implement: energy, loss, learn-sample, mse-solve
        learn-sample is the on-line/stochastic gradient descent version,
        while mse-solve is the direct solution through a linear system
        solver.
    c - the logistic regression algorithm (logistic-reg class)
        methods to implement: energy, loss, learn-sample
    ==> include your code as attachment.
    (An illustrative sketch of these update rules is given after part 2(A)
    below.)

2 - Experiments with the Pima database

(A) - IF YOU DO NOT USE LUSH, you must massage the data before using it as
described below. If you use Lush, you do not need to do anything, since
the Lush code provided here does that automatically.
    - Transform the desired values of the output variable so that they
      take the value -1 for one class and +1 for the other.
    - Shuffle the examples into a random order.
    - Transform the data so that each input variable has zero mean and
      unit variance.
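For reference, here is a minimal Python/NumPy sketch of the update rules
asked for in part 1 and of the preprocessing steps of part 2(A). It does
not use the provided Lush skeleton: the class and method names merely
mirror the ones listed above, the 0/1-to-(-1/+1) relabeling and the
default values of eta are illustrative assumptions, and "energy" is taken
here to be the raw linear output w.x + b (the Lush skeleton may define it
differently).

import numpy as np

def preprocess(X, y, rng):
    """Part 2(A): relabel to {-1,+1}, shuffle, normalize each input variable."""
    y = np.where(y > 0, 1.0, -1.0)            # assumes raw labels are 0/1
    perm = rng.permutation(len(y))            # shuffle into a random order
    X, y = X[perm], y[perm]
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit variance per variable
    return X, y

class Perceptron:
    def __init__(self, n): self.w, self.b = np.zeros(n), 0.0
    def energy(self, x):   return self.w @ x + self.b             # linear output
    def loss(self, x, y):  return max(0.0, -y * self.energy(x))   # perceptron criterion
    def learn_sample(self, x, y, eta=1.0):
        if y * self.energy(x) <= 0:           # update only on mistakes
            self.w += eta * y * x
            self.b += eta * y

class LinearReg:
    def __init__(self, n): self.w, self.b = np.zeros(n), 0.0
    def energy(self, x):   return self.w @ x + self.b
    def loss(self, x, y):  return 0.5 * (self.energy(x) - y) ** 2  # quadratic loss
    def learn_sample(self, x, y, eta=0.01):   # on-line (stochastic) gradient step
        err = self.energy(x) - y
        self.w -= eta * err * x
        self.b -= eta * err
    def mse_solve(self, X, Y):                # direct solution via a linear system solver
        Xb = np.hstack([X, np.ones((len(X), 1))])       # constant input for the bias
        theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ Y)    # normal equations
        self.w, self.b = theta[:-1], theta[-1]

class LogisticReg:
    def __init__(self, n): self.w, self.b = np.zeros(n), 0.0
    def energy(self, x):   return self.w @ x + self.b
    def loss(self, x, y):  return np.log(1.0 + np.exp(-y * self.energy(x)))  # logistic loss, +-1 labels
    def learn_sample(self, x, y, eta=0.01):
        g = -y / (1.0 + np.exp(y * self.energy(x)))     # d loss / d energy
        self.w -= eta * g * x
        self.b -= eta * g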
(B) - Train each algorithm with 8, 16, 32, 64, 128, 256, and 512 training
samples, and 256 test samples. Use the first examples for training and the
last 256 for testing (Lush code provided).
    - The on-line linear regression and logistic regression algorithms
      require you to find good values for the learning rate (the step
      size, eta).
    - Find which values of the learning rate cause divergence for each
      size of the training set.
    - Find the learning rate values that give the fastest convergence.
    - Implement a stopping criterion that detects convergence (one
      possible criterion is sketched at the end of this handout).
    ==> For each size of the training set, provide:
    - the final value of the average energy and the classification error
      on the training set and on the test set
    - the value of the learning rate used
    - the number of iterations performed
    - What is the asymptotic value of the training/test error for very
      large training sets?

3 - Modify your code for linear regression and logistic regression to add
the "ridge regression" regularizer lambda*||W||^2 to the cost function.
    ==> Can you improve the performance on the test set for small sizes of
    the training set by varying the value of lambda? If yes, provide the
    size of the training set and the value of lambda for which you get an
    improvement.

4 - Repeat some of the experiments of question 2, but DO NOT normalize the
input variables (do not set the mean to zero and the variance to 1; just
use the raw values). Simply remove the line (==> data normalize) from the
example provided.
    - Train with 256 samples and test with 256 samples.
    ==> Provide:
    - the final value of the average energy and the classification error
      on the training set and on the test set
    - the value of the learning rate used
    - the number of iterations performed
    - Why is the performance so much worse, the convergence so much
      slower, and the usable learning rate so much smaller?

5 - Repeat the experiments of question 2 with the spambase dataset.
    - Train each algorithm with 8, 16, 32, 64, 128, 256, 512, 1024, and
      2048 training samples, and 2000 test samples. Use the first examples
      for training and the last examples for testing (Lush code provided).
    ==> For each size of the training set, provide:
    - the final value of the average energy and the classification error
      on the training set and on the test set
    - the value of the learning rate used
    - the number of iterations performed
    - What is the asymptotic value of the training/test error for very
      large training sets?
    - Can you improve the performance on the test set for small sizes of
      the training set by varying the coefficient of the ridge
      regularizer? If yes, provide the size of the training set and the
      value of lambda for which you get an improvement.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
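One possible way to organize the experiments of parts 2(B) and 3 is
sketched below in Python/NumPy. The epoch loop, the convergence test on
the average loss, and the per-sample weight shrinkage implementing the
lambda*||W||^2 term are illustrative assumptions, not the required
implementation; tol, eta, lam, and max_epochs are placeholders to be tuned
for each training-set size, and the model objects are assumed to expose
the interface sketched after part 1.

import numpy as np

def train(model, Xtr, Ytr, eta=0.01, lam=0.0, tol=1e-6, max_epochs=1000):
    """Run stochastic-gradient epochs until the average loss stops improving."""
    prev = np.inf
    for epoch in range(1, max_epochs + 1):
        for x, y in zip(Xtr, Ytr):
            model.learn_sample(x, y, eta)
            if lam > 0.0:                    # ridge term: gradient of lambda*||W||^2
                model.w -= eta * 2.0 * lam * model.w
        avg = np.mean([model.loss(x, y) for x, y in zip(Xtr, Ytr)])
        if abs(prev - avg) < tol:            # convergence detected
            return epoch, avg
        prev = avg
    return max_epochs, avg

def error_rate(model, X, Y):
    """Classification error: sign of the linear output vs. the -1/+1 desired value."""
    preds = np.sign([model.energy(x) for x in X])
    return np.mean(preds != Y)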