2013Fall7646 Project 1

From Quantwiki
Jump to: navigation, search

Important Updates

You are allowed to use NumPy libraries for the linear regression component of this project.

You are NOT allowed to use other peoples' code to implement KNN. You must write that yourself.

Overview

You are to implement and evaluate a KNN learner class, and a Linear Regression class in Python named KNNLearner and LinRegLearner respectively.

Your class should implement the following functions/methods for the KNN learner:

learner = KNNLearner(k = 3)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

Where "k" is the number of nearest neighbors to find. Xtrain and Xtest should be ndarrays (numpy objects) where each row represents an X1, X2, X3... XN set of feature values. The columns are the features and the rows are the individual example instances. Y and Ytrain are single dimensional lists that indicate the value we are attempting to predict with X. For an example of how to code an object, take a look at QSTK/qstklearn/kdtknn.py

Similarly for the Linear Regression learner:

learner = LinRegLearner()
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

We are considering this a regression problem (not classification). So the goal is to return a continuous numerical result (not a discrete numerical result).

Some external resources that might be useful for this project:

The Data

You will find these two files in the Examples/KNN directory:

  • data-classification-prob.csv
  • data-ripple-prob.csv

Each data file contains 3 columns: X1, X2, and Y. Please don't be misled by the name "data-classification-prob.csv" in to treating the problem as classification. It's regression.

The columns X1,X2 and Y in data-classification-prob.csv represent one set of paired data and data-ripple-prob.csv represents another set. One intent of the assignment is for us to assess how well the learning algorithm works on these different data sets. In each case you should use the first 60% of the data for training, and the second 40% for testing.

Software to write

  • Create a python object called KNNLearner in a file named KNNLearner.py that implements the methods described above.
  • Create a python object called LinRegLearner in a file named LinRegLearner.py that implements the methods described above.
  • Create a separate python program called testlearner.py that evaluates your each of your learners in the following manner:
    • Selects the first 60% of the data for training (e.g., feed to addEvidence().
    • Use the remaining 40% for testing (e.g., query).
    • testlearner.py should evaluate the following for each learner:
      • Time required for training (average seconds per instance)
      • Time required for query (average seconds per instance)
      • RMS error (root mean square difference between "correct" Y and the result provided by query()). See http://en.wikipedia.org/wiki/Root_mean_square for example of how to calculate RMS error.
      • Correlation coefficient of the response from the learner versus the correct response (using the 40% out of sample data)

Experiments to run and charts to create

For the KNN Learner:

  • Vary K from 1 to 50
  • For each data set create a separate graph that reports K (as the horizontal axis) versus:
    • avg train time per instance,
    • avg query time per instance,
    • correlation
    • RMS error
  • Create a chart for each data set (using K for the horizontal axis) that compares in-sample error versus out-of sample error on the same chart (2 charts)
  • Scatter plots for each experiment that show predicted Y versus actual Y for for the "best" K using the out-of-sample data (2 charts).

For the LinReg Learner:

  • Average train time per instance (just a single number, not a chart)
  • Average query time per instance (just a single number, not a chart)

Note that there are 12 charts in total to create. It is OK if you choose to group some of the plots together, for instance to combine the charts for train time for both data sets. What is important is that you report all of the data.

Deliverables

Submit files (attachments) via t-square

  • Your code in KNNLearner.py, LinRegLearner.py and testlearner.py
  • A SINGLE Report (in a pdf file, report.pdf):
    • Include the 12 charts, and the data for LinReg required above.
    • Answer the following questions:
      • What is the "best" K for each dataset? Explain your reasoning. Note that there is not necessarily a single correct answer. I want to see your reasoning.
      • As K decreases, does overfitting occur for the datasets? At approximately which K does it start? Explain why you think this is occurring (or that it is not occurring).
  • Important: Disclose and cite any code or ideas you drew from others.

How to submit

Go to the t-square site for the class, then click on the "assignments" tab. Click on "add attachment" to add your 4 files. Once you are sure you've added the files, click "submit."

Hints

For the linear regression component, you can use numpy libraries, or other libraries as you wish. We suggest numpy.linalg.lstsq (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html).

In order to get a correct answer that includes the constant term (alpha) you need to append a column of 1s to your X matrix before you send it to lstsq.

Extra Credit

Write additional code, and add plots to your report that do the following:

  • Revise the program Examples/Basic/plot2ddata.py to query the learner from -1 to 1 in steps of .001 in each dimension (1 million queries) and plot the learned model for each dataset.
  • Write code to view the original data and the learned model in 3D.

Rubric

Start with 100. Points off as follows:

  • KNNLearner.py missing -50
  • LinRegLearner.py missing -10
  • testlearner.py missing -10
  • report.pdf missing -50
  • are all 12 charts/data series present? (-10 for each missing data series)
  • are charts approximately correct? (-5 for each error)
  • Answer to "best K" question: Up to 10 points off if completely wrong
  • Answer to "over fitting" question: Up to 10 points off if completely wrong
  • If the report indicates significant problems, check the KNN implementation, and:
    • KNN algorithm marginally incorrect -10
    • KNN algorithm significantly incorrect -30

Extra credit:

  • Part 1: Up to +2.5 points
  • Part 2: Up to +2.5 points

To get full extra credit, execution must be stellar.