2011Fall7646 Project 2


Overview

You are to implement and evaluate a KNN learner class in Python named KNNLearner.

Your class should implement the following functions/methods:

learner = KNNLearner(k = 3, method = "mean")
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

Here "k" is the number of nearest neighbors to find, and "method" is either "mean" or "median", selecting how the neighbors' Y values are combined into the reported answer. Xtrain and Xtest should be ndarrays (numpy objects) in which each row is one example instance and each column is one feature (X1, X2, X3, ... XN). Ytrain and the returned Y are one-dimensional lists holding the value we are attempting to predict from X. For an example of how to code an object, take a look at QSTK/qstklearn/kdtknn.py.

We are considering this a regression problem (not classification). So the goal is to return a continuous numerical result (not a discrete numerical result).
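To make the interface above concrete, here is a minimal sketch of one way such a class could look. It is not the reference solution: it assumes Euclidean distance and a brute-force search over all training rows (neither is specified above), and it is written for Python 3 with numpy.

```python
import numpy as np


class KNNLearner:
    """Brute-force k-nearest-neighbor regressor (illustrative sketch)."""

    def __init__(self, k=3, method="mean"):
        if method not in ("mean", "median"):
            raise ValueError("method must be 'mean' or 'median'")
        self.k = k
        self.method = method
        self.Xtrain = None
        self.Ytrain = None

    def addEvidence(self, Xtrain, Ytrain):
        # "Training" for KNN just stores the examples.
        self.Xtrain = np.asarray(Xtrain, dtype=float)
        self.Ytrain = np.asarray(Ytrain, dtype=float)

    def query(self, Xtest):
        Xtest = np.asarray(Xtest, dtype=float)
        # Squared Euclidean distance from every test row to every training
        # row, shape (n_test, n_train). The distance metric is an assumption.
        diffs = Xtest[:, np.newaxis, :] - self.Xtrain[np.newaxis, :, :]
        dists = (diffs ** 2).sum(axis=2)
        # Indices of the k closest training rows for each test row.
        nearest = np.argsort(dists, axis=1)[:, :self.k]
        neighbor_y = self.Ytrain[nearest]
        # Combine the neighbors' Y values into one continuous prediction.
        if self.method == "mean":
            return neighbor_y.mean(axis=1)
        return np.median(neighbor_y, axis=1)
```

The full distance matrix is simple but memory-hungry for large inputs; a tree-based search is a common alternative when query time matters.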

Some external resources that might be useful for this project:

The Data

You will find these two files in the Examples/KNN directory:

  • data-classification-prob.csv
  • data-ripple-prob.csv

Each data file contains 3 columns: X1, X2, and Y. Please don't be misled by the name "data-classification-prob.csv" into treating the problem as classification. It's regression.

The columns X1, X2 and Y in data-classification-prob.csv represent one set of paired data, and those in data-ripple-prob.csv represent another. One intent of the assignment is to assess how well the learning algorithm works on these different data sets. In each case you should use the first 60% of the data for training and the remaining 40% for testing.
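The 60/40 split described above could be sketched as follows. The helper name load_and_split is hypothetical, and the code assumes the files are comma-separated with no header row; adjust the np.loadtxt arguments if the actual files differ.

```python
import numpy as np


def load_and_split(path, train_frac=0.6):
    """Read a CSV whose last column is Y and split it in file order."""
    # Assumes comma-separated numeric data with no header row.
    data = np.loadtxt(path, delimiter=",")
    # round() avoids losing a row to floating-point truncation.
    n_train = int(round(train_frac * data.shape[0]))
    # First 60% of rows for training, the rest for testing (no shuffling).
    Xtrain, Ytrain = data[:n_train, :-1], data[:n_train, -1]
    Xtest, Ytest = data[n_train:, :-1], data[n_train:, -1]
    return Xtrain, Ytrain, Xtest, Ytest
```

For example, `load_and_split("data-ripple-prob.csv")` would return the training and test arrays for the ripple data set, given that file in the working directory.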

Software to write

  • Create a Python class called KNNLearner in a file named KNNLearner.py that implements the methods described above.
  • Create a separate Python program called testlearner.py that evaluates your KNNLearner in the following manner:
    • Select the first 60% of the data for training (i.e., feed it to addEvidence()).
    • Use the remaining 40% for testing (i.e., pass it to query()).
    • testlearner.py should evaluate the following:
      • Time required for training (average seconds per instance)
      • Time required for query (average seconds per instance)
      • Correlation coefficient of the response from the learner versus the correct response (using the 40% out of sample data)
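The three measurements listed above could be collected along these lines. This is a sketch, not a required structure: the function name evaluate is hypothetical, it accepts any object with addEvidence and query methods, and it assumes Python 3 (for time.perf_counter).

```python
import time

import numpy as np


def evaluate(learner, Xtrain, Ytrain, Xtest, Ytest):
    """Return (train sec/instance, query sec/instance, correlation)."""
    start = time.perf_counter()
    learner.addEvidence(Xtrain, Ytrain)
    # Average training time per training instance.
    train_time = (time.perf_counter() - start) / len(Xtrain)

    start = time.perf_counter()
    Ypred = learner.query(Xtest)
    # Average query time per test instance.
    query_time = (time.perf_counter() - start) / len(Xtest)

    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
    # Pearson correlation between predictions and true values.
    corr = np.corrcoef(Ypred, Ytest)[0, 1]
    return train_time, query_time, corr
```

Because KNN training only stores the data, the training time per instance is typically tiny compared with the query time; timing a single call can be noisy, so averaging over a few runs is worth considering.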

Experiments to run

  • Run your test on both data sets and report, for each data set, the average training time per instance, the average query time per instance, and the correlation coefficient. Use K=3 and method = "mean" for this evaluation.
  • For each data set, test values of K from K=1 up to a large number to find the value of K that provides the best correlation coefficient.
  • Plot correlation versus K for each data set and report K and the correlation for the best solution.
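The search over K described above might be organized as in the sketch below. The names sweep_k and make_learner are hypothetical; make_learner is any callable that builds a fresh learner for a given K, and the upper limit of the search is left to the caller.

```python
import numpy as np


def sweep_k(make_learner, Xtrain, Ytrain, Xtest, Ytest, k_values):
    """Return (best_k, best_corr, corrs) over the candidate k_values."""
    corrs = []
    for k in k_values:
        learner = make_learner(k)
        learner.addEvidence(Xtrain, Ytrain)
        Ypred = learner.query(Xtest)
        # Out-of-sample Pearson correlation for this K.
        corrs.append(np.corrcoef(Ypred, Ytest)[0, 1])
    corrs = np.array(corrs)
    # nanargmax skips NaN, which corrcoef yields for constant predictions.
    best = int(np.nanargmax(corrs))
    return k_values[best], corrs[best], corrs
```

With the class from this assignment, the factory could be `lambda k: KNNLearner(k=k, method="mean")`. The returned corrs array lines up with k_values, so it can be passed straight to a plotting library (for instance matplotlib's `plt.plot(k_values, corrs)` followed by `plt.savefig("plot1.pdf")`).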

Deliverables

Submit files (attachments) via t-square

  • Your code in KNNLearner.py and testlearner.py
  • Plots of the correlation coefficient versus K for data-classification-prob.csv and data-ripple-prob.csv, as you vary K from 1 to a large number: plot1.pdf and plot2.pdf
  • Report (in a pdf file, report.pdf): the best K for each data set (the one that gives the highest correlation coefficient), the corrcoef for each data set, and the timing results (using K=3 and "mean")
  • Disclose and cite any code or ideas you drew from others.

How to submit

Go to the t-square site for the class, then click on the "assignments" tab. Click on "add attachment" to add your 5 files (KNNLearner.py, testlearner.py, plot1.pdf, plot2.pdf, and report.pdf). Once you are sure you've added the files, click "submit."