2013Fall7646 Project 2


Important Updates

You are allowed to use NumPy libraries for the linear regression component of this project.

You are NOT allowed to use other people's code to implement Random Forests. You must write that yourself.

Overview

You are to implement and evaluate a Random Forest learner class (with an interface similar to your linear regression and KNN learners). Your implementation should be based on the method described by Cutler (and by Tucker in class). See the link to the paper below.

Your class should implement the following functions/methods for the Random Forest learner:

learner = RandomForestLearner(k = 3)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

Where "k" is the number of trees to create in your forest. Xtrain and Xtest should be ndarrays (numpy objects) where each row represents a set of feature values X1, X2, X3, ... XN. The columns are the features and the rows are the individual example instances. Y and Ytrain are single-dimensional lists that indicate the value we are attempting to predict with X. For an example of how to code an object, take a look at QSTK/qstklearn/kdtknn.py.

We are considering this a regression problem (not classification). So the goal is to return a continuous numerical result (not a discrete numerical result).
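
A minimal skeleton of that interface might look like the following (the method bodies here are placeholders, not the required implementation):

class RandomForestLearner:

    def __init__(self, k=3):
        self.k = k        # number of trees in the forest
        self.trees = []   # each tree stored as a 4-column ndarray (see below)

    def addEvidence(self, Xtrain, Ytrain):
        # build self.k trees from the training data
        # (see the recommendations below)
        pass

    def query(self, Xtest):
        # return one continuous prediction per row of Xtest,
        # e.g. the mean of the k per-tree predictions
        pass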

Some external resources that might be useful for this project:

Simplifications & Clarifications

For reference/discussion, assume that Xij is the value of feature i at row j of the data, and Yj is the target value to learn. So our data looks like:

X11 X21 X31 Y1
X12 X22 X32 Y2
X13 X23 X33 Y3
...

In order to make the project a bit easier, we will make the following simplifications to Cutler's algorithm:

  • For choosing the feature to split on: select it randomly.
  • For choosing SplitVal: choose two random rows in the data (two separate j's), then take the mean of the two X values. As an example, suppose we're splitting on feature 2 and our random rows are 6 and 8; then SplitVal = (X26 + X28) / 2. (See the sketch after this list.)
  • If you encounter data where all the values for a particular feature are the same (and you therefore can't split the data), convert the node to a leaf node whose value is the mean of the Ys.
  • When you train, you should send 60% of the data to addEvidence() for the random forest.
  • "Bagging" is not a required element of this project. It has been moved to extra credit.
  • Pruning is not a required element of this project.
  • As other questions arise, I will add answers here -- (Tucker)
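
A minimal sketch of this split rule, assuming the training data is an ndarray whose rows are examples and whose last column is Y (the function name choose_split is illustrative, not required):

import numpy as np

def choose_split(data):
    feature = np.random.randint(0, data.shape[1] - 1)     # random feature index
    r1, r2 = np.random.randint(0, data.shape[0], size=2)  # two random rows
    split_val = (data[r1, feature] + data[r2, feature]) / 2.0
    return feature, split_val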

Recommendations on How to Implement This

First create a single function/method that reads a tree in from a CSV file and then allows you to query the tree. Make sure you have this component rock solid before you start work on the other components. Use the representation for trees that I have discussed in class: four columns, where the columns represent the following (a query sketch follows the list):

  • Feature number (0 and greater, or -1 if it is a leaf)
  • SplitVal (or leaf value if it is a leaf)
  • Index/pointer to root of left tree
  • Index/pointer to root of right tree
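
Here is a rough sketch of reading and querying that representation. It assumes a hypothetical tree.csv with one node per row, and that the left/right columns hold row offsets relative to the current node; if your in-class convention uses absolute row indices instead, adjust accordingly:

import numpy as np

def query_tree(tree, point):
    # tree: ndarray with columns [feature, SplitVal, left, right];
    # feature == -1 marks a leaf, whose value sits in the SplitVal column
    node = 0
    while tree[node, 0] != -1:
        if point[int(tree[node, 0])] <= tree[node, 1]:
            node += int(tree[node, 2])  # descend into the left subtree
        else:
            node += int(tree[node, 3])  # descend into the right subtree
    return tree[node, 1]                # leaf value

tree = np.genfromtxt('tree.csv', delimiter=',')  # one node per CSV row
print(query_tree(tree, [0.5, 0.5]))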

Next, build a function that can learn a tree from data. Feed it some very simple examples and make sure it generates correct trees.
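
One way such a builder might look, combining the simplifications above with the same relative-offset convention as the query sketch (treating an empty split side as a leaf is one simple fallback; you could also redraw the split):

import numpy as np

def build_tree(data):
    # data: ndarray whose rows are examples and whose last column is Y;
    # returns a 4-column ndarray: [feature, SplitVal, left, right]
    if data.shape[0] == 1:
        return np.array([[-1, data[0, -1], -1, -1]])         # one example: leaf
    feature = np.random.randint(0, data.shape[1] - 1)        # random feature
    if np.all(data[:, feature] == data[0, feature]):
        return np.array([[-1, data[:, -1].mean(), -1, -1]])  # can't split: leaf
    r1, r2 = np.random.randint(0, data.shape[0], size=2)
    split_val = (data[r1, feature] + data[r2, feature]) / 2.0
    left_mask = data[:, feature] <= split_val
    if left_mask.all() or not left_mask.any():
        return np.array([[-1, data[:, -1].mean(), -1, -1]])  # degenerate split: leaf
    left = build_tree(data[left_mask])
    right = build_tree(data[~left_mask])
    root = np.array([[feature, split_val, 1, left.shape[0] + 1]])
    return np.vstack((root, left, right))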

Next, build on these components to create a learner that learns multiple trees (a forest).

Your addEvidence() function should train multiple trees (K of them, in fact), each on a random selection of 60% of the data.
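
A sketch of that loop, reusing the hypothetical build_tree() and query_tree() from the sketches above (here each tree's 60% sample is drawn without replacement; sampling with replacement is the "bagging" extra credit):

import numpy as np

def train_forest(Xtrain, Ytrain, k):
    data = np.column_stack((Xtrain, Ytrain))  # append Y as the last column
    n_sample = int(0.6 * data.shape[0])       # 60% of the rows per tree
    trees = []
    for _ in range(k):
        rows = np.random.permutation(data.shape[0])[:n_sample]
        trees.append(build_tree(data[rows]))
    return trees

def query_forest(trees, Xtest):
    # average the per-tree predictions to get a continuous result
    return [np.mean([query_tree(t, p) for t in trees]) for p in Xtest]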

The Data

We will use the same data as in the KNN assignment. You will find these two files in the Examples/KNN directory:

  • data-classification-prob.csv
  • data-ripple-prob.csv

Each data file contains 3 columns: X1, X2, and Y. Please don't be misled by the name "data-classification-prob.csv" into treating the problem as classification. It's regression.

The columns X1, X2, and Y in data-classification-prob.csv represent one set of paired data, and data-ripple-prob.csv represents another set. One intent of the assignment is for us to assess how well the learning algorithm works on these different data sets. In each case you should use the first 60% of the data for training, and the remaining 40% for testing.
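
A sketch of loading and splitting one of the files, assuming they contain no header row:

import numpy as np

data = np.genfromtxt('data-ripple-prob.csv', delimiter=',')
split = int(0.6 * data.shape[0])                      # first 60% for training
Xtrain, Ytrain = data[:split, :-1], data[:split, -1]
Xtest, Ytest = data[split:, :-1], data[split:, -1]    # last 40% for testing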

Software to write

  • Create a Python object called RandomForestLearner in a file named RandomForestLearner.py that implements the methods described above.
  • Reuse your LinRegLearner and KNN learner.
  • Reuse your Python program called testlearner.py that evaluates each of your learners in the following manner (a sketch of the metrics follows this list):
    • Select the first 60% of the data for training (i.e., feed it to addEvidence()).
    • Use the remaining 40% for testing (i.e., query()).
    • testlearner.py should evaluate the following for each learner:
      • RMS error (root mean square difference between the "correct" Y and the result provided by query()). See http://en.wikipedia.org/wiki/Root_mean_square for an example of how to calculate RMS error.
      • Correlation coefficient of the response from the learner versus the correct response (using the 40% out-of-sample data).
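
A sketch of both metrics, assuming Xtest and Ytest come from the split described above and learner is any of the three learners:

import numpy as np

Ypred = np.asarray(learner.query(Xtest))
rmse = np.sqrt(np.mean((Ytest - Ypred) ** 2))  # RMS error
corr = np.corrcoef(Ytest, Ypred)[0, 1]         # correlation coefficient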

Experiments to run

You should create four charts where you vary K from 1 to 100. For KNN, K is the number of nearest neighbors; for the forest learner, K is the number of trees. Each chart should include two lines (one for KNN, the other for RF). A plotting sketch follows the chart list.

  • RMS error for data set 1
  • Correlation for data set 1
  • RMS error for data set 2
  • Correlation for data set 2
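
One way the sweep might look for a single data set and metric, assuming your KNNLearner from the previous project also takes k in its constructor and that matplotlib is available:

import numpy as np
import matplotlib.pyplot as plt

ks = range(1, 101)
knn_rmse, rf_rmse = [], []
for k in ks:
    for learner, results in ((KNNLearner(k=k), knn_rmse),
                             (RandomForestLearner(k=k), rf_rmse)):
        learner.addEvidence(Xtrain, Ytrain)
        Ypred = np.asarray(learner.query(Xtest))
        results.append(np.sqrt(np.mean((Ytest - Ypred) ** 2)))

plt.plot(ks, knn_rmse, label='KNN')
plt.plot(ks, rf_rmse, label='Random Forest')
plt.xlabel('K')
plt.ylabel('RMS error')
plt.legend()
plt.savefig('rmse-dataset1.png')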

To turn in

  • Your report in the file report.pdf. In this report you should discuss and explain the following:
    • Did you see improved performance using more trees in one data set or the other (or both)?
    • If there was a difference, explain why you think the improvement is greater for one data set than the other.
    • Now that you have compared KNN, linear regression and Random Forests, which approach do you think is best, and why?
  • Your code in RandomForestLearner.py

Extra Credit

Write additional code to implement the following additional features for Random Forests. You can choose to do some of these, none, or all.

  • For each of the below, you should implement the feature, then assess its value compared to Random Forests without the feature. Create a chart of k (number of trees) versus correlation as the metric. Note that the benefits of some of the features will not be seen until the number of trees becomes larger.
    • Implement Linear Regression learners at the leaves.
    • Implement "bagging"
    • Implement "boosting"
    • Devise and implement some other method of improving performance (e.g., pruning).
    • Add a parameter to the constructor called leafsize that causes up to that many elements to be combined into each leaf.

Rubric

  • RandomForestLearner.py missing -50
  • report.pdf missing -50
  • are all 4 charts/data series present? (-20 for each missing data series)
  • are charts approximately correct? (-5 for each error)
  • Answer to "do you see improvement?" question: Up to 10 points off if completely wrong
  • Answer to "if it was better, why?" question: Up to 10 points off if completely wrong
  • Answer to "which approach is best?" question: Up to 10 points off if completely wrong
  • If significant problems, check the KNN implementation, and:
    • RF algorithm marginally incorrect -10
    • RF algorithm significantly incorrect -30
  • Extra credit:
    • Up to +5 points
    • To get full extra credit, execution must be stellar.