2013Fall7646 Project 3

Overview: Build a Stock Price Forecaster

In this project you will create and train a learner to forecast 5 day future returns of stock prices. Whoo Hoo! The main difference between this project and the other two machine learning projects is that we will be using the learner to predict time series data -- stock prices are time series data. In order to make this homework a little more satisfying we're going to start with some manufactured data that is more predictable than real stock data. Once you cut your teeth on that data, I encourage you to then apply your algorithm to real stock data to see how well it works (you can do that for extra credit if you like).

Summary of the task for you:

  • Create a learning system that reads many example files of historical time series data.
  • Build a model from that data that predicts 5-day returns (expressed in the outline below as the price of a stock 5 days from the present).
  • Test the model on two separate data files.
  • Plot and analyze the results.

Aspects of the project that are up to you:

  • Which learner you would like to use. But you must use one of KNN, Random Forests, or LinReg.
  • How many features / technical indicators to use.
  • How you formulate the problem for the learner. In other words you choose what are values for X1, X2, X3, etc., and Y. (Some suggestions are given in class).

Aspects of the project that are not up to you:

  • Choice of training and test data. You must use the data sets we specify.
  • Method of assessing your learner. (Defined below).

The Data

The data is formatted in the exact same way as the historical stock price data we have used previously.

Date	Open	High	Low	Close	Volume	Adj Close
9/12/12	203.52	204.65	202.96	203.77	3284000	203.77
9/11/12	200.55	203.46	200.51	203.27	3910600	203.27
9/10/12	199.39	201.82	198.73	200.95	4208000	200.95
9/7/12	199.12	199.5	198.08	199.5	3413700	199.5
9/6/12	196.26	199.46	196.11	199.1	3931700	199.1

You will be provided 400 such files with names ranging from: ML4T-000.csv to ML4T-399.csv. You can deposit them directly in your QSTK data directory and read them using the QSTK data access methods.
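
If you want to look at a file outside of QSTK, here is a minimal sketch that uses only the Python standard library. It assumes the files are comma-separated with the header shown above and that rows are listed newest first, as in the sample; the function name read_adj_close is made up for this illustration and is not part of QSTK.

```python
import csv
import io

def read_adj_close(handle):
    """Return (dates, adj_close) in chronological order from a price CSV."""
    rows = list(csv.DictReader(handle))
    # Assumption: rows are newest first (as in the sample), so reverse them.
    rows.reverse()
    dates = [row["Date"] for row in rows]
    prices = [float(row["Adj Close"]) for row in rows]
    return dates, prices

# Three rows copied from the sample table, written as comma-separated text.
sample = io.StringIO(
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "9/12/12,203.52,204.65,202.96,203.77,3284000,203.77\n"
    "9/11/12,200.55,203.46,200.51,203.27,3910600,203.27\n"
    "9/10/12,199.39,201.82,198.73,200.95,4208000,200.95\n"
)
dates, prices = read_adj_close(sample)
```

For a real file you would pass open("ML4T-000.csv") instead of the in-memory sample.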

We will use the first 100 files (000 to 099) for training, and the other files for testing.

The data is available here: http://www.quantsoftware.org/CompInvestI-Files/proj3-data.zip

Recommended Structure of Your Program

I recommend that your program follow this outline (pseudo code):

Model building part: Read the training data. The data is formatted in the same way as the stock data, so you can drop the CSV files in the stock data directory and read them in the same way. Build training examples/tuples (i.e., <X1, X2, X3, … Y>) from the data.

  • for file in training files
    • for today in dates
      • use data from today and before only to compute X1, X2, X3, …
      • Y = price[today + 5]
      • record tuple <X1, X2, X3, … Y>
  • learner.addEvidence(X, Y)
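
The model building loop above might look roughly like this in Python. This is only a sketch: the two features (5-day momentum and price relative to a 10-day mean) and the 10-day look-back are placeholders I picked for illustration, not required indicators, and each series is assumed to be a plain list of prices in chronological order.

```python
LOOKBACK = 10   # hypothetical look-back window, in trading days
HORIZON = 5     # forecast horizon from the assignment

def features(prices, today):
    """Two illustrative features computed from prices[:today + 1] only."""
    window = prices[today - LOOKBACK + 1 : today + 1]
    momentum = prices[today] / prices[today - HORIZON] - 1.0
    rel_to_mean = prices[today] / (sum(window) / len(window)) - 1.0
    return [momentum, rel_to_mean]

def build_examples(price_series_list):
    """Return (X, Y) lists pooled over every training series."""
    X, Y = [], []
    for prices in price_series_list:
        # Start once the look-back window is full; stop while today + 5 exists.
        for today in range(LOOKBACK - 1, len(prices) - HORIZON):
            X.append(features(prices, today))
            Y.append(prices[today + HORIZON])
    return X, Y

# Toy series standing in for one training file.
series = [float(100 + i) for i in range(20)]
X, Y = build_examples([series])
```

The resulting X and Y are what you would hand to learner.addEvidence(X, Y).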

Prediction and assessment part: Read the specific file you are to predict and test your learner

  • read in the file
  • for today in dates
    • use data from today and before only to compute X1, X2, X3, …
    • Ypredict = learner.query(X)
    • Yactual = price[today+5]
    • record <Ypredict, Yactual> for later assessment
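
Here is one way the prediction and assessment loop could be sketched. TinyKNN is a made-up stand-in that only mimics the addEvidence/query interface used above; it is not the class learner, and the feature function last_two is likewise hypothetical. The demo queries the same toy series it was trained on purely to exercise the loop, which is not how you should assess your real learner.

```python
import math

class TinyKNN:
    """Hypothetical stand-in learner with the addEvidence/query interface."""
    def __init__(self, k=3):
        self.k = k
        self.X, self.Y = [], []

    def addEvidence(self, X, Y):
        self.X.extend(X)
        self.Y.extend(Y)

    def query(self, x):
        # Average the Y values of the k nearest stored points (Euclidean distance).
        ranked = sorted(zip(self.X, self.Y), key=lambda pair: math.dist(pair[0], x))
        nearest = ranked[: self.k]
        return sum(y for _, y in nearest) / len(nearest)

def assess(learner, prices, features, lookback, horizon=5):
    """Return a list of (Ypredict, Yactual) pairs for one price series."""
    results = []
    for today in range(lookback - 1, len(prices) - horizon):
        x = features(prices, today)   # must use data from today and before only
        results.append((learner.query(x), prices[today + horizon]))
    return results

def last_two(prices, today):
    """Illustrative features: today's price and the one-day change."""
    return [prices[today], prices[today] - prices[today - 1]]

train = [float(100 + i) for i in range(30)]
learner = TinyKNN(k=1)
days = range(1, len(train) - 5)
learner.addEvidence([last_two(train, t) for t in days], [train[t + 5] for t in days])
pairs = assess(learner, train, last_two, lookback=2)
```

Each entry of pairs is one (Ypredict, Yactual) record that you can chart or score later.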

You may find the example code in QSTK/Examples/Features/featuretest.py useful for hints on aspects of this project.

Training Data and Testing Data

Use files 000 to 099 for training.

You will use two separate files for out of sample testing:

  • ML4T-292.csv (everyone will use this one, and we are figuring it out at the moment)
  • ML4T-1XX.csv (where you calculate XX as described below).

XX = number of the first letter of your first name + number of the first letter of your last name

A = 1, B = 2, …, Z = 26

So, for instance, with the initials T and B, XX = 20 + 2 = 22, so my file number would be 100 + 22 = 122 (ML4T-122.csv).
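
If you would like to double check your number, the rule can be written as a few lines of Python. The function name personal_test_file is just a label for this sketch.

```python
def personal_test_file(first_name, last_name):
    """Return the personal test file name for the given first and last names."""
    def letter_number(name):
        return ord(name[0].upper()) - ord("A") + 1   # A = 1, ..., Z = 26
    number = 100 + letter_number(first_name) + letter_number(last_name)
    return "ML4T-%03d.csv" % number

my_file = personal_test_file("T", "B")
```

Since each letter contributes between 1 and 26, the result always falls between ML4T-102.csv and ML4T-152.csv.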

To Plot and Calculate

For each set of test data (you will have 2), create these charts:

  • Chart 1: Shows the first 100 days of Yactual and Ypredict
  • Chart 2: Shows the last 100 days of Yactual and Ypredict
  • Chart 3: Scatterplot of Ypredict versus Yactual (one for each data set)
  • Chart 4: Shows the values of your first 5 features for the first 100 days all on the same chart

Please chart the Yactual in blue, and Ypredict in red.

Calculate the average error (RMS) and correlation for each of the data sets (a total of 4 numbers).
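
As a sketch of the two numbers per data set, here they are in plain Python. I am assuming "correlation" means the Pearson correlation coefficient between Ypredict and Yactual; the function names are placeholders.

```python
import math

def rms_error(predicted, actual):
    """Root mean square of the prediction errors."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def correlation(predicted, actual):
    """Pearson correlation coefficient between predictions and actual values."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)

# Toy data: predictions are exactly half of the actual values.
ypredict = [1.0, 2.0, 3.0, 4.0]
yactual = [2.0, 4.0, 6.0, 8.0]
err = rms_error(ypredict, yactual)
corr = correlation(ypredict, yactual)
```

Note that the toy predictions are perfectly correlated with the actual values even though the RMS error is large, which is why both numbers are worth reporting.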


Submit these files:

  • forecaster.py: a Python program that produces the output for the charts listed above.
  • report.pdf: a PDF including:
    • Text describing the method you used for learning, to include indicators you created and the learning method you used.
    • The 8 charts you created
    • Why you think your method worked well (or did not)

Extra Credit

  • Assess your learner more deeply (scatter plots, correlation, etc.)
  • Create a parameterized model (instead of the instance based that we used). Compare the results of this model to the instance based approach.
  • Try it on real stock data.
  • Try it with different indicators.
  • Try it on real stock data and generate trades that you test in your market simulator.
  • Learn while running on the test set.

How to submit

Go to the t-square site for the class, then click on the "assignments" tab. Click on "add attachment" to add your files. Once you are sure you've added the files, click "submit."


I will create a blog post in which I will list the names of the 10 best performers in this final project!


The project includes 3 main components: 70% for code and report, 10% for correlation with your dataset, 20% for correlation with the global dataset.

Part 1: 70% of the project:

  • forecaster.py missing -50
  • report.pdf missing -50
  • are all 8 charts/data series present? (-20 for each missing data series)
  • are charts approximately correct? (-5 for each error)
    • note that Ypredict should not appear in the "first 100 days" chart until the look back period is complete
  • "Text describing the method you used for learning, to include indicators you created and the learning method you used." Up to 20 points off if completely wrong
  • "Why you think your method worked well (or not)": Up to 10 points off if completely wrong

Part 2: 10% of the project

  • 10 points if correlation is > 0.3
  • 5 points if correlation is > 0.1
  • 0 points otherwise

Part 3: 20% of the project

  • 20 * (correlation + 0.1). So a correlation of .9 or better gets full credit.
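
To see how the correlation-based points in Parts 2 and 3 work out, here is a small sketch of the two rules. The cap at 20 follows from "0.9 or better gets full credit"; the floor at 0 for strongly negative correlations is my assumption, since the rule does not say what happens there.

```python
def part2_points(correlation):
    """Points for correlation on your own dataset."""
    if correlation > 0.3:
        return 10
    if correlation > 0.1:
        return 5
    return 0

def part3_points(correlation):
    """Points for correlation on the global dataset: 20 * (correlation + 0.1)."""
    raw = 20 * (correlation + 0.1)
    # Capped at 20; the floor at 0 is an assumption, not stated in the rule.
    return max(0.0, min(20.0, raw))

example_total = part2_points(0.2) + part3_points(0.4)
```

With a correlation of 0.2 on your own file and 0.4 on the global file, this gives 5 + 10 = 15 of the 30 correlation points.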

Extra credit:

  • Up to +5 points
  • To get full extra credit, execution must be stellar.