2014Fall7646 Project 4

From Quantwiki

Overview: Build a Stock Price Forecaster

In this project you will create and train a learner to forecast 5 day future returns of stock prices. Whoo Hoo! The main difference between this project and the other machine learning projects is that we will be using the learner to predict time series data -- stock prices are time series data. In order to make this homework a little more satisfying we're going to start with some manufactured data that is more predictable than real stock data. Once you cut your teeth on that data, for extra credit you can apply your algorithm to real stock data to see how well it works.

Summary of the task for you:

  • Create a learning system that reads many example files of historical time series data.
  • Build a model from that data that predicts 5 day returns (i.e. the price of a stock 5 days from the present).
  • Test the model on two separate data files.
  • Plot and analyze the results.

Aspects of the project that are up to you:

  • Which learner you would like to use. But you must use one of KNN, Random Forests, or LinReg.
  • How many features / technical indicators to use.
  • How you formulate the problem for the learner. In other words, you choose what the values of X1, X2, X3, etc., and Y represent. (Some suggestions are given in class.)

Aspects of the project that are not up to you:

  • Choice of training and test data. You must use the data sets we specify. (For instance, we are not allowing parameterized models -- explained in class).
  • Method of assessing your learner. (Defined below).

The Data

You will be provided 400 data files with names ranging from ML4T-000.csv to ML4T-399.csv. You can deposit them directly in your QSTK data directory and read them using the QSTK data access methods. The data is formatted in exactly the same way as the historical stock price data we have used previously.

We're going to split the data up in two ways: 1) Which symbols to use, 2) Which dates to use.

  • For training, use:
    • the first 200 files, 000 to 199, named ML4T-000.csv to ML4T-199.csv
    • dates: Jan 1, 2001 to Dec 31, 2005.
  • For testing use:
    • ML4T-292.csv, and ML4T-3XX.csv (where XX is defined below).
    • dates: Jan 1, 2006 to Dec 31, 2007.

For the second testing file, XX = number of the first letter of your first name + number of the first letter of your last name, where

A = 1, B = 2, …, Z = 26

So, for instance, my number would be T + B = 20 + 2 = 22, which gives file number 300 + 22 = 322 (ML4T-322.csv).
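
The file-number rule can be checked with a short Python sketch. The function names here are hypothetical; the letter values and the leading 3 come from the rule above, and the ML4T- prefix is assumed to match the other data files.

```python
def letter_value(letter):
    """Map A..Z (case-insensitive) to 1..26."""
    return ord(letter.upper()) - ord("A") + 1


def second_test_file(first_name, last_name):
    """Return the second test file name for the given first and last names."""
    xx = letter_value(first_name[0]) + letter_value(last_name[0])
    # The file number is 3XX, i.e. 300 plus the sum of the two letter values.
    return "ML4T-%03d.csv" % (300 + xx)


print(second_test_file("T", "B"))
```

The largest possible sum is Z + Z = 52, so the result always falls between ML4T-302.csv and ML4T-352.csv.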


Example of the data format:
Date	Open	High	Low	Close	Volume	Adj Close
9/12/12	203.52	204.65	202.96	203.77	3284000	203.77
9/11/12	200.55	203.46	200.51	203.27	3910600	203.27
9/10/12	199.39	201.82	198.73	200.95	4208000	200.95
9/7/12	199.12	199.5	198.08	199.5	3413700	199.5
9/6/12	196.26	199.46	196.11	199.1	3931700	199.1
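
If you prefer not to go through QSTK, rows in this layout could be read with the standard library, as in the sketch below. It assumes the files are comma-separated, that dates use the m/d/yy form shown in the sample, and that rows may be listed newest first; the function name `read_adj_close` is hypothetical.

```python
import csv
import io
from datetime import date, datetime

# Three rows copied from the sample above, written as comma-separated text.
SAMPLE = """Date,Open,High,Low,Close,Volume,Adj Close
9/12/12,203.52,204.65,202.96,203.77,3284000,203.77
9/11/12,200.55,203.46,200.51,203.27,3910600,203.27
9/10/12,199.39,201.82,198.73,200.95,4208000,200.95
"""


def read_adj_close(handle, start, end, date_format="%m/%d/%y"):
    """Return (date, adjusted close) pairs within [start, end], oldest first."""
    rows = []
    for row in csv.DictReader(handle):
        day = datetime.strptime(row["Date"], date_format).date()
        if start <= day <= end:
            rows.append((day, float(row["Adj Close"])))
    rows.sort()  # put the series in chronological order
    return rows


series = read_adj_close(io.StringIO(SAMPLE), date(2012, 9, 11), date(2012, 9, 12))
print(series)
```

The same date filter would apply to the training range (2001 to 2005) and the testing range (2006 to 2007); adjust `date_format` if the actual files store dates differently.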

The data is available here: http://www.quantsoftware.org/CompInvestI-Files/proj3-data.zip

Recommended Structure of Your Program

I recommend that your program follow this outline (pseudo code):

Model building part: Read the training data. The data is formatted in the same way as the stock data, so you can drop the CSV files in the stock data directory and read them in the same way. Build training examples/tuples (i.e., <X1, X2, X3, … Y>) from the data.

  • for file in training files
    • for today in dates
      • use data from today and before only to compute X1, X2, X3, …
      • Y = price[today + 5]
      • record tuple <X1, X2, X3, … Y>
  • learner.addEvidence(X, Y)
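
The inner loop of the outline above might look like the following Python sketch for a single file. The two features (momentum and distance from a moving average) and the `lookback` parameter are illustrative assumptions, not required choices; the point is that X uses only data up to today while Y looks 5 days ahead.

```python
def make_features(prices, today, lookback=5):
    """Hypothetical features using only prices[today - lookback .. today]."""
    window = prices[today - lookback:today + 1]
    mean = sum(window) / len(window)
    x1 = prices[today] / prices[today - lookback] - 1.0  # lookback-day momentum
    x2 = prices[today] / mean - 1.0  # distance from the moving average
    return [x1, x2]


def build_examples(prices, lookback=5, horizon=5):
    """Return (X, Y) lists built from one oldest-first price series."""
    X, Y = [], []
    # Skip the first `lookback` days (no history yet) and the last
    # `horizon` days (no future price to use as Y).
    for today in range(lookback, len(prices) - horizon):
        X.append(make_features(prices, today, lookback))
        Y.append(prices[today + horizon])
    return X, Y


prices = [100.0 + i for i in range(20)]  # made-up series for illustration
X, Y = build_examples(prices)
print(len(X), X[0], Y[0])
```

Calling `build_examples` once per training file and concatenating the results gives the X and Y passed to learner.addEvidence(X, Y).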

Prediction and assessment part: Read the specific file you are to predict and test your learner

  • read in the file
  • for today in dates
    • use data from today and before only to compute X1, X2, X3, …
    • Ypredict = learner.query(X)
    • Yactual = price[today+5]
    • record <Ypredict, Yactual> for later assessment
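
As a rough illustration of the prediction loop, the sketch below pairs a tiny k-nearest-neighbor learner with the addEvidence/query interface used in the outline. Several things are assumptions made for brevity: `query` takes one feature row at a time, the single feature is just today's price, and the price lists are invented.

```python
class KNNLearner(object):
    """Tiny k-nearest-neighbor regressor (illustrative, not optimized)."""

    def __init__(self, k=3):
        self.k = k
        self.X = []
        self.Y = []

    def addEvidence(self, X, Y):
        self.X.extend(X)
        self.Y.extend(Y)

    def query(self, x):
        # Rank stored examples by squared Euclidean distance to x.
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(row, x)), y)
            for row, y in zip(self.X, self.Y)
        )
        nearest = scored[:self.k]
        return sum(y for _, y in nearest) / len(nearest)


def today_price(prices, today):
    """Hypothetical one-feature example: today's price only."""
    return [prices[today]]


def assess(learner, prices, lookback=0, horizon=5):
    """Return (Ypredict, Yactual) pairs for one oldest-first price series."""
    results = []
    for today in range(lookback, len(prices) - horizon):
        x = today_price(prices, today)
        results.append((learner.query(x), prices[today + horizon]))
    return results


train = [float(p) for p in range(10, 20)]
learner = KNNLearner(k=1)
learner.addEvidence([today_price(train, t) for t in range(5)],
                    [train[t + 5] for t in range(5)])
results = assess(learner, [12.0, 11.0, 13.0, 14.0, 12.5, 17.0, 16.0, 18.0, 19.0])
print(results)
```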

You may find the example code in QSTK/Examples/Features/featuretest.py useful for hints on aspects of this project.

To Plot and Calculate

For each set of test data (you will have 2), create these charts:

  • Chart 1: Shows a time series plot of the first 200 days of Yactual and Ypredict (one for each data set)
  • Chart 2: Shows a time series plot of the last 200 days of Yactual and Ypredict (one for each data set)
  • Chart 3: Scatterplot of Ypredict versus Yactual (one for each data set)
  • Chart 4: Shows the values of your first 5 features for the first 200 days all on the same chart (for ML4T-292.csv). If you use fewer than 5 features, just show the ones you use.

Please chart the Yactual in blue, and Ypredict in red.

Calculate the average error (RMS) and correlation for each of the data sets (a total of 4 numbers).
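
A minimal sketch of the two statistics follows. It reads "average error (RMS)" as the root-mean-square difference between Ypredict and Yactual, and "correlation" as the Pearson correlation coefficient; both readings are assumptions, and the sample values are made up.

```python
import math


def rms_error(predicted, actual):
    """Root-mean-square difference between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def correlation(predicted, actual):
    """Pearson correlation; undefined if either sequence is constant."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


y_actual = [1.0, 2.0, 3.0, 4.0]
y_predict = [2.0, 3.0, 4.0, 5.0]
print(rms_error(y_predict, y_actual))
print(correlation(y_predict, y_actual))
```

In this made-up case the predictions are off by a constant, so the correlation is perfect even though the RMS error is not zero, which is why both numbers are reported.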

Deliverables

Submit these files:

  • forecaster.py: Python program that produces the output for the charts listed above.
  • report.pdf: a PDF including:
    • Text describing the method you used for learning, to include indicators you created and the learning method you used.
    • The 7 charts you created
    • The numerical results listed above
    • Why you think your method worked well (or did not)

How to submit

Go to the t-square site for the class, then click on the "assignments" tab. Click on "add attachment" to add your files. Once you are sure you've added the files, click "submit."

Rubric

Scoring the project includes 3 main components:

Part 1: 80% of the project:

  • forecaster.py missing: -50
  • report.pdf missing: -50
  • are all 7 charts/data series present? (-20 for each missing data series)
  • are charts approximately correct? (-5 for each error)
    • note that Ypredict should not appear in the "first 200 days" chart until the look back period is complete
  • "Text describing the method you used for learning, to include indicators you created and the learning method you used." Up to 20 points off if completely wrong
  • "Why you think your method worked well (or not)": Up to 10 points off if completely wrong

Part 2: 10% of the project, correlation with ML4T-292.csv

  • 10 * (correlation + 0.1). So a correlation of .9 or better gets full credit.

Part 3: 10% of the project, correlation with ML4T-3XX.csv

  • 10 * (correlation + 0.1). So a correlation of .9 or better gets full credit.
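
The scoring rule for Parts 2 and 3 can be written as a one-line function. The cap at 10 points follows from "full credit" at a correlation of 0.9; how low or negative correlations are handled is not stated here, so this sketch leaves them unclamped. The function name is hypothetical.

```python
def correlation_points(correlation):
    """Points for Part 2 or Part 3: 10 * (correlation + 0.1), capped at 10."""
    return min(10.0, 10.0 * (correlation + 0.1))


for c in (0.5, 0.9, 0.95):
    print(c, correlation_points(c))
```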

Extra credit:

  • Up to +10 points (see below)

Extra Credit

  • Use the forecaster as part of a trading strategy. Have the forecaster output an orders.csv file, then run that file through your market simulator.
  • Repeat the above, but do it with stock price data instead of our artificial data. You will probably have to create some of your own indicators.
  • Assess your learner more deeply (scatter plots, correlation, etc.)
  • Create a parameterized model (instead of the instance-based approach that we used). Compare the results of this model to the instance-based approach.