2011Fall7646 Project 4


You are to implement two learning systems for classifying news articles about stocks. You will be provided with examples of good articles and bad articles, in the sense that the news reported in an article implies an increase (good) or a reduction (bad) in value (price) for the associated company.

We will prescribe the details of the first system you should implement -- it will be simple, and possibly poorly performing. You are then to modify that learner in some way that you believe will improve its performance. The modification can be as simple as changing the method of scoring articles (one or two lines of code) or as complex as implementing a new method.

The deliverables for this project are the code that implements the above classifiers (the prescribed one as classifynews1.py, and your "improved" version as classifynews2.py) and a report (report.pdf) that describes your approach and results.

The student who creates the best classifier will win a prize!

Command Line and Output

Your program should run according to the following example command line:

python classifynews1.py goodlist.txt badlist.txt testlist.txt

Output should look like this:

 file: goodnews/09130912381.txt
 class: good

 file: badnews/81727321801.txt
 class: good

You should score your classifier as number correct / total articles tested. So, for instance, if your system correctly scored 5 of 7 articles, the score is 5/7.
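
The output format and the score can be sketched as follows. This is only an illustration: the helper names format_result and score are hypothetical, and the labels are assumed to be the strings "good" and "bad".

```python
def format_result(path, label):
    """Return the two-line report for one classified article."""
    return " file: %s\n class: %s\n" % (path, label)


def score(predicted, actual):
    """Return (number correct, total articles tested)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct, len(actual)


# Seven articles, five of them labeled correctly.
predicted = ["good", "good", "bad", "bad", "good", "good", "bad"]
actual = ["good", "good", "bad", "bad", "good", "bad", "good"]
correct, total = score(predicted, actual)
print("score: %d/%d" % (correct, total))
```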

The Data

You will be provided two directories that contain news articles: goodnews and badnews. Each directory contains a set of .txt files that represent news articles. These are the articles that you classified in one of your homework assignments. There are additionally some articles that have been classified by the TAs and the instructor.

For the first experiment, the good news articles are located in /hzr71/research/QSDataLabel/goodnews and the bad news articles are located in /hzr71/research/QSDataLabel/badnews. For the second experiment, all articles are located in /hzr71/research/QSDataLabel/newscred.

You will additionally be provided the following 6 files that define two experiments you should conduct:

  • goodlist1.txt: A list of "good news" files scored by the class.
  • badlist1.txt: A list of "bad news" files scored by the class.
  • testlist1.txt: A list of out of sample files scored by the class.
  • goodlist2.txt: A list of "good news" files scored by the TAs and instructor.
  • badlist2.txt: A list of "bad news" files scored by the TAs and instructor.
  • testlist2.txt: A list of out of sample files scored by the TAs and instructor.
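
The layout of the list files is not described here. The sketch below assumes one article path per line, with blank lines ignored; check the actual files before relying on it. The names parse_file_list and read_file_list are hypothetical.

```python
def parse_file_list(lines):
    """Return the non-empty, stripped entries from an iterable of lines.

    Assumption: each line of a list file holds exactly one article path.
    """
    return [line.strip() for line in lines if line.strip()]


def read_file_list(list_path):
    """Read a list file such as goodlist1.txt and return its article paths."""
    with open(list_path) as handle:
        return parse_file_list(handle)


# Example with in-memory lines instead of a real list file.
sample = ["goodnews/09130912381.txt\n", "\n", "goodnews/09130912382.txt\n"]
print(parse_file_list(sample))
```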

Experiments & Report

Altogether, you will need to run 4 experiments:

  1. python classifynews1.py goodlist1.txt badlist1.txt testlist1.txt
  2. python classifynews1.py goodlist2.txt badlist2.txt testlist2.txt
  3. python classifynews2.py goodlist1.txt badlist1.txt testlist1.txt
  4. python classifynews2.py goodlist2.txt badlist2.txt testlist2.txt

You should report the results of the two testlist1.txt experiments in your report.pdf. For the other two experiments (those using testlist2.txt), you won't know the correct answers for certain (we'll check). Your report should contain the following information:

  1. The results of the experiments outlined above.
  2. A description of your "new & improved" classifier. It should be detailed enough that someone could reproduce your work.
  3. An explanation of why you believe your approach performed better (or worse) than the baseline algorithm. It is OK if your approach ends up not performing as well as the baseline approach.

Baseline Algorithm for classifynews1.py

Use the function hash() to convert each word into a number (hash() is a Python built-in, so no import is needed). You should use the modulo operator "%" to convert these values to a number between 0 and 999 (e.g. hash('joe') % 1000). You should then create a list of 1000 elements in which element i counts the number of times a word that hashes to i appears. Different words can hash to the same element, in which case their counts are combined.
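
A minimal sketch of the hashing step is shown below. The function name bucket_counts and the whitespace split are assumptions made for illustration. Note that in Python, "%" with a positive modulus never returns a negative number, so a negative hash value still lands between 0 and 999. Also be aware that Python 3 randomizes string hashes per process by default, so bucket indices differ between runs unless the PYTHONHASHSEED environment variable is set.

```python
NUM_BUCKETS = 1000  # list size given in the assignment


def bucket_counts(words, num_buckets=NUM_BUCKETS):
    """Count words per hash bucket; words that collide share one element."""
    counts = [0] * num_buckets
    for word in words:
        counts[hash(word) % num_buckets] += 1
    return counts


# Assumption: words are obtained by splitting on whitespace.
words = "profits rose as profits beat estimates".split()
counts = bucket_counts(words)
print(counts[hash("profits") % NUM_BUCKETS])  # at least 2
```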

Your baseline algorithm should follow these steps:

  1. Create "goodlist", a list of 1000 elements that represents the total number of times each word occurs in the good news files.
  2. Create "badlist", a list of 1000 elements that represents the total number of times each word occurs in the bad news files.
  3. Compute goodp and badp, which represent the relative frequency of each word's occurrence in the corresponding files (e.g., goodp = goodlist / sum(goodlist), computed element by element).
  4. Create "weights", a list of 1000 elements that represents the "goodness" of each corresponding word, as follows:
    1. weights = (goodp - badp) / (goodp + badp)
    2. The intent of the above equation is to create a negative value for "bad" words and a positive value for "good" words. In the extreme, the "most" bad word has a weight of -1, and the most good word has a weight of +1
  5. For each article that you classify, follow this procedure:
    1. Process the file into a list of 1000 elements (similar to the earlier steps above) that counts the number of times each word occurs.
    2. Score the list as follows
      1. score = sum(wordlist * weights)
    3. If the score is positive, rate it as a "good" article, otherwise rate it as a "bad" article.
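
The steps above can be sketched in plain Python as follows. This is one possible reading, not a reference solution. The function names are hypothetical, articles are passed in as strings rather than read from files, words are split on whitespace, and an element that no training word hashes to is given a weight of 0, because the weight formula would otherwise divide by zero there. All of these are assumptions.

```python
NUM_BUCKETS = 1000  # list size given in the assignment


def count_words(text, num_buckets=NUM_BUCKETS):
    """Hash each whitespace-separated word into an element and count it."""
    counts = [0] * num_buckets
    for word in text.split():  # assumption: whitespace tokenization
        counts[hash(word) % num_buckets] += 1
    return counts


def relative_frequency(counts):
    """Divide every count by the total; all zeros if nothing was counted."""
    total = float(sum(counts))
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def train_weights(good_texts, bad_texts, num_buckets=NUM_BUCKETS):
    """Steps 1-4: build goodlist, badlist, goodp, badp and weights."""
    goodlist = [0] * num_buckets
    badlist = [0] * num_buckets
    for text in good_texts:
        for i, c in enumerate(count_words(text, num_buckets)):
            goodlist[i] += c
    for text in bad_texts:
        for i, c in enumerate(count_words(text, num_buckets)):
            badlist[i] += c
    goodp = relative_frequency(goodlist)
    badp = relative_frequency(badlist)
    # Assumption: elements never seen in training get weight 0.
    return [(g - b) / (g + b) if g + b > 0 else 0.0
            for g, b in zip(goodp, badp)]


def classify(text, weights):
    """Step 5: score one article and label it "good" or "bad"."""
    wordlist = count_words(text, len(weights))
    score = sum(c * w for c, w in zip(wordlist, weights))
    return "good" if score > 0 else "bad"


weights = train_weights(["profits rose sharply"], ["losses widened sharply"])
print(classify("profits rose", weights))
```

Because goodp and badp are never negative, every weight computed this way lies between -1 and +1, which matches the extremes described in step 4.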

Improvements for classifynews2.py

There are a lot of ways you can improve on the baseline algorithm described above. Here are two ideas:

  1. Modify the size of the lists used to hold the words. Currently this list is 1000 elements. Performance might change if you change this number.
  2. Modify the scoring equation. One idea is to consider that words that occur less frequently should have a greater weight. How might you adjust the equation to reflect this?
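
As an illustration of the second idea, the sketch below scales the baseline weight by a rarity factor. The factor -log((goodp + badp) / 2) is purely an assumption chosen for this example, not a prescribed method. It grows as a word becomes less frequent overall and is never negative, so the sign of the baseline weight is preserved. The name rarity_weights and the sample numbers are hypothetical.

```python
import math


def rarity_weights(goodp, badp):
    """Baseline weight multiplied by an assumed rarity factor."""
    weights = []
    for g, b in zip(goodp, badp):
        if g + b == 0:
            weights.append(0.0)  # assumption: unseen elements get weight 0
            continue
        base = (g - b) / (g + b)            # baseline weight, in [-1, +1]
        rarity = -math.log((g + b) / 2.0)   # larger for rarer words
        weights.append(base * rarity)
    return weights


# Elements 0 and 1 have the same baseline weight (0.5), but element 1 is rarer.
goodp = [0.60, 0.06, 0.34, 0.0]
badp = [0.20, 0.02, 0.78, 0.0]
w = rarity_weights(goodp, badp)
print(w[1] > w[0])
```

Whether a change like this helps is an empirical question; compare its score against the baseline on testlist1.txt.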

Software to write

  • Create two programs classifynews1.py and classifynews2.py


Submit files (attachments) via t-square

  • Your code classifynews1.py and classifynews2.py
  • Your report in report.pdf

How to submit

Go to the t-square site for the class, then click on the "assignments" tab. Click on "add attachment" to add your 3 files. Once you are sure you've added the files, click "submit."