2014Fall7646 Project 1C

From Quantwiki

Important Updates

Project 1 Overview

The objective of this project is for you to develop a system that can classify documents as "good" or "bad" regarding a stock. You will train the system with example good and bad documents. After training, you will test your system with additional documents it has not seen before to assess its accuracy. We will complete this project in a number of short sub-projects.

  • Part A: Convert words to numbers.
  • Part B: Compute tfidf vectors for 7 example documents.
  • Part C: Classify documents as "good" or "bad" according to examples we will provide.

Part C: The Complete Solution

You will be provided 16 documents. 8 of them will be provided in a directory called "bad" and 8 of them will be in a directory called "good." These files are here: Media:Goodbad.zip.

Part 1: Your task is to conduct leave-one-out cross validation as follows: Use tf-idf to classify each document as Good or Bad according to the classification of the most similar document of the other N-1 documents. You should use the instructor's tf-idf methodology to make this determination. Use the cosine method described here to determine document similarity.
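The cosine method mentioned above can be sketched in a few lines. This is a minimal illustration, not the instructor's reference code; it assumes each document's tf-idf vector is represented as a Python dict mapping terms to weights (as in Part B).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two tf-idf vectors.

    u, v: dicts mapping term -> tf-idf weight. Terms absent from a
    vector are treated as having weight 0.
    """
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # avoid division by zero for empty documents
    return dot / (norm_u * norm_v)
```

Two identical vectors (up to scale) yield a cosine of 1.0, while vectors with no terms in common yield 0.0.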

Your program should accept a command line with a variable number of input files, such as:

  • python tdfidf.py good01.txt good02.txt bad01.txt bad02.txt

Your output file should have the following format:

filename, closest match, cosine
good01.txt, good02.txt, 0.945
good02.txt, good01.txt, 0.945
...
bad01.txt, good02.txt, 0.878
bad02.txt, bad01.txt, 0.890
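The leave-one-out loop that produces those rows could look like the sketch below. The function name and the dict-of-vectors input are assumptions for illustration; a small cosine helper is included so the sketch is self-contained.

```python
import math

def _cosine(u, v):
    # Cosine similarity between two term -> weight dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify_leave_one_out(vectors):
    """vectors: dict mapping filename -> tf-idf vector (term -> weight).

    For each document, find the most cosine-similar of the other N-1
    documents. Returns a list of (filename, closest match, cosine) rows
    matching the required output format.
    """
    rows = []
    for name, vec in vectors.items():
        best_name, best_cos = None, -1.0
        for other, ovec in vectors.items():
            if other == name:
                continue  # leave this document out of its own comparison set
            c = _cosine(vec, ovec)
            if c > best_cos:
                best_name, best_cos = other, c
        rows.append((name, best_name, best_cos))
    return rows
```

Writing the rows out as CSV (with the header line shown above) is then a matter of formatting each tuple.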

There is one correct solution to this part of the project.

Part 2: Devise your own metric for assessing the effectiveness/correctness of the above method for classifying documents. Describe it in detail in your report and use it to measure the correctness of the solution in Part 1.
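As one example of what such a metric might look like (you should devise your own), the sketch below scores the fraction of documents whose closest match shares their label. The filename-prefix labeling convention is a hypothetical assumption, matching the good01.txt/bad01.txt naming in the provided files.

```python
def label_of(filename):
    # Hypothetical convention: files from the "good" directory start
    # with "good", files from the "bad" directory start with "bad".
    return "good" if filename.startswith("good") else "bad"

def overall_score(rows):
    """One possible metric: the percentage of documents whose closest
    match has the same label. rows: list of (filename, closest match,
    cosine) tuples from the Part 1 classification."""
    correct = sum(1 for name, match, _ in rows
                  if label_of(name) == label_of(match))
    return 100.0 * correct / len(rows)
```

A score of 100% would mean every document's nearest neighbor came from the same directory.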

Your program should accept a command line with a variable number of input files, such as:

  • python tdfidfmetric.py good01.txt good02.txt bad01.txt bad02.txt

Your output file should have the following format:

filename, closest match, cosine
good01.txt, good02.txt, 0.945
good02.txt, good01.txt, 0.945
...
bad01.txt, good02.txt, 0.878
bad02.txt, bad01.txt, 0.890
overall score: 95.05%

Note the additional, new last output line that includes the overall metric that you developed.

Part 3: Devise your own improvement to the classification algorithm and assess it using the metric you developed in Part 2. Potential improvements include, but are not limited to:

  • Instead of just using words by themselves, create pairs of words.
  • Use a "stemmer."
  • Create a better tf or idf function.
  • Create a better difference/distance metric (replacement for the cosine method linked to above).
  • Consider "voting" methods where each document votes on whether the candidate document is of the same classification.
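As a sketch of the last idea above (one possibility among many, not a required approach), a weighted k-nearest-neighbor vote might look like this. The function name, the k=3 default, and the filename-prefix labeling are all hypothetical choices for illustration.

```python
import math

def _cosine(u, v):
    # Cosine similarity between two term -> weight dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def vote_classify(train, candidate, k=3):
    """train: dict filename -> tf-idf vector; candidate: tf-idf vector.

    The k most similar training documents each cast a vote for their
    own label, weighted by their cosine similarity to the candidate;
    the label with the larger total wins.
    """
    scored = sorted(((_cosine(candidate, vec), name)
                     for name, vec in train.items()), reverse=True)
    tally = {"good": 0.0, "bad": 0.0}
    for sim, name in scored[:k]:
        label = "good" if name.startswith("good") else "bad"
        tally[label] += sim
    return max(tally, key=tally.get)
```

Compared with the single-nearest-neighbor rule from Part 1, this lets several similar documents outvote one unusually close outlier.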

The output of this program can be tailored as you see fit so as to highlight your improved approach.

Part C Deliverables

Please submit your results in 5 separate files as follows:

  • output.csv from Part 2 (see example format above)
  • Python code for Part 1
  • Python code for Part 2
  • Python code for Part 3
  • report.pdf which contains the following:
    • If there is anything unusual or different regarding your solution to Part 1, include a description. We anticipate that most people will not have anything to say here.
    • If you did not get exactly the correct answer for Part 1, explain why you think that is.
    • An explanation of the metric you devised to assess the results of the classification.
    • A description of the improvement you made to the code for Part 3, including why you think it will/should work.
    • If the method did not work, explain why you think that was the case.
  • An assessment and comparison of the "official" Part 1 method and your method from Part 3.
    • Appendix: Your program's output for Part 2.
    • Appendix: Output for Part 3.

How to submit

Go to the t-square site for the class, then click on the "assignments" tab. Click on "add attachment" to add your N files. Once you are sure you've added the files, click "submit."

Part C Extra Credit

Part C Rubric

  • Was all the code included?
  • Did the report contain all the required components?
  • Was the output for Part 1 correct? -1 point for each incorrect "best" match.
  • Was the proposed metric reasonable?
  • Full credit if proposed method improves classification.
  • Full credit if proposed method does not improve classification, but the idea was reasonable and made sense.
  • -10% if the proposed method does not make sense.