2014Fall7646 Project 1B

From Quantwiki

Important Updates

1) Use natural log (log base e)

2) When dividing, use floating point, not integer division

3) People seem to have a lot of questions about the command line, standard out, etc., especially those using Windows. Here are some points regarding that:

a) To create the output file, just use print statements. This sends the output of your program to "standard out", a common Unix term that refers, simply, to the place stuff goes when you print. Here's a link with some more info on that: http://www.diveintopython.net/scripts_and_streams/stdin_stdout_stderr.html

b) Do not worry about processing wildcards (e.g., "*"). Just be sure your program can be executed with a variable number of input files (documents) in an unknown order. For instance, we should be able to run your program with the following command lines:

  • python tfidf.py ballmer01.txt ballmer02.txt ballmer03.txt ballmer04.txt ballmer05.txt ballmer06.txt ballmer07.txt
  • python tfidf.py ballmer04.txt ballmer05.txt ballmer06.txt ballmer07.txt

c) Do not write code that explicitly opens a file named "output.csv"; just print.
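The points above can be sketched as follows. This is a minimal illustration rather than a required structure: the helper names input_paths and column_name are hypothetical, and labelling each column with the file name minus its extension is an assumption based on the sample output format.

```python
import os
import sys

def input_paths(argv):
    """Return the document paths from the command line, in the order given."""
    return argv[1:]  # argv[0] is the script name itself

def column_name(path):
    """Hypothetical helper: 'ballmer01.txt' -> 'ballmer01'."""
    return os.path.splitext(os.path.basename(path))[0]

if __name__ == "__main__":
    paths = input_paths(sys.argv)
    # Write the header row to standard out with print; no output file is opened.
    print(", ".join(["term"] + [column_name(p) for p in paths]))
```

Because the paths come straight from sys.argv, the same code handles seven files, four files, or any other number, in whatever order they are listed.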

Project 1B Overview

The objective of this project is for you to develop a system that can classify documents as "good" or "bad" regarding a stock. You will train the system with example good and bad documents. After the training you will then test your system with additional documents it has not seen before to assess how accurate it is. We will complete this project in a number of short sub-projects.

  • Part A: Convert words to numbers (done).
  • Part B: Create bag-of-words vectors based on tfidf for some example documents (this project).

Part B: Compute tf-idf vector for example files

In this sub-project you will compute the tf-idf value for each word in several small documents and print the result in CSV (comma-separated values) format. Because there are many ways to compute tf and idf, we are going to specify which ones to use for this assignment. For the final part (Part C) you will be free to design your own versions.

Definition of tf(t,d)

The value of term frequency for term t in document d is defined as follows:

  • let count_t be the number of times t occurs in document d
  • let max_w be the maximum number of times any word occurs in document d
  • tf(t,d) = count_t / max_w
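As a rough sketch of this definition, assuming the document has already been split into a list of words (the assignment does not prescribe how to do that, so the whitespace split below is only a stand-in) and using the hypothetical function name term_frequencies:

```python
from collections import Counter

def term_frequencies(words):
    """Map each term t in one document to count_t / max_w."""
    counts = Counter(words)
    max_w = max(counts.values())  # assumes the document is not empty
    # float() forces floating point division, even on Python 2.
    return {t: float(c) / max_w for t, c in counts.items()}

words = "the cat saw the dog".split()  # made-up toy document
tf = term_frequencies(words)
# "the" occurs twice and is the most frequent word, so tf["the"] is 2/2 = 1.0
# "cat" occurs once, so tf["cat"] is 1/2 = 0.5
```

Under this definition the most frequent word in each document always gets tf = 1, and a word that does not occur in the document has tf = 0.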

Definition of idf(t,D)

The value of inverse document frequency for term t in document set D is defined as follows:

  • let N be the number of documents, i.e., |D|
  • let count be the number of documents term t appears in (ranges from 1 to N)
  • idf(t,D) = log_e(N/count)
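A minimal sketch of this definition, assuming each document is given as a list of words. The function name inverse_document_frequencies is hypothetical. Note that math.log with one argument is the natural log, and float() keeps the division from truncating.

```python
import math

def inverse_document_frequencies(docs):
    """docs: list of word lists. Returns {term: log_e(N / count)}."""
    n = len(docs)
    doc_count = {}
    for words in docs:
        for t in set(words):  # set() so a document counts once per term
            doc_count[t] = doc_count.get(t, 0) + 1
    return {t: math.log(float(n) / c) for t, c in doc_count.items()}

docs = [["the", "cat"], ["the", "dog"]]  # made-up toy documents
idf = inverse_document_frequencies(docs)
# "the" appears in both documents: idf["the"] = log(2/2) = 0.0
# "cat" appears in one document:   idf["cat"] = log(2/1)
```

A term that appears in every document therefore has idf = 0, which makes its tf-idf value 0 in every column.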

Definition of tfidf(t,d,D)

  • tfidf(t,d,D) = tf(t,d)*idf(t,D)
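Putting the three definitions together might look like the sketch below. It is one possible arrangement, not the required one: the function name tfidf_vectors, the sorted term order, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of non-empty word lists.
    Returns (terms, vectors), where vectors[i][t] is tfidf(t, d_i, D)."""
    n = len(docs)
    counts = [Counter(words) for words in docs]
    terms = sorted(set(t for c in counts for t in c))  # all unique words
    vectors = []
    for c in counts:
        max_w = max(c.values())
        vec = {}
        for t in terms:
            df = sum(1 for other in counts if t in other)  # documents containing t
            tf = float(c[t]) / max_w  # a Counter returns 0 for absent terms
            vec[t] = tf * math.log(float(n) / df)
        vectors.append(vec)
    return terms, vectors

docs = [["ballmer", "the", "the"], ["the", "stock"]]  # made-up toy documents
terms, vectors = tfidf_vectors(docs)
# Document 0: "ballmer" has tf 1/2 and idf log(2/1); "the" has idf log(2/2) = 0
```

Every document gets a value for every term in the whole document set, so a word missing from a given document simply contributes 0 to that document's vector.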

Code to write

Your program, tfidf.py, should read in the files named on its command line, then output a CSV that reports the tf-idf vector for each document. The output should follow this format:

term, ballmer01, ballmer02, ballmer03, ballmer04, ballmer05, ballmer06, ballmer07
ballmer, 0.9823, 0.1000, 0.2837, 0.0040, 0.8912, 2.7982, 3.2343
the, 0.0823, 0.9000, 0.2837, 0.0040, 0.8912, 2.7982, 3.2343

The first row is a header. After "term", it lists the names of the documents for which the vectors are computed. Each subsequent row begins with a term (word), followed by that term's tf-idf value in each document, so each column holds the vector for the corresponding document. The numbers shown above only illustrate the format; they are not actual results.
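One way to produce rows in this layout is sketched below. The function name format_csv is hypothetical, the four decimal places are an assumption taken from the sample above, and the values are made up. The sketch also assumes that no term contains a comma.

```python
def format_csv(names, terms, vectors):
    """names: column labels; terms: row labels;
    vectors: one {term: value} dict per document, in column order."""
    lines = [", ".join(["term"] + list(names))]
    for t in terms:
        lines.append(", ".join([t] + ["%.4f" % v[t] for v in vectors]))
    return "\n".join(lines)

names = ["ballmer01", "ballmer02"]
terms = ["ballmer", "the"]
vectors = [{"ballmer": 0.5, "the": 0.0},
           {"ballmer": 0.25, "the": 0.0}]  # made-up values
print(format_csv(names, terms, vectors))
```

Building the text first and printing it in one place keeps all output going to standard out.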

The documents you should use are included in a zip file: Media:Ballmer.zip. The files are named ballmer01.txt ... ballmer07.txt.

Other notes:

  • Command line usage: Your program should be able to accept a variable number of input documents, e.g.:
    • python tfidf.py ballmer01.txt ballmer02.txt ballmer03.txt ballmer04.txt ballmer05.txt ballmer06.txt ballmer07.txt
    • python tfidf.py ballmer04.txt ballmer05.txt ballmer06.txt ballmer07.txt
  • Your program should print its output to standard output.
  • To create the file output.csv, which contains the tfidf vectors for each file, redirect standard output to that file when you run the command (for example, by appending "> output.csv" to the command line).

Part B Deliverables

Submit files as individual attachments via t-square. Please do not combine your files into a zip archive.

Run your code using the files in the zip file linked to above.

  • Your code in tfidf.py
  • A SINGLE report (in a PDF file), report.pdf, including:
    • The name of the course, the project name, and your name.
    • One or two paragraphs explaining how you solved the problem in the assignment.
    • The output of tfidf.py.
  • Your numerical results in the file output.csv

How to submit

Go to the t-square site for the class, then click on the "assignments" tab. Click on "add attachment" to add your N files. Once you are sure you've added the files, click "submit."

Part B Extra Credit

  • None yet

Part B Rubric

  • report.pdf
  • tfidf.py
  • output.csv should include
    • all unique words
    • one column for each file, plus one for the words
    • correct tfidf value for each word/document pair
  • more detail to be added here for rubric