QSTK Tutorial 6



In this tutorial we look at using QSTK to work with Compustat data. If you don't have the data yet, see Getting Compustat Data first. The same DataAccess API is used, so the learning curve should be small if you are already familiar with manipulating price data.


Module loading, installation, and the DataAccess API were covered in previous tutorials, so we skip most of that here and only touch on what is new.

Make sure you have installed QSTK correctly. For instructions, visit QSToolKit_Installation_Guide.

Getting Available Items

With previous data access objects, which involved only a few data types (e.g. open, close), the data was requested with one of these labels hard-coded. Because Compustat has several hundred data items available, get_data_labels() returns all available data items. It is helpful to get this list and create a dictionary mapping each label to its index, as shown below.

lsItems = compustatObj.get_data_labels()
dLabel = dict( zip(lsItems,range(len(lsItems))) )
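
As a minimal sketch of this mapping, the snippet below uses a hand-written stand-in list in place of a real get_data_labels() call. Only 'gvkey' appears in this tutorial; the other label names are placeholders and will not match what your Compustat data returns.

```python
# Stand-in for compustatObj.get_data_labels(). Only 'gvkey' is taken from
# this tutorial; the other names are placeholders for illustration.
lsItems = ['gvkey', 'item_a', 'item_b']

# Map each label to its position in the list.
dLabel = dict(zip(lsItems, range(len(lsItems))))

# Equivalent construction using enumerate.
dLabelAlt = dict((sLabel, i) for i, sLabel in enumerate(lsItems))

print(dLabel['gvkey'])
```

Either form gives a constant-time lookup from a label to its index, which avoids scanning the label list every time an item is needed.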

As before, an individual timestamp is required for each day for which data is requested. However, when requesting data that is not tied to trading days, such as Compustat items, it is safer to include non-trading days as well. The following code creates timestamps for every day in the last 5 years. Note that the time of each timestamp must be 1600 hours.

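import datetime as dt

# End at 16:00 today and start five years earlier.
dtEnd = dt.datetime.combine( dt.datetime.now().date(), dt.time(16) )
dtStart = dtEnd.replace( year=dtEnd.year-5 )

# One timestamp per calendar day, non-trading days included.
tsAll = [ dtStart ]
while tsAll[-1] < dtEnd:
    tsAll.append( tsAll[-1] + dt.timedelta(days=1) )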
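
One caveat with shifting the year via replace() is that it raises ValueError when the end date is February 29 and the start year is not a leap year. The sketch below wraps the same idea in a helper that falls back to February 28 in that case. The function name daily_timestamps and the fallback choice are assumptions for illustration and are not part of QSTK.

```python
import datetime as dt

def daily_timestamps(dtEnd, nYears):
    """Return one timestamp per calendar day from nYears before dtEnd up to dtEnd."""
    try:
        dtStart = dtEnd.replace(year=dtEnd.year - nYears)
    except ValueError:
        # dtEnd is Feb 29 and the start year is not a leap year: use Feb 28.
        dtStart = dtEnd.replace(year=dtEnd.year - nYears, day=28)
    nDays = (dtEnd - dtStart).days
    # The time of day (16:00 here) is carried over from dtEnd unchanged.
    return [dtStart + dt.timedelta(days=i) for i in range(nDays + 1)]

# A fixed leap-day end date keeps the example reproducible.
dtEnd = dt.datetime(2012, 2, 29, 16)
tsAll = daily_timestamps(dtEnd, 5)
print(tsAll[0], tsAll[-1], len(tsAll))
```

Building the list from a day count also avoids relying on the loop landing exactly on the end timestamp.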

The entire set of data labels can now be requested at once with the following code. It is recommended to read all needed data in a single call to minimize disk accesses.

dmValues = compustatObj.get_data( tsAll, symbols, lsItems)

This returns a list of DataMatrices, which can be indexed by position, from 0 to len(lsItems) - 1, or through the dictionary created earlier, e.g.

dmKeys = dmValues[ dLabel['gvkey'] ]

The tutorial code then does some basic analysis of the Compustat data for a few large-cap stocks and prints out some statistics. It shows which data items are commonly populated in the financial reports of large companies. This can help guide your machine learning efforts, since you most likely want to avoid sparse data features. Please refer to your Compustat documentation for descriptions of all data items.
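
To illustrate the kind of statistic involved, here is a rough sketch that measures how populated each data item is. It is not the tutorial's own analysis code. It uses plain nested lists as a stand-in for the DataMatrices, assumes that unpopulated values appear as NaN (which this tutorial does not state), and uses made-up labels and numbers. The helper name fraction_populated is hypothetical.

```python
import math

fNaN = float('nan')

# Stand-in data: one table per data item, rows are days, columns are symbols.
# Labels and numbers are made up for illustration.
dTables = {
    'item_a': [[1.0, 2.0], [fNaN, 2.5], [1.2, fNaN]],
    'item_b': [[fNaN, fNaN], [fNaN, 0.3], [fNaN, fNaN]],
}

def fraction_populated(llRows):
    """Return the share of cells in a table that are not NaN."""
    lfCells = [fVal for lRow in llRows for fVal in lRow]
    if not lfCells:
        return 0.0
    nValid = sum(1 for fVal in lfCells if not math.isnan(fVal))
    return nValid / float(len(lfCells))

dCoverage = dict((sLabel, fraction_populated(llRows))
                 for sLabel, llRows in dTables.items())

# List items from most to least populated.
for sLabel in sorted(dCoverage, key=dCoverage.get, reverse=True):
    print('%s: %.2f' % (sLabel, dCoverage[sLabel]))
```

Items with a low populated share are candidates to drop before using the data as machine learning features.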

Our First Compustat Figure

A plot of EPS data.

A simple figure is then plotted using techniques covered previously. Quarterly earnings per share are plotted in the figure. Note that the values change quarterly, since that is the reporting period, and that the reporting dates are not the same for all stocks.

It is important to note that this data is sparse in time: with daily timestamps, roughly 365 rows per year are used to store 4 rows of actual data. A more efficient implementation may be required.
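
One possible direction, sketched below under stated assumptions, is to store only the days on which a value changes. This is not how QSTK stores the data. The helper name compact_series is hypothetical, the series is a plain list standing in for one column of a DataMatrix, and missing days are assumed to be NaN.

```python
import datetime as dt
import math

def compact_series(lTimestamps, lfValues):
    """Keep only the (timestamp, value) pairs where the value changes."""
    lCompact = []
    fPrev = None
    for ts, fVal in zip(lTimestamps, lfValues):
        if math.isnan(fVal):
            continue  # skip days with no data
        if fPrev is None or fVal != fPrev:
            lCompact.append((ts, fVal))
            fPrev = fVal
    return lCompact

# Six daily rows (made-up numbers) that hold only two distinct reports.
dtStart = dt.datetime(2011, 1, 1, 16)
lTs = [dtStart + dt.timedelta(days=i) for i in range(6)]
lfEps = [float('nan'), 0.5, 0.5, 0.5, 0.7, 0.7]

lCompact = compact_series(lTs, lfEps)
print(lCompact)
```

A limitation of this approach is that two consecutive reports with identical values collapse into one entry, so it only suits cases where the report dates themselves do not need to be preserved.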