The purpose of this tutorial is to illustrate how to use SAS to get data from Compustat, and how to perform some basic data management.
We will compute the market-to-book ratio and return on assets for Google and compare it to its industry median for each of the years that Google has been listed.
For the market-to-book ratio we need the number of shares outstanding, end of year stock price, and book value of equity. For return on assets we will use net income and end of year total assets. We also need to know Google's industry membership. We will use SIC and download all firm-years with the same SIC.
The data we need is in Compustat Fundamental Annual (FUNDA). For an overview of all the variables, go to:
- WRDS website (wrds.wharton.upenn.edu), login
- Support (top menu)
- Dataset List (second column, middle)
- Compustat (/wrds/compa) - COMP:NAA [ /wrds/comp/sasdata/naa ] (North America, Annual update)
- FUNDA Merged Fundamental Annual File (second dataset from the top)
I select the following variables:
- gvkey: Global Company Key
- datadate: Data Date (last day of the month of the end of the firm's fiscal year)
- fyear: Fiscal Year
- sich: Standard Industrial Classification - Historical
- ceq: Common/Ordinary Equity - Total
- prcc_f: Price Close - Annual - Fiscal
- csho: Common Shares Outstanding
- ni: Net Income (Loss)
- at: Assets - Total
FUNDA holds various firm identifiers. Like a person has a name, a phonenumber, a social security number, an employee or student number to identify the person, firms have
several identifiers. FUNDA holds the following:
- gvkey: Global Company Key
- tic: Ticker Symbol
- cusip: CUSIP
- conm: Company Name
- CIK: CIK Number
First, we need to find out the value of any of these firm identifiers for Google. Let's use the lookup function from WRDS, this will give us Google's SIC as well. Go to:
- Wrds home
- Dropdown upper left: ‘select a data set’ => Compustat
- Click left panel: North America
- North America Web Query Forms: Fundamentals Annual (first option, upper main panel)
- In web form, scroll to ‘Step 2’, there is a ‘code lookup’ hyperlink
- Enter ‘Google’, click ‘Lookup Codes’
Google's gvkey is "160329" and SIC is 7370. Note that gvkey is text (a string) and SIC is a number.
Let's start by downloading the data items for Google. First, we need to assign a library that holds the dataset on our computer. We can give it any name; I am using 'example' as the library name. I am using "C:\temp\google_mtb\", but you can change this. Make sure the directory exists. To run code on SAS, select the code to be executed and press F3. Pressing F3 when no code is selected will execute the current file in the editor.
You can download the sascode here (right-click and save target as).
libname example "C:\temp\google_mtb\";
Since the databases are on the servers of WRDS, we need to connect to their server, and execute SAS code remotely. We will select all data from FUNDA where gvkey equals "160329", and then download the resulting dataset to our computer. Code that is within 'rsubmit' and 'endrsubmit' is executed remotely.
/* details about the PC-connect (this is always the same), it will trigger a login prompt */ %let wrds = wrds.wharton.upenn.edu 4016;options comamid = TCP remote=WRDS; signon username=_prompt_; /* everything after 'rsubmit' will be executed on WRDS!! */ rsubmit; /* data is organized in libraries, locally, as well as on WRDS tell SAS to look in '/wrds/comp/sasdata/naa' when you use 'comp' as the library name */ libname comp '/wrds/comp/sasdata/naa'; /* This code: "and indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C'" INDFMT is the Industry Format code that indicates whether a company reports in a Industrial format (INDL). DATAFMT is the Data Format code that indicates how the data is collected and presented, with Standardized (STD) being the common and useful type. POPSRC indicates the Population Source of the data as Domestic (D) or International (I), and ‘D’ includes Canadian companies. CONSOL shows the Level of Consolidation that identifies whether a company’s financial statements represent consolidated (C) or nonconsolidated information (N), or another type. Additional CONSOL cases allow parents and subsidiaries to have the same GVKEY and can show PRE_FASB and PRE-AMEND cases, but failing to screen by this code may produce duplicate records for a GVKEY-DATADATE. Detailed info can be found in the dataguide on compustat.com/dataguide For example: INDFMT = INDL includes industrial firms (includes companies reporting manufacturing, retail, construction and other commercial operations other than financial services). It can also be “FS” for financial services, “BANK” and “ISSUE”, for issue-level data. */ PROC SQL; create table example (keep = gvkey datadate fyear sich ceq prcc_f csho ni at) as select a.* from comp.funda a where a.gvkey = "160329" and indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C' ; quit; /* since the dataset is created on the WRDS server, it needs to be downloaded in order to access it locally; download the dataset work.example from the server to your computer */ proc download data= example out=example.a_google; /* 'run' tells SAS to start executing the submitted statements */ run; /* stop processing statements on WRDS and continue working locally */ endrsubmit;
Visually inspect the dataset by clicking on the dataset example.a_google. If it does not exist, then examine the log. Sometimes, WRDS performs maintenance (see announcements on their website). A more likely candidate for errors are a typo in the username/password, or a mistake in the code.
We see that 2004 is the first year where Google has an end of year stock price. The years before that are pre-IPO where Compustat has taken the data from the IPO prospectus, Form S-1. If you want to verify the numbers, take a look at the prospectus: http://www.sec.gov/Archives/edgar/data/1288776/000119312504073639/ds1.htm .
Next, we download the same data items for the other firms with the same SIC code for the years 2004 and later.
rsubmit; libname comp '/wrds/comp/sasdata/naa'; PROC SQL; create table ex7370 (keep = gvkey datadate fyear sich ceq prcc_f csho ni at) as select a.* from comp.funda a where a.sich = 7370 and a.fyear >= 2004 and indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C' ; quit; proc download data= ex7370 out=example.b_7370; run; endrsubmit;
We collected the data, so the remaining steps can be done on our own computer. First, for all firm-years we compute the ratios. We use a 'data step' to do that. The dataset following "data" is the name of the dataset we create. The "set" dataset is the source that is used to create this.
The lines where "if mtb ne ." and "if roa ne ." require that these variables are not equal to 'missing'. In other words, all firm-years where either variable is missing are dropped from the sample.
* compute the ratios for Google's firm years; data example.a_google_2; set example.a_google; mtb = prcc_f * csho / ceq; if mtb ne .; roa = ni /at; if roa ne .; run; * compute the ratios for the other firms; data example.b_7370_2; set example.b_7370; mtb = prcc_f * csho / ceq; if mtb ne .; roa = ni /at; if roa ne .; run;
For each year, we want to compare Google's ratios with the median firm. We choose the median because outliers can have a large influence on the average. We will use "proc means" to compute the median value of MTB and ROA for each year. When we want proc means to compute variables by some dimension (in this case fiscal year), we first need to sort the dataset.
Notice that the proc means code includes: 'where gvkey ne "160329";'. This means that Google is excluded from the calculation. Although the effect of Google will be small on the median, it is nevertheless cleaner.
The line with OUTPUT OUT tells SAS to create a new dataset, that will include the mean, median and the number of observations for each variable, for each year.
'by fyear' tells SAS to repeat the procedure for each year. This is important, as we don't want a single median for the whole period. Instead, we want a median for each year.
***************************** figure out the mean, median - by year; proc sort data = example.b_7370_2; by fyear;run; /* This creates a sample with the mean and median by year */ proc means data=example.b_7370_2 n mean median; where gvkey ne "160329"; * exclude Google; OUTPUT OUT=example.c_7370 mean= median= N=/autoname; var mtb roa; by fyear ; run;
We now have two datasets we need to join. We have a set (example.a_google_2) where each observation holds mtb and roa for one year for Google. And, we have the dataset with the industry medians (example.c_7370), where each line holds mtb and roa for one year.
I use proc sql to match these two sets. Notice the "a" and "b" in this line: 'from example.a_google_2 a, example.c_7370 b'. This enables me to use "a" as short for "a_google_2", and "b" as short for "c_7370". That means, for example, that I can write "a.fyear = b.fyear" instead of "a_google_2.fyear = c_7370.fyear ".
***************************** join the sets; proc sql; create table example.d_final as select a.fyear, a.mtb as mtb_google, a.roa as roa_google, b.mtb_median as mtb_industry, b.roa_median as roa_industry from example.a_google_2 a, example.c_7370 b where a.fyear = b.fyear; quit;
Finally, we export the dataset from SAS and write it as a textfile, to be processed further. In this case, I will use gle-graphics to create graphs.
***************************** export; %macro myExport(dset=, file=); %let filename = "&file"; PROC EXPORT DATA= &dset OUTFILE=&filename DBMS=CSV REPLACE; RUN; %mend; %myExport(dset=example.d_final, file=C:\temp\googlemtb.dat);
The market-to-book ratio and return on assets for Google versus the industry median:
The graph indicates that Google's market-to-book ratio has dropped significantly after its IPO (which is typical). In 2010 it is still well above the industry median.
Google's performance (ROA) is strong compared with the industry median in each of the years.
|Other Tutorials WRDS|
|WRDS using SAS: Getting started|
|Compustat: compute ROE|
|Restatement event study|
|Estimating the earnings response coefficient (ERC)|
|Latest forum posts|
|Have trouble with replicating the RESTATEMENT example by sxxapple|
|how to read restatement data from GAO website? by Clark|
|winsor by none? by Clark|
|the tutorial to generate inflation adjusted time series. by pwyw000|
|Restatement Example Regression Results by Zenghui|
|Generate our own erdport1 database from DSF file by Zenghui|
|ROE example by Zenghui|
|SAS SPEDIS function by Zenghui|