Thanks - wasn’t aware of the 3rd option (proc expand). Looks like proc expand will bite dust (just like lag()) if a year is not included (in that case, for example a lag of 1 will be the value of 2 years ago)

Here’s how to get lags in the presence of BY groups:

data want;
set msf; ** say the CRSP monthly stock file **;
by permno;
prior_prc=lag(prc);
if first.permno then prior_prc=.;
run;

** Now what if you want lags of more than one period? There are many ways, but this might be termed the “classic” SAS method (before DOW loop was popularized).

In it you create an “order” variable which is set to 1 for the first record of each PERMNO, then 2, 3, etc. Once you know the record-in-hand has order of (say) 2 then you know you must reset all lags beyond one period to missing (.). In this case, you have to RETAIN the order variable - otherwise SAS resets it to missing at the top of each iteration of the DATA step **;

data want (drop=I);
set msf;
by permno;
retain order;
if first.permno then order=1;
else order=order+1;

** Now if we’re at or near the beginning of a PERMNO group, reset lags to missing **;
array priors {4} prior1-prior4;
if order<=4 then do I=order to 4;
priors{I}=.;
end;
run;

BTW, regarding the LAG function, be carefull about using it in an IF condition. Do NOT assume that “if ... then x=lag(x)” will examine the IF clause and then get the X from the prior record when the IF condition is meant. It won’t. It will get the X from the prior time that the IF condition was filled, which might be several obserations previously. That’s why this program ALWAYS creates the lag first, then resets to missing when appropriate.

Sorry I am new to SAS but am having issues working out the code for what I need.

I have a large dataset (Initialtesting), and one of the variables I have is yearly sales revenue (label=sale). I need to come up with a lag measure for this variable, (sales from the previous year). I also need to come up with (sales from the year before that).

The difficulties arise because:

* I have multiple organisations (label=comn) over multiple periods (end of financial year information, label=fyear) , so I need SAS to take this into account.

*I have missing data, and I want SAS to skip where there is missing information.

You can use the ‘data step’ to do this with the ‘lag’ function. My preferred way of doing this is with proc sql, especially if you need to go back multiple years.

The ‘trick’ is to do a join with the table itself, like so:

proc sql; create table newTable as /* a.*: select all variable you already have (appending variables) */ select a.* , /* give unique names; fyear_prev is for debugging (easy to make mistakes with the a.fyear-1 at the end*/ b.sales as sales_prev, b.fyear as fyear_prev from yourtable a left join yourtable b on /* same firm (obiously) */ a.gvkey = b.gvkey /* the data from b will be 1 year before the data in a*/ and a.fyear-1 = b.fyear; quit;

Where gvkey is a firm identifier, and fyear is the year. I used ‘left join’ so to keep observations for which the previous year’s data is missing. Repeat the same proc with ‘and a.fyear - 2 = b.fyear’ to append the sales for 2 years ago.

Make sure that each gvkey-fyear is unique. Also, give the variables unique names (you are appending variables to the original table, if you give the same name as an existing variable, it won’t work as planned).

I find using PROC SQL to be far slower than the data step. After all PROC SQL must compare every pair of records to select the matched cases.

But if the data are sorted, then the DATA step (with a LAG function), need only compare consecutive observations - that’s N comparisons instead of N-squared. The bigger the dataset, the bigger this disparity will be.

Let’s say your data is annual, and is sorted by GVKEY/FYEAR. Put aside for the moment the possibilty that companies can change fiscal year’s (so there can be two records with the same FYEAR). The following code gets lagged values (1-year and 2-year lag), but leaves missing values when there is a missing year.

data want; set have; by gvkey fyear;

sales1=ifn(gvkey=lag(gvkey) and year=lag(year)+1,lag(sales),.); /* SALES from prior record, IFF it has the same GVKEY and is exactly 1 year old */

sales2=ifn(gvkey=lag2(gvkey) and year=lag2(year)+2,lag2(sales),.); run;

And how do you handle a company changing fiscal years - resulting in two records for a given GVKEY/FYEAR? Just make a policy on whether you want the first, or last record for that FYEAR. If you want the last such record, then right after by “BY GVKEY FYEAR” statement, add this one:

I find using PROC SQL to be far slower than the data step. After all PROC SQL must compare every pair of records to select the matched cases.

But if the data are sorted, then the DATA step (with a LAG function), need only compare consecutive observations - that’s N comparisons instead of N-squared. The bigger the dataset, the bigger this disparity will be.

Let’s say your data is annual, and is sorted by GVKEY/FYEAR. Put aside for the moment the possibilty that companies can change fiscal year’s (so there can be two records with the same FYEAR). The following code gets lagged values (1-year and 2-year lag), but leaves missing values when there is a missing year.

data want; set have; by gvkey fyear;

sales1=ifn(gvkey=lag(gvkey) and year=lag(year)+1,lag(sales),.); /* SALES from prior record, IFF it has the same GVKEY and is exactly 1 year old */

sales2=ifn(gvkey=lag2(gvkey) and year=lag2(year)+2,lag2(sales),.);

run;

And how do you handle a company changing fiscal years - resulting in two records for a given GVKEY/FYEAR? Just make a policy on whether you want the first, or last record for that FYEAR. If you want the last such record, then right after by “BY GVKEY FYEAR” statement, add this one:

Thank you for all your help, it worked like a charm.

Just a follow up question. I was warned that there were similar issues with using diff functions as lag functions.

If I am using the same dataset, and want to calculate differences between years, e.g. difference in sales year to year, considering the difference firms (e.g. using gvkey) and time periods, how would I go about this?

Indeed with getting the lagged (or lead) variables, you need to take care it is for the same firm. But, once you have these variables, there are no issues/complications for computing the difference, for example: sales_ch = sales - sales_prev (where sales is sales for the current year, and sales prev is lagged sales).

Indeed with getting the lagged (or lead) variables, you need to take care it is for the same firm. But, once you have these variables, there are no issues/complications for computing the difference, for example: sales_ch = sales - sales_prev (where sales is sales for the current year, and sales prev is lagged sales).

best regards,

Joost

Hey thank you. However, I was wondering how to calculate differences between variables over years. E.G. one of my variables is total assets. How do I calculate differences between years, considering I have multiple firms. Can it be done without lagging and the. Subtracting because I don’t need the lagged variables themselves.

Sorry to keep bugging you. I am still getting the hang of this

sure, no problem. This is probably helpful for others as well.

The compute the change in assets, you need:
- retrieve gvkey, fyear, and at (assets)
- to get lagged assets (either use proc sql, or the data step as pointed out by sassafras)
- subtract lagged assets from current assets

You could combine steps 2 and 3 in a single step (which makes it a bit harder to read). You can also drop the lagged variables once you are done with them.

proc sql; create table newTable as select a.* , /* table a is current year, table b is lagged year, so a.at - b.at is change in assets */ (a.at - b.at) as at_ch from yourtable a left join yourtable b on a.gvkey = b.gvkey and a.fyear-1 = b.fyear; quit;