How to get leads via a single DATA step?
Posted: 17 August 2012 10:19 PM   [ Ignore ]
Newbie
Rank
Total Posts:  8
Joined  2012-08-15

** Let’s say you have daily (trading-date) data with variables DATE and PRC in data set HAVE, and you want, for each day, NEXT_PRC, the price on the following trading day.

But while there is a LAG function in SAS, there is no LEAD function.  You have to “read ahead” using a different technique, like this:

data want;
    
set have  end=end_of_have;
    if 
end_of_have then next_prc=.;
    else 
set have (firstobs=2 keep=prc rename=(prc=next_prc));
run

The technique is to use two SET statements.  One does not affect the other - they are separate “streams” of data.  The main difference is that the second SET statement has a “firstobs=2” data set parameter - which tells SAS to start that stream at the second observation.  That is, to “read ahead”.  So while the first set reads observation I, the second reads observation I+1.

The second set statement is told to read in only the PRC variable (“keep=prc”), but of course if it keeps that name, PRC from the second SET would overwrite the PRC from the first.  So it is also told to rename prc to NEXT_PRC.  Result: 3 variables:  DATE, PRC and NEXT_PRC.

Now what is this “end_of_have” stuff?  The “end=end_of_have” option on the first SET statement says to make a (temporary) dummy variable which is a 1 when the record-in-hand is the last record in HAVE (otherwise it’s a zero).  That’s neccessary because when the first SET HAVE reads the last case, the second SET HAVE would attempt to read beyond the last case and SAS would instantly stop, without writing out a record.

That is, if HAVE had 100 records, WANT would have only 99.  So the “else if” above means the second SET runs only when the first has not hit the end, and therefore will not prematurelhy end the data step.  As a result WANT will have 100 observations. 

The only other note to make is that when the last case is in hand, you want NEXT_PRC to be a missing value.  Hence the “if end_of_have then next_prc=.;”.  Without this, the last case would have a NEXT_PRC value carried over from the “penultimate” record.

In my next entry I’ll deal with leads in the presence of BY groups.

Profile
 
 
Posted: 18 August 2012 06:15 AM   [ Ignore ]   [ # 1 ]
Administrator
Avatar
RankRankRankRank
Total Posts:  901
Joined  2011-09-19

Thanks! Very useful!!

 Signature 

To reply/post new questions: Please use the group WRDS/SAS on Google Groups! http://groups.google.com/d/forum/wrdssas

Profile
 
 
Posted: 18 August 2012 10:26 AM   [ Ignore ]   [ # 2 ]
Newbie
Rank
Total Posts:  8
Joined  2012-08-15

In the above post, there was an assumption that dataset HAVE only contained a single entity (for instance, a single PERMNO in the CRSP database).  But most WRDS data has data sorted by PERMNO (or some other ID like GVKEY, CUSIP, etc.), then DATE. So using the technique above would wrongly give the last record from one PERMNO a lead value taken from the first record in the next PERMNO.

Now, if you’re just doing a single period lead, then the above code could be modified to:

data want;
   
set have end=end_of_have;
   
by permno;
   if 
end_of_have then next_prc=.;
  else 
set have (firstobs=2 keep=prc rename=(prc=next_prc));
  if 
last.permno then next_prc=.;
run

The only new components here are the “BY PERMNO” and “IF LAST.PERMNO ...” statements.  “BY” statements can be use in both sas PROC’s and sas DATA steps.  In both cases, you are telling SAS to expect the data to be ordered in ascending PERMNO groups.  For PROC’s it tells SAS to run the proc once for each BY group.

But in the DATA step case, it tells SAS to create 2 temporary dummy variables: first.permno and last.permno.  These variables tell you whether the record in hand is the first or last record for a given PERMNO.  As a result, the “if last.permno” statement is an obviously easy way to avoid wrongly getting a NEXT_PRC value from the subsequent PERMNO.

However the technique of “if last.permno ...” and “if end_of_have ...” won’t hold up if you need a lead of 2 periods or more.  That will require a somewhat different approach.

Profile
 
 
Posted: 18 August 2012 08:23 PM   [ Ignore ]   [ # 3 ]
Newbie
Rank
Total Posts:  8
Joined  2012-08-15

If you’ve read the earlier parts of this thread but you want (say) leads of the PRC and VOL variables for 1 through 3 periods (call them prc1, prc2, prc3, and vol1, vol2, vol3), taken from a data set sorted by PERMNO, DATE, then you need another approach, like this:

data want (drop=_p:);
  
set have nobs=nrecs;
  
by permno;

  if 
_n_<= nrecs-1 then set have (firstobs=2 keep=permno prc vol rename=(permno=_p1 prc=prc1 vol=vol1));
  else 
_p1=.;
  if 
_n_<= nrecs-2 then set have (firstobs=3 keep=permno prc vol rename=(permno=_p2 prc=prc2 vol=vol2));
  else 
_p2=.;
  if 
_n_<= nrecs-3 then set have (firstobs=4 keep=permno prc vol rename=(permno=_p3 prc=prc3 vol=vol3));
  else 
_p3=.;

  array 
id {*} _p1-_p3;
  array 
pr {*} prc1-prc3;
  array 
vl {*} vol1-vol3;

  do 
I=1 to dim(id);
    if 
id{I}^=permno then do;
      
pr{I}=.;
      
vl{I}=.;
    
end;
  
end;
  
drop I;
run

This program has one additional SET statement for each lead period, using progressively larger FIRSTOBS values.  Using the “last.permno” technique above won’t be sufficient in this case, since it is not adequate for timely resetting of 2-period (or longer) leads.

So this time, instead of reading in only the variables (PRC and VOL) for which a lead is wanted, the added SET statements also read in the value for PERMNO.  These lead PERMNO values (_p1, _p2, ...) will be compared with the current PERMNO to determine whether the corresponding PRC and VOL leads should be set to missing.

Also, instead of using the “end=” parameter in the primary SET statement, this example uses “nobs=nrecs”, which creates a temporary variable named NRECS containing the number of observations in HAVE.  So if HAVE has 100 observations (i.e. NRECS=100), then you know that the last 1-period lead to be read in should be at observation 99, the last 2-period lead at 98, etc.

In this program you can use the automatic variable _N_ to know which observation is the record in hand (actually _N_ counts the number of iterations of the DATA step, but in this example it means the same thing).  So (for 100 observations in HAVE), you want to stop reading 1-period leads at _N_=99, stop reading 2-period leads at _N_=98, etc.  That’s what the “if _N_<= ...” statements are for.

That leaves only the problem of correcting lead values when the read-ahead technique gets data from the next permno.  Basically you need a set of statements like:

if permno^=_p1 then do; prc1=.; vol1=.; end;
if 
permno^=_p2 then do; prc2=.; vol2=.; end;
   ... 
etc... 

That’s done via the ARRAY statements in combination with the DO loop which iterates from the 1st to last elements of each of the arrays, in parallel, producing the needed “if permno^=” statements.

Profile