Category: Data management

Author resource: Joost Impink


C++ code to compute Herfindahl indices. The Herfindahl index is computed at the industry level, as well as at the firm level (where it is based on other firms in the industry).

To use the Herfindahl index as an instrument in 2SLS/Heckman/propensity score matching, the firm-effect on the Herfindahl index is removed. In other words, it is computed using the other firms in the same industry.

Sample usage (below) includes SAS code to create a dataset with firms' segment sales. Firms with no segment disclosures are assumed to be single segment firms (for these firms, the Funda sales and industry codes are used).

At the end of the C++ code several 'tags' are included by the forum engine. These should not be copied. For brevity, there are no mechanisms included to detect exceptions. For example, C++ won't be able to overwrite the output file if it is also open in Excel.

Why not compute the metrics in SAS? The C++ code capitalizes on modifying the datasets, while in SAS a copy of the dataset is made (which is slow). The C++ program finishes in 12 seconds.

The generated Herfindahl indices are matched onto the Funda/Segment files. The Herfindahl measures are computed at the segment level. Multiple-segment firms thus compete in several industries. Firmlevel Herfindahl indices are the value-weighted average of the segments' Herfindal indices (value weightin using segment sales).


	File: 			Herf.cpp
	Created by:		Joost Impink
	Date: 			July 2012
	Purpose:		Compute Herfindahl index at industry level, and firm level
					(At firm level the firm itself is excluded, and only other
					firms in the same industry-year are used)

using namespace std;

// filenames
const string inputFile   = "/media/vboxshared/sas_exports/Herf_SAS_export.txt";
const string outputFile  = "/media/vboxshared/sas_exports/Herf_industry.csv";
const string outputFile2 = "/media/vboxshared/sas_exports/Herf_firmlevel.csv";

struct segment 
  string gvkey;
  int    NAICS;
  float  sales;
  int    indYear;

void loadData();
void processData();
void processIndustry();
float computeHerf(vector vSeg);
float industrySales(vector vSeg);

vector allSegments, indSegments;

ofstream outFile, outFile2;

int main()
  // load data

  // prepare output;;

  // process data
  // close output file

  return 0;

void processData() {

    This function 'chops' the allSegments data into chuncks by
    industry-year (indYear)

    Firm-years in the same industry-year are removed from allSegments,
    and moved onto indSegments, which is processed by

  // output file headers
  outFile << "indYear,Herf\n";
  outFile2 << "indYear,gvkey,Herf\n";

  do {

    /* determine industry-year to process (fill up indSegments)
       => working back from the end of the vector is fastest
       => take last element of allSegments, move it into indSegments
       => select all following (working backwards) elements of allSegments
       as long as they ar in the same industry-year (indYear)

    // move last element from allSegments to indSegments (which is
    // empty at this point)

    // industry-year code to process
    int ind = indSegments[0].indYear;

    // find all segments that operate in the same industry-year
    int vSize = allSegments.size();
    int ii = 0;
    // working backwards is fastest (pop_back() is fast, there is no
    // pop_begin())
    while (allSegments[vSize-1-ii].indYear == ind) {


    // indSegments now holds all segments in one particular indYear 

    // clear indSegments;

  } while (allSegments.size()>0);

   return ;

void processIndustry(){

      This function loops through the indSegment, each time making a
      copy of the full vector and deleting one segment (this is the
      segment the measure is computed for)

  // compute Herfindahl for industry indSegments will be empty (only)
  // when the process is complete

  if (!indSegments[0].indYear) return;
  float Herf = computeHerf(indSegments);

   // output indYear, gvkey and Herfindahl index
  outFile << indSegments[0].indYear << "," << Herf << "\n";
  // compute Herfindahl for each firm in the industry (excluding firm
  // itself)

  // can't compute Herfindahl for other firms if firm is the only firm
  // in the industry-year
  if (indSegments.size() == 1) return;

   for(int ii=0; ii < indSegments.size(); ii++)
     // make a copy of indSegments
     vector cpySegments(indSegments) ;
     // drop the firm in question 
     // compute Herfindahl excluding the firm
     Herf = computeHerf(cpySegments);

     // output indYear, gvkey and Herfindahl index
     outFile2 << indSegments[ii].indYear << ",";
     outFile2 << indSegments[ii].gvkey << "," << Herf << "\n";

float computeHerf(vector vSeg){

     // determine total industry sales (excluding firm)
     float sumSales = industrySales(vSeg);

     // compute Herfindahl index
     float Herf = 0;
     for(int ii=0; ii < vSeg.size(); ii++)
       Herf += (vSeg[ii].sales / sumSales)*(vSeg[ii].sales / sumSales);
     return Herf;

float industrySales(vector vSeg){

   float sumSales = 0;

   for(int ii=0; ii < vSeg.size(); ii++)
     sumSales += vSeg[ii].sales;
   return sumSales;

void loadData(){

    using namespace std;

    ifstream file (inputFile.c_str());

    /* file structure
     007906,11,8776.9,1   */
    string line;

    while (getline(file, line)) {

        // cout << line << endl;

        istringstream tokenizer(line);

        string token;
	string item1;               // gvkey
	int    item2, item4;        // NAICS, indYear 
	float  item3;               // sales
	// first item gvkey (string)
        getline(tokenizer, item1, ',');

	// second item NAICS (int)
	getline(tokenizer, token, ',');
	istringstream int_iss(token);
        int_iss >> item2;
	// third item sales (float) 
        getline(tokenizer, token, ',');
	istringstream float_iss(token);
        float_iss >> item3;
	// fourth item indYear (int)
        getline(tokenizer, token, ',');
	istringstream int_iss2(token);
        int_iss2 >> item4;
	// create struct segments
        segment tmpSegm = {
	// push onto vector of segments
   return ;


Sample usage:


	Herfindahl index

	This script collects sales information at firm and segment level
	and exports it in a csv file for further processing in C++.


	- get Compustat Funda sample
	- match with wrds_segments to get operating segments
	- assume firms with no match on wrds_segmerged to be single segment firms
	- export as .csv

	- import .csv 
	- compute Herfindahl index for industry
	- compute Herfindal index for each firm-year (excluding the firm's segments)
	- export

	- import C++ export
	- match firm-year-segment to get Herfindahl index for that segment
	- aggregate (value weight average by segment sales) 

	- local access to comp.funda and segments.Wrds_segmerged

%let fyearStart = 1990;

/*	Industry variable is either NAICS or SIC; 
	uncomment only one of the two next lines */

%let industryVar = NAICS;
*%let industryVar = SIC;

/*	Create funda base dataset  (&industryVar is either NAICSH or SICH)*/

proc sql;

	create table a_funda as 
		select gvkey, fyear, datadate, sale, &industryVar.h
		from comp.funda 
			&fyearStart <= fyear 
		and indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C' 
		and sale > 0

/*	Force unique obs */

proc sort data = a_funda nodupkey; by gvkey fyear;run;

/*	Segments */

data b_bussegm (keep = GVKEY srcdate datadate sales NAICS_s SIC_s);
set segments.Wrds_segmerged;

/*	No segments from previous years */

if srcdate eq datadate;

/*	Operating segments, not geographical segments */

if stype IN ("BUSSEG", "OPSEG");

/*	No eliminations */

if SNMS ne "Eliminations";

/*	Subsample */

if year(datadate) >= &fyearStart;

/*	Numeric industries */

NAICS_s = 1 * NAICSS1;
SIC_s = 1 * SICS1;

/*	Drop segments with missing industry or missing sales */

if &industryVar._s ne "";
if sales > 0;

/*	Aggregate segments that are in the same industry */

proc sql;

	create table b_bussegm2 as
	select		gvkey, datadate, &industryVar._s, sum(sales) as sales
	from 		b_bussegm
	group by	gvkey, datadate, &industryVar._s;


/*	Join two sets*/

proc sql;

	create table c_joined as

		select a.*, b.sales, b.&industryVar._s
			a_funda a
			b_bussegm2 b
			a.gvkey = b.gvkey
		and a.datadate = b.datadate;

/*	Fill in the missing segment sales
	These are assumed to be single segment firms */

data c_joined2;
set c_joined;
if sales eq . then sales = sale;
if &industryVar._s eq . then &industryVar._s = &industryVar.h;

/*	Drop if no industry/sales known for segments, nor firm level */
if &industryVar._s ne .;
if sales ne .;

tempIndYear = &industryVar._s || fyear;

/*	For total sales, use sales figures of segments, and not firm level */

proc sql;

	create table d_sales as
	select gvkey, fyear, datadate, tempIndYear, &industryVar._s, sales as sales_segm, 
	sum(sales) as sales_firm
	from c_joined2
	group by gvkey, fyear;


/*	Create a numeric key to identify industry-years */

proc sort data = d_sales; by tempIndYear;run;

data e_exportready;
set d_sales;
retain indYear 0;
if first.tempIndYear then indYear = indYear + 1;
by tempIndYear;

data f_export (keep = gvkey naics_s sales_segm indYear);
set e_exportready;

PROC EXPORT DATA=  f_export 
            OUTFILE= "G:\shared_ubuntu\sas_exports\Herf_SAS_export.txt" 
            DBMS=CSV REPLACE;


				run C++ script at this point;


/*	Import Herfindahl indices generated by C++ script */

filename HERF_IND "G:\shared_ubuntu\sas_exports\Herf_industry.csv";

data g_herf_ind;
infile HERF_IND dsd delimiter=","  firstobs=2 LRECL=32767 missover;
input indYear herf_ind; 

filename HERF_CO "G:\shared_ubuntu\sas_exports\Herf_firmlevel.csv";

data g_herf_firm;
infile HERF_CO dsd delimiter=","  firstobs=2 LRECL=32767 missover;
length gvkey $ 6;
input indYear gvkey $ herf_firm; 

/*	Match Industry-level Herfindahl inded to base sample */

proc sql;

	create table h_withHerf1 as
	select a.*, b.herf_ind
		e_exportready a,
		g_herf_ind b
	where a.indYear = b.indYear; 

/*	Match Firm-level Herfindahl inded to base sample 
	(left join; industries with a single firm have missing value) */

proc sql;

	create table h_withHerf2 as
	select a.*, b.herf_firm
		h_withHerf1 a
		g_herf_firm b
	ON a.indYear = b.indYear
	and a.gvkey = b.gvkey; 

/*	Compute segment-sales weighted value to be used at the firm level 
	(This will only affect firms with multiple (reported) segments)

	Final dataset holds:
	- gvkey
	- fyear
	- herf_ind	(value of 1 for firms with no competitors in industry)
	- herf_firm	(missing value for firms with no competitors in industry)

	Firms with missing industry info in Funda as well as segment files
	are not included in final dataset


proc sql;

	create table i_firmlevel as
	select distinct gvkey, fyear, 
		sum(sales_segm * herf_ind / sales_firm) as herf_ind,
		sum(sales_segm * herf_firm / sales_firm) as herf_firm
	group by gvkey, fyear;


/*	proc sql will ignore missing variables (these are firms with no competitor)
	a correction is needed: if a firm has one segment with missing herf_firm,
	then the firmlevel weighted average needs to be missing as well */

/*	Select firm-segment-years with missing herf_firm */

data j_missing (keep = gvkey fyear setMissing);
set h_withHerf2;
if herf_firm eq .;
setMissing = 1;

proc sort data=j_missing nodup;by gvkey fyear;run;

proc sql;

	create table j_missing2 as 
	select a.*, b.setMissing
		i_firmlevel a
		j_missing b
		a.gvkey = b.gvkey
	and a.fyear = b.fyear;


/*	Set it to missing */

data k_finalHerf (drop = setMissing);
set j_missing2;
if setMissing eq 1 then herf_firm = .;

Other Data management
SAS Winsorize macro


Latest forum posts
Fama French 49 YES 49 !!! by CA Miller
date format by Emily
Discretionary Accruals by jwhi121
question about discretionary accrual models by kerrida
Combining Global and North American data by CA Miller
File for Eventus or SAS Event Study by samme
quarterly accruals by lxt88
Financial Statement Comparability - De Franco, Kothari and Verdi (2011, JAR) Replication by rowing
question about discretionary accrual models by Zenghui
problem in perl by KZ

All rights reserved. © 2010-2014 [Copyright] [Privacy Statement] [Disclaimer] [About]