Perl 10-K Parsing Slow
Posted: 10 April 2016 12:55 PM
Newbie
Total Posts:  16
Joined  2013-03-17

Dear All,

I am using Perl to parse all 10-Ks filed since 1996 and extract some keywords. However, the script runs slowly.

I tested the code on a small sample of files before starting this large-scale run, and it ran quickly on the sample.

Is there a way to speed things up when parsing all 10-Ks?

Thank you very very much!!!

Best Regards

Posted: 10 April 2016 04:06 PM   [ # 1 ]
Administrator
Total Posts:  901
Joined  2011-09-19

hi,

One thing that may speed things up is to run several instances of the script at the same time. Another is to use a compiled language (such as C++, Scala, or Java) instead of a scripting language.
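
One way to run several instances without two of them processing the same filing is to give each instance its own slice of the file list. The sketch below is in Python rather than Perl, and the function name `shard` and its parameters are hypothetical, not taken from the original script; the same idea carries over to Perl directly.

```python
def shard(paths, n_instances, instance_id):
    """Return the part of `paths` that one instance should process.

    Assumption: every instance is started with the same file list,
    the same n_instances, and a distinct instance_id in 0..n_instances-1.
    """
    ordered = sorted(paths)  # identical order in every instance
    # Take every n-th file, starting at this instance's offset.
    return ordered[instance_id::n_instances]
```

With four instances, instance 0 would take files 0, 4, 8, ..., instance 1 would take files 1, 5, 9, ..., and so on. Interleaving like this tends to balance the load better than giving each instance one whole year, since some years contain far more (and larger) filings than others.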

Best Regards,

Joost

 Signature 

To reply/post new questions: Please use the group WRDS/SAS on Google Groups! http://groups.google.com/d/forum/wrdssas

Posted: 10 April 2016 04:49 PM   [ # 2 ]
Newbie

Hi Joost,

Thanks very much for the suggestion! I am already running the scripts for several years at the same time, but it still does not seem fast enough. Maybe it is constrained by my computer's memory.

Best Regards

Posted: 10 April 2016 05:04 PM   [ # 3 ]
Administrator

hi,

You can see what the constraint is by monitoring the PC (for example, CPU and memory usage) while the scripts run. I would expect the CPU to be the constraint if you run several instances at the same time.
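
Besides watching the system monitor, a rough check is to time the read step and the matching step separately for a few filings. This is only a sketch in Python (not the original Perl), and `time_file` is a hypothetical helper: if reading dominates, the disk is the likely constraint; if matching dominates, it is the CPU.

```python
import re
import time


def time_file(path, pattern):
    """Return (seconds reading, seconds matching, number of matches)."""
    t0 = time.perf_counter()
    # errors="replace" keeps odd bytes in old filings from raising.
    with open(path, encoding="utf-8", errors="replace") as fh:
        text = fh.read()
    t1 = time.perf_counter()
    hits = len(pattern.findall(text))
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, hits
```

The pattern is passed in already compiled (for example `re.compile(r"goodwill", re.IGNORECASE)`, where the keyword is just a placeholder), so it is built once and reused for every file.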

How many 10-Ks are you processing, and what do you mean by ‘slow’? 1 day, 1 week, 1 month?

Best Regards,

Joost

Posted: 10 April 2016 05:12 PM   [ # 4 ]
Newbie

Hi Joost,

The CPU does seem to be under load: a little more than 50% is in use currently. I am now parsing all 10-Ks with filing years 2005-2014. The scripts have been running since 10:30 AM this morning.

So far I have results for three of the years, but after 6.5 hours there are still no results for another three years. That is the part that worries me: those three years may take a very long time to finish.

Is this normal, or am I worrying too much and should simply wait longer before checking the results?

Thank you so much!!!

Best Regards

Posted: 10 April 2016 05:15 PM   [ # 5 ]
Administrator

hi,

I usually scan only the 10-Ks of firms that are actually in my sample (i.e., firms that have Compustat, CRSP, and IBES data and that I can match to a CIK), so I end up scanning only a subset.

A day of scanning (24 hours) seems reasonable. I use the Sublime Text editor, which can open a file without ‘blocking’ it, so if you write results to a file as you go, you can open that file to see how fast the run is progressing.
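
For the results file to show progress, each row has to reach the disk as soon as it is produced rather than sitting in a buffer until the script exits. A minimal sketch in Python (not the original Perl), where `scan_with_progress`, `items`, and `count_hits` are hypothetical names:

```python
import csv


def scan_with_progress(items, count_hits, out_path):
    """Append one CSV row per filing and flush it immediately.

    items:      iterable of (filing_id, text) pairs
    count_hits: function that maps a filing's text to a number
    Returns the number of filings written.
    """
    done = 0
    with open(out_path, "a", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for filing_id, text in items:
            writer.writerow([filing_id, count_hits(text)])
            out.flush()  # push the row out now, not at exit
            done += 1
    return done
```

Opening the file in append mode also means a restarted run adds to the earlier results instead of overwriting them. In Perl, turning on autoflush for the output filehandle serves the same purpose.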

Hope this helps,

Joost

Posted: 10 April 2016 05:40 PM   [ # 6 ]
Newbie

Hi Joost,

Thanks! I should also scan only the firms that are in my sample, and that should speed things up. For this, should I build a hash table of the CIKs in my sample in the code and then scan only the filings with those CIKs, or is there a more efficient way to do it?
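
The hash-table idea can be sketched as follows, in Python rather than Perl (a Python set plays the role of a Perl hash). The sketch assumes, hypothetically, that each filing's path has the shape `edgar/data/<CIK>/<file>`; the function names are also made up for illustration.

```python
def cik_from_path(path):
    """Extract the CIK from a path shaped like 'edgar/data/<CIK>/<file>'."""
    parts = path.split("/")
    # The CIK is assumed to be the component right after 'data'.
    return parts[parts.index("data") + 1].lstrip("0")


def filter_by_cik(paths, sample_ciks):
    """Keep only the filings whose CIK is in the sample."""
    # Strip leading zeros so '0000320193' and '320193' compare equal.
    wanted = {str(c).lstrip("0") for c in sample_ciks}
    return [p for p in paths if cik_from_path(p) in wanted]
```

Each membership test against the set takes roughly constant time, so the filter itself costs very little compared with scanning the text of a filing.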

Thanks again!

Best Regards

Posted: 10 April 2016 07:04 PM   [ # 7 ]
Administrator

hi,

I usually create a file with a list of ids/filenames to scan. I then import the results (which contain the id/filename) and match them with the main sample.
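
The matching step can be sketched as a left join on the shared id. This is a Python illustration under stated assumptions, not the procedure actually used here: the rows are plain dictionaries, the key column is called `filename`, and there is at most one result row per key. All of these names are hypothetical.

```python
def match_results(sample_rows, result_rows, key="filename"):
    """Attach scan results to the sample rows that share the same key.

    Sample rows without a result are kept unchanged.
    """
    by_key = {r[key]: r for r in result_rows}
    merged = []
    for row in sample_rows:
        hit = by_key.get(row[key], {})
        merged.append({**row, **hit})  # result columns are added to the row
    return merged
```

Keeping unmatched sample rows makes it easy to spot filings that were listed for scanning but produced no output.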

Best Regards,

Joost

Posted: 10 April 2016 07:06 PM   [ # 8 ]
Newbie

Hi Joost,

Thanks much!! I will try the same procedure.

Best Regards

Posted: 10 April 2016 07:10 PM   [ # 9 ]
Administrator

Excellent! :)

Take care,

Joost
