SEC Filings on EDGAR SAS File
Posted: 19 March 2013 11:47 PM   [ # 31 ]
Newbie
Total Posts:  3
Joined  2013-03-19

I was trying to use download.pl to download some filings. However, the script needs c_10K_list.txt as input. Where can I find such a file?

Thanks!

Posted: 20 March 2013 10:04 AM   [ # 32 ]
Administrator
Total Posts:  901
Joined  2011-09-19

hi nwch,

See this tutorial, which includes download.pl as well as the SAS script that creates c_10K_list.txt: http://www.wrds.us/index.php/tutorial/view/26

best regards,

Joost

 Signature 

To reply/post new questions: Please use the group WRDS/SAS on Google Groups! http://groups.google.com/d/forum/wrdssas

Posted: 20 March 2013 05:20 PM   [ # 33 ]
Newbie
Total Posts:  3
Joined  2013-03-19

Thank you for your reply, Joost! My problem is that I don’t have SAS. I just want to download some filings through Perl. What is the format of this c_10K_list.txt file? Could you give me some example records or an example file?

Thanks again!

Posted: 20 March 2013 06:42 PM   [ # 34 ]
Administrator
Total Posts:  901
Joined  2011-09-19

hi nwch,

Like this:

downloadId,url,blank
1,edgar/data/1750/0000912057-00-039006.txt,0
2,edgar/data/1750/0000912057-01-530303.txt,0
3,edgar/data/1750/0000912057-02-033450.txt,0
4,edgar/data/1750/0000912057-94-002818.txt,0
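If you don’t have SAS, a short Perl sketch like this can generate a file in that format from a list of filing paths (the paths below are just the example rows above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# filing paths to download (example values)
my @urls = (
    'edgar/data/1750/0000912057-00-039006.txt',
    'edgar/data/1750/0000912057-01-530303.txt',
);

open( my $out, '>', 'c_10K_list.txt' ) or die "cannot write c_10K_list.txt: $!";
print $out "downloadId,url,blank\n";
my $id = 0;
print $out ++$id . ",$_,0\n" for @urls;    # number the rows 1, 2, ...
close($out);
```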


best,

Joost


Posted: 20 March 2013 06:42 PM   [ # 35 ]
Administrator
Total Posts:  901
Joined  2011-09-19

Yes, there is no ‘easy way’. This may be a good starting point though:

#!/usr/bin/perl
use LWP;
use HTML::StripScripts;
use HTML::Restrict;

$dirIn  = "P:/research/projects/19_someproject/perl/10ks/";
$dirOut = "P:/research/projects/19_someproject/perl/scan_out/";

opendir(DIR, $dirIn);

foreach my $file (readdir(DIR)) {

    $i++;
    # uncomment next line for debugging
    # if ($i == 6) { last };

    if ($file =~ m/txt/) {

        # slurp the whole file at once
        local ( $/, *FH );
        open( FH, $dirIn . $file ) or die "fatal error reading $file\n";
        $filing_raw = <FH>;

        # remove html tags
        my $hr = HTML::Restrict->new();

        # your magic goes here...

        # write output to file
        open( MYFILE, '>' . $dirOut . "_score" . $file );

        # print your magic
        close(MYFILE);
    }
}
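For completeness: if HTML::Restrict isn’t installed, a crude regex fallback could look like this (illustrative only; regexes handle malformed HTML poorly, so a real parser is safer):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# naive tag stripper; fine for a rough word count, not a real parser
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//gs;    # drop anything between < and >
    $html =~ s/&nbsp;/ /g;     # replace a common entity with a space
    return $html;
}

print strip_tags('<p>net income was <b>higher</b></p>'), "\n";
# prints: net income was higher
```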

best regards,

Joost


Posted: 20 March 2013 07:48 PM   [ # 36 ]
Newbie
Total Posts:  3
Joined  2013-03-19

Thanks a lot Joost!

Posted: 01 April 2013 09:22 PM   [ # 37 ]
Newbie
Total Posts:  1
Joined  2013-04-01

Hi Joost:

I’m completely new to Perl, but using the Perl code to download 10-K filings is fantastic. I just have one question: why is the size of the downloaded 10-K text file larger than the stated file size on the SEC website? For example, a company’s 10-K “Complete Submission File” is 166,543 bytes on the SEC website, but after Perl downloads the same file, it shows 169,432 bytes on my computer. Almost every downloaded file is larger than the stated size. I wonder what I did incorrectly? Any help would be greatly appreciated.


Posted: 03 April 2013 08:10 PM   [ # 38 ]
Administrator
Total Posts:  901
Joined  2011-09-19

hi newnewbie,

That is curious… I must admit that I have actually never verified the size.

On Windows, the file properties show two sizes (actual size and size on disk); maybe this explains the difference. If it doesn’t, I would suggest manually downloading a few files through the SEC’s website and comparing them with the Perl-downloaded versions. There is software that can compare two versions of a file (Microsoft Word, for example).
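One plausible cause (an assumption here, not verified against these exact files): line-ending conversion. EDGAR text files use LF endings, and if the file is written on Windows in text mode, each LF becomes CRLF, adding one byte per line. A quick way to check is to count the \r bytes in the downloaded file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sample data with Windows CRLF line endings
my $data = "line one\r\nline two\r\n";

# count \r bytes; this is the size overhead versus an LF-only original
my $extra = () = $data =~ /\r/g;
print "extra bytes from CRLF: $extra\n";
# prints: extra bytes from CRLF: 2
```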

Hope this helps,

Joost


 
Posted: 11 January 2014 10:05 AM   [ # 40 ]
Newbie
Total Posts:  9
Joined  2014-01-11

Greetings Joost!

I am a complete newbie when it comes to SAS and Perl (and programming in general), but I’ve followed your wonderful code and it’s led me a LONG way (http://www.wrds.us/index.php/tutorial/view/26). I am now at the stage where Perl has successfully downloaded all the 10-Ks I requested in the SAS code. However, Perl is downloading the files into the temp folder with the file names 1, 2, 3, 4, 5, etc. Because I am downloading ~5,000 10-Ks and using WordStat to content-analyze each of these, I need the file names to be more descriptive so I know which words are appearing in each respective 10-K (i.e., 1, 2, or 3 means nothing to me). What would be ideal, I suppose, is if the file names included the company’s CIK code and the year of the 10-K. For example, “0000912057-94-94.text” for a 1994 10-K from company 0000912057-94.

Some lines from my SAS text file that PERL is extracting from include:

downloadId,url,blank
1,edgar/data/1800/0000912057-94-000771.txt,0
2,edgar/data/1800/0000912057-95-001314.txt,0
3,edgar/data/1800/0000912057-96-004299.txt,0

I’ve used the exact PERL code included in the link above in this post. Not sure if I can tweak that code to assign a more descriptive name to each file automatically when it is downloaded.

Again, thank you for your tremendous help! The effort you exert to help the rest of us through this (painful) process is very much appreciated! Like I said, I’m a newb, so this is great!

Posted: 11 January 2014 08:24 PM   [ # 41 ]
Administrator
Total Posts:  901
Joined  2011-09-19

hi mark,

I am glad to hear this website is helpful.

About the naming of files, you can indeed have any filename that you would like.

In this line the ‘filename’ is set:

$filename = $write_dir . "/" . $CIK . ".txt";

The $CIK is a bit misleading, because this is the simple number (1, 2, 3, etc.). If you would like the more ‘official’ name (like 0000912057-94-000771.txt), you can find the last occurrence of “/” in $get_file (which will hold strings like “edgar/data/1800/0000912057-94-000771.txt”) and take the remainder, in this example “0000912057-94-000771.txt”.

You can do that like this: first find the last occurrence (using rindex), then take the substring from that position:

$lastSlash = rindex( $get_file, '/' );
$longname  = substr( $get_file, $lastSlash );
$filename  = $write_dir . "/" . $longname;

(code needs testing though)

It may not help that much, I think. If you have the SAS dataset that holds the ‘1, 2, 3, etc’ as well as the longer urls, then you could add the Wordstat score, like this:
1. start with the SAS dataset that you already have
2. download the files (keep them 1.txt, 2.txt, etc)
3. do your Wordstat thing and generate a dataset based on 1.txt, 2.txt
4. import your Wordstat dataset into SAS
5. merge with the starting dataset (based on 1, 2, ..)

Put differently, a ‘key’ like ‘0000912057-94-000771.txt’ may be harder to manage than a numerical key.

Hope this helps; good luck with your Perl/Wordstat adventure :)

Joost

 


Posted: 13 January 2014 08:00 PM   [ # 42 ]
Newbie
Total Posts:  9
Joined  2014-01-11

Hi Joost,

The code you provided didn’t work, and I’m not sure why, as the command window simply pops up and then goes away immediately when I try to execute the file. Like I said in my previous post, I’m new to programming, so I don’t know where to begin to deduce the problem.

Nonetheless, you provided a good alternative reason to keep the file names as simple integers; I may end up doing just this as I can see how I would use SAS to match the file names with companies’ 10-Ks. Thank you for these additional thoughts!

Again, I want to thank you for providing a quick reply to my question. You have no idea how much time this forum and your previous code saved me! This was a wonderful resource!

Mark

Posted: 13 January 2014 08:20 PM   [ # 43 ]
Administrator
Total Posts:  901
Joined  2011-09-19

hi Mark,

I just realized that this code

$lastSlash = rindex( $get_file, '/' );
$longname  = substr( $get_file, $lastSlash );
$filename  = $write_dir . "/" . $longname;

probably needs a plus 1, like this:

$lastSlash = rindex( $get_file, '/' ) + 1;
$longname  = substr( $get_file, $lastSlash );
$filename  = $write_dir . "/" . $longname;

$lastSlash gives the position of the last ‘/’, and you want to get the filename, which is 1 position later.

In general, scanning a few files and printing things to the screen would help find bugs. For example, printing $filename to the screen would help to see if that variable makes sense or not. The ‘extra’ ‘/’ would then probably show up, etc.
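A quick way to sanity-check the corrected rindex/substr approach (the output directory here is just an assumed example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $get_file  = 'edgar/data/1800/0000912057-94-000771.txt';
my $write_dir = 'C:/temp';    # assumed output directory

my $lastSlash = rindex( $get_file, '/' ) + 1;    # position just past the last '/'
my $longname  = substr( $get_file, $lastSlash );
my $filename  = $write_dir . "/" . $longname;

print "$filename\n";
# prints: C:/temp/0000912057-94-000771.txt
```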

best regards,

Joost

 

 


Posted: 16 January 2014 03:06 PM   [ # 44 ]
Newbie
Total Posts:  9
Joined  2014-01-11

Hey Joost,

Once again, thank you! The code really helps. However, as it turns out, I think I’m going to leave the file names as numbers, as you made a good point in your previous post about being able to match the files anyway, plus the ease of having a simpler file name. I do appreciate it though!

Now having the 4000 10-Ks, I’m questioning my decision to use WordStat for the content analysis due to the time it will take me to feed all the files through that program (and then having to go back and do it again if I miss a word in the initial analysis). Do you have any thoughts on using PERL for content analysis? Do you have the template code necessary to get started? If not then not to worry at all; I just thought I’d ask as clearly you’re FAR beyond my level when it comes to this stuff!

Thanks!
Mark

Posted: 16 January 2014 08:05 PM   [ # 45 ]
Administrator
Total Posts:  901
Joined  2011-09-19

hi Mark,

I do indeed have some Perl scripts for matching, but I would think a bit of googling would get you higher-quality scripts.

Did you try WordStat? I would think scanning 4,000 10-Ks wouldn’t take too much time. In any case, I wouldn’t think Perl would be much faster. Perl code is rather hard to write, especially pattern matching code, which you would have to write yourself (WordStat may ‘help’ you with that, or have some sort of interface to specify what to match on). One thing to realize though is that many 10-K filings (especially recent ones) are 95% html markup, which you probably need to remove. (Can WordStat do that for you?)
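For what it’s worth, a minimal sketch of what Perl word counting might look like (the word list and sample text below are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical word list and sample filing text
my @words = qw(loss risk litigation);
my $text  = lc "Risk factors: litigation risk may lead to a loss.";

# count whole-word matches for each term
my %count;
for my $w (@words) {
    $count{$w} = () = $text =~ /\b\Q$w\E\b/g;
}
print "$_: $count{$_}\n" for sort keys %count;
# prints:
# litigation: 1
# loss: 1
# risk: 2
```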

best regards,

Joost

 

