Thank you for your reply, Joost! My problem is that I don't have SAS; I just want to download some filings with Perl. What is the format of this c_10K_list.txt file? Could you give me some example records or an example file?
I'm completely new to Perl, but using the Perl code to download 10-K filings is fantastic. I just have one question: why is the downloaded 10-K text file larger than the file size stated on the SEC website? For example, one company's 10-K "Complete Submission File" is 166,543 bytes on the SEC website, but after Perl downloads the same file, it shows as 169,432 bytes on my computer. Almost every downloaded file is larger than the stated size. I wonder what I did incorrectly? Any help would be greatly appreciated.
That is curious… I must admit that I have actually never verified the size.
On Windows, the file properties show two sizes (actual size and size on disk); maybe this explains the difference. Another possibility (just a guess) is line endings: if the file is written in text mode on Windows, each newline (LF) becomes a carriage return plus line feed (CRLF), adding one byte per line, which would make every downloaded file slightly larger. If neither explains it, I would suggest manually downloading a few files through the SEC's website and comparing them with the Perl-downloaded versions. There is software that can compare two versions of a file (Microsoft Word, for example).
I am a complete newbie when it comes to SAS and Perl (and programming in general), but I've followed your wonderful code and it's led me a LONG way (http://www.wrds.us/index.php/tutorial/view/26). I am now at the stage where Perl has successfully downloaded all the 10-Ks I requested in the SAS code. However, Perl is downloading the files into the temp folder with the file names 1, 2, 3, 4, 5, etc. Because I am downloading ~5000 10-Ks and using WordStat to content-analyze each of them, I need the file names to be more descriptive so I know which words appear in each respective 10-K (i.e., 1, 2, or 3 means nothing to me). What would be ideal, I suppose, is if the file names included the company's CIK code and the year of the 10-K. For example, "0000912057-94-94.text" for a 1994 10-K from company 0000912057-94.
Some lines from my SAS text file that PERL is extracting from include:
I've used the exact Perl code included in the link above. I'm not sure whether I can tweak that code to assign a more descriptive name to each file automatically when it is downloaded.
Again, thank you for your tremendous help! The effort you exert to help the rest of us through this (painful) process is very much appreciated! Like I said, I’m a newb, so this is great!
About the naming of files, you can indeed have any filename that you would like.
In this line the ‘filename’ is set:
$filename = $write_dir . "/" . $CIK . ".txt";
The name $CIK is a bit misleading, because it actually holds the simple sequence number (1, 2, 3, etc.). If you would like the more 'official' name (like 0000912057-94-000771.txt), you can find the last occurrence of "/" in $get_file (which holds something like "edgar/data/1800/0000912057-94-000771.txt") and take the remainder, in this example "0000912057-94-000771.txt".
You can do that like this: first find the last occurrence of "/" (using rindex), then take the substring starting one position after it.
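A minimal sketch of that rindex/substr step (the variable names $get_file, $write_dir, and $lastSlash are taken from the thread; the example path is the one mentioned above):

```perl
use strict;
use warnings;

my $write_dir = "temp";   # hypothetical download directory
my $get_file  = "edgar/data/1800/0000912057-94-000771.txt";

my $lastSlash = rindex( $get_file, "/" );             # position of the last '/'
my $name      = substr( $get_file, $lastSlash + 1 );  # everything after it
my $filename  = $write_dir . "/" . $name;
# $filename is now "temp/0000912057-94-000771.txt"
```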
It may not help that much, I think. If you have the SAS dataset that holds the ‘1, 2, 3, etc’ as well as the longer urls, then you could add the Wordstat score, like this:
1. start with the SAS dataset that you already have
2. download the files (keep them 1.txt, 2.txt, etc)
3. do your Wordstat thing and generate a dataset based on 1.txt, 2.txt
4. import your Wordstat dataset into SAS
5. merge with the starting dataset (based on 1, 2, ..)
Put differently, a ‘key’ like ‘0000912057-94-000771.txt’ may be harder to manage than a numerical key.
Hope this helps; good luck with your Perl/Wordstat adventure
The code you provided didn't work, and I'm not sure why: the command window simply pops up and then goes away immediately when I try to execute the file. Like I said in my previous post, I'm new to programming, so I don't know where to begin to deduce the problem.
Nonetheless, you provided a good alternative reason to keep the file names as simple integers; I may end up doing just this as I can see how I would use SAS to match the file names with companies’ 10-Ks. Thank you for these additional thoughts!
Again, I want to thank you for providing a quick reply to my question. You have no idea how much time this forum and your previous code saved me! This was a wonderful resource!
$lastSlash gives the position of the last ‘/’, and you want to get the filename, which is 1 position later.
In general, scanning a few files and printing things to the screen would help find bugs. For example, printing $filename to the screen would help to see if that variable makes sense or not. The ‘extra’ ‘/’ would then probably show up, etc.
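A tiny sketch of that debugging approach (the values here are made up for illustration; note the deliberately wrong trailing slash in $write_dir):

```perl
use strict;
use warnings;

my $write_dir = "temp/";    # oops: trailing slash
my $name      = "0000912057-94-000771.txt";
my $filename  = $write_dir . "/" . $name;

# Printing the variable makes the doubled '/' visible right away:
print "filename: $filename\n";   # prints "filename: temp//0000912057-94-000771.txt"
```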
Once again, thank you! The code really helps. However, as it turns out, I think I'm going to leave the file names as numbers, as you made a good point in your previous post about being able to match the files anyway, and about the ease of having a simpler file name. I do appreciate this, though!
Now that I have the 4000 10-Ks, I'm questioning my decision to use WordStat for the content analysis, given the time it will take me to feed all the files through that program (and then having to go back and do it again if I miss a word in the initial analysis). Do you have any thoughts on using Perl for content analysis? Do you have template code to get started? If not, then no worries at all; I just thought I'd ask, as you're clearly FAR beyond my level when it comes to this stuff!
I indeed do have some perl scripts for matching, but I would think a bit of googling would get you higher quality scripts.
Did you try WordStat? I would think scanning 4,000 10-Ks wouldn't take too much time, and in any case I doubt Perl would be much faster. Perl code is rather hard to write, especially pattern-matching code, which you would have to write yourself (WordStat may help you with that, or have some sort of interface to specify what to match on). One thing to realize, though, is that many 10-K filings (especially recent ones) are 95% HTML markup, which you probably need to remove. (Can WordStat do that for you?)
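For what it's worth, here is a naive sketch of that cleanup-and-count step in Perl. The bare regex tag-stripper is crude (it mishandles comments, scripts, and malformed tags); a CPAN module such as HTML::Strip would be more robust, but the idea is the same:

```perl
use strict;
use warnings;

# Hypothetical snippet of a filing, mostly markup:
my $text = "<html><body><b>Net income</b> rose in 2004.</body></html>";

$text =~ s/<[^>]*>/ /g;   # drop anything that looks like an HTML tag
$text =~ s/\s+/ /g;       # collapse the leftover whitespace

# Count occurrences of a word of interest, case-insensitively:
my $hits = () = $text =~ /\bincome\b/gi;
print "hits: $hits\n";    # prints "hits: 1"
```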