SAS SPEDIS function
Posted: 20 September 2011 09:23 PM   [ Ignore ]
Sr. Member
RankRankRankRank
Total Posts:  169
Joined  2011-09-20

Joost,

in example 2:

data myLib2.c_verifyThese (keep = permno COMNAM firmname);

set myLib2.b_withcusip;

name_dist = spedis(lowcase(firmname), lowcase(comnam));

if name_dist >= 30;

run;

This code will drop any name_dist <=30, right?
However, the name_dist <=30 are the good matched one, we should keep them instead of drop them, am i right?

In anthor word, i think we should have
if name_dist<=30, since if here will drop any data with spedis score >30.

Thanks,
Zenghui

 Signature 

Zenghui
A humble student of business

Profile
 
 
Posted: 20 September 2011 09:49 PM   [ Ignore ]   [ # 1 ]
Sr. Member
RankRankRankRank
Total Posts:  169
Joined  2011-09-20

Joost,

a follow up question.

if i have two string,

ABC lab and ABC laboratoray

if you use spedis, the num_dist is high, but actually they are perfect match.

do you think any other function can do a better job here?

Thanks,
Zenghui

Update: if we truncate both string to the same length, then compare, this problem could be solved.
However, if i have two strings:
ABC inc and inc ABC, they are perfect match too, i guess the num_dist is high too… any idea to solve this problem?

Thanks!

 Signature 

Zenghui
A humble student of business

Profile
 
 
Posted: 21 September 2011 08:55 AM   [ Ignore ]   [ # 2 ]
Administrator
Avatar
RankRankRankRank
Total Posts:  901
Joined  2011-09-19

hi Zenghui,

A lower score indeed means a better match. In the data step below I keep the observations that are suspect to be different (c_verifyThese). In other words, I want to keep to ‘bad’ matches to manually examine them.

data myLib2.c_verifyThese (keep = permno COMNAM firmname);
set myLib2.b_withcusip;
name_dist = spedis(lowcase(firmname), lowcase(comnam));
if name_dist >= 30;run;

I do this expection in Excel. Then, I continue by dropping the 6 firms that I know do not match.

data myLib2.d_verified (drop = NAMEDT NAMEENDT);
set myLib2.b_withcusip;
if permno ne 49429;
if permno ne 80142;
if permno ne 81547;
if permno ne 85654;
if permno ne 89303;
if permno ne 90173;
run;

So, d_verified contains the observations that have a Spedis score of 30 or better, and firms that have manually verified.

Obviously, manual inspection has its limitations. When the sample gets ‘too’ large this is no longer feasible. (But then, with a large sample dropping observations is less of an issue)  Also, a company can change its name and be dropped as a result even though it is a valid match.

A tool to use for matching by firm name is ‘Data Match Tool’, see http://www.datamatchtool.com/ It is. a commercial package, but with discounts for students/universities.

best regards,

Joost

 

 Signature 

To reply/post new questions: Please use the group WRDS/SAS on Google Groups! http://groups.google.com/d/forum/wrdssas

Profile
 
 
Posted: 21 September 2011 08:38 PM   [ Ignore ]   [ # 3 ]
Sr. Member
RankRankRankRank
Total Posts:  169
Joined  2011-09-20

Joost,

I got you, that is very clear to me now.

Cheers,
Zenghui

 Signature 

Zenghui
A humble student of business

Profile
 
 
Posted: 03 March 2012 07:27 AM   [ Ignore ]   [ # 4 ]
Sr. Member
RankRankRankRank
Total Posts:  169
Joined  2011-09-20

* an example to match two database with spedis and scan function.

proc sql;
create table matched_list as
select a.*,b.*
from
work.part_list a
LEFT JOIN
work.full_company_list b
on
spedis(scan(a.companyname,1),scan(b.comn,1))<10
and
spedis(scan(a.companyname,2),scan(b.comn,2))<30;
quit;

data matched2__list;
set matched_list;
match_score = spedis (auditor,auditor_name);
run;

Pretty good matched results!

 Signature 

Zenghui
A humble student of business

Profile
 
 
   
 
 
     ROE example ››