Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: RE: match variable across two tables


From   Robert Picard <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: RE: match variable across two tables
Date   Thu, 19 Dec 2013 12:22:43 -0500

No need to talk about "fuzzy" matching as NAISC codes are defined
hierarchically. If you do not match at the 6-digit level, you can try
again using 5-digit codes, and so on.

Your first problem is to reshape Table 1 data from wide to long
format. Your "I-O number codes" are clearly not valid NAISC codes so
the target becomes creating a crosswalk between valid NAICS to "I-O
number codes".

Once you have the crosswalk, you can do an exact match using -merge-.
For all NAICS code that did not find an exact match, you can do an
update merge to find matching "I-O numbers" using 5-digit NAISC codes.
You can then repeat down to 2-digit NAICS if you want to.

Robert

* ----------------- begin example ------------------------
clear
input str6 ionumber str244 ioname str244 codelist
"1110" "Crop production"
"1111A0" "Oilseed farming" "11111 11112"
"1111B0" "Grain farming" "11113 11114 11115 11116 11119"
"111200" "Vegetable and melon farming" "1112"
"111400" "Greenhouse and nursery production" "1114"
"111910" "Tobacco farming" "11191"
"111920" "Cotton farming" "11192"
"1119A0" "Sugarcane and sugar beet" "11193 111991"
"1119B0" "All other crop farming" "11194 111992 111998"
end
compress

* split into separate codes and reshape long
split codelist, gen(naics)
reshape long naics, i(ionumber) j(code) string

* drop obs with missing codes
bysort ionumber (code): drop if mi(naics) & _n > 1
replace naics = ionumber if mi(naics)

* remove trailing zeros
replace naics = regexr(naics,"0+$","")

* save naics to ionumber crosswalk
isid naics, sort
list, noobs sepby(ionumber)
tempfile table1
save "`table1'"

clear
input str6 naics
"111"
"1111"
"111150"
"111199"
"111219"
"111310"
"111320"
"111332"
"111334"
"111335"
"111339"
"1114"
"111411"
"111419"
"111421"
"111422"
"111920"
"111930"
"111940"
"111998"
end
gen table2id = _n
replace naics = regexr(naics,"0+$","")

* do an exact match using the crosswalk
merge 1:1 naics using "`table1'", keepusing(ionumber) ///
keep(master match) nogen

* for obs that did not match, try again using 5 digits.
clonevar naics6 = naics
replace naics = substr(naics6,1,5)
merge m:1 naics using "`table1'", keepusing(ionumber) ///
update gen(merge5)
drop if merge5 == 2

* repeat for 4-digit naics
replace naics = substr(naics6,1,4)
merge m:1 naics using "`table1'", keepusing(ionumber) ///
update gen(merge4)
drop if merge4 == 2

* repeat for 3-digit naics
replace naics = substr(naics6,1,3)
merge m:1 naics using "`table1'", keepusing(ionumber) ///
update gen(merge3)
drop if merge3 == 2
* --------------------------- end example ---------------


On Wed, Dec 18, 2013 at 9:52 PM, Rongrong Zhang <[email protected]> wrote:
> Hi Sarah,
>
> Thanks so much for your questions.  Let me try to answer them in the
> order they were posted.
>
> Yes, I plan to drop trailing zeros and take all the nonzero digits as
> match criteria. In this case, you are correct in terms of - I need
> processing the data first. - should I use trim ()?
>
> your next question: the structure of data in table 1: do I have a
> single variable that has multiple codes in it. I assume you are
> asking:
>
> e.g 1111B0    Grain farming    corresponds to 5 different NAICS code
>  and they are    11113      11114      11115 11116      11119.
>
> suppose all these 5 NAICS codes are present in my Table 2, I would
> like to have 5 rows in my final output table like this:
>
> 1111B0   11113
> 1111B0   11114
> 1111B0   11115
> 1111B0   11116
> 1111B0   11119
>
> next question : the rule that make an entry a match. If I require 5 or
> 6 digit match, then these two tables may not produce many matches.
> that is why I thought of 4 digit matches. Ideally I would like to do
> both exact and "fuzzy" match e.g. using 4 digit, so I have the
> flexibility to control my sample size.
>
> If you or others have questions or suggestions, please let me know.
>
> thanks,
>
> On Wed, Dec 18, 2013 at 3:05 PM, Sarah Edgington <[email protected]> wrote:
>> Rochelle,
>> This looks like it may be a pretty complicated problem.  I don't immediately have any suggestions because I'm not sure I understand either the exact structure of your data or the matching rules you want to follow.
>>
>> You say that if you use exact matching that you want I-O number 1111B0 to match with NAICS code 111150.  I take it that is an "exact match" because you want to drop the trailing zero in the NAICS code.  So, since 11115 appears in the list of NAICS codes for 1111B0, it would match to 111150 in table 2.  This is not to my mind an "exact match" because it requires first modifying the NAICS code in table 2 before you can match.  To do that successfully you need to be very clear about what the rule for modification is.
>> Is the rule that if the NAICS code in table 2 has a zero at the end you always drop it?  Does it matter how many digits appear before the zero?
>>
>> The next question I have is about the structure of table 1 as it appears in Stata.  Do you have a single variable that has multiple codes in it?  If so, you're probably going to have to do some additional processing to that variable before trying to match the two tables.
>>
>> The final thing I was unclear on was what you want the final structure of your data to be after matching.  How do you want to deal with entries in table 1 that have multiple matches in table 2?  Do you want the resulting data to contain multiple observations, one for each of the NAICS codes that the I-O number matches to?
>>
>> Again for the four digit match, you'll want to be very clear on the rules that make an entry a match.  I'm not sure if you're asking for a match of the first four digits of the NAICS code in table 1 to only the codes in table 2 that are four digits long.  Alternatively perhaps you're looking to match observation in table 1 to ALL the entries in table 2 that share the same first four digits.
>>
>> If you can more precisely describe the structure of your data as it currently exists, the matching rules you want to follow, and the structure you want your final data to be in, you'll increase your chances of getting a helpful answer from the list.
>>
>> -S
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Rongrong Zhang
>> Sent: Wednesday, December 18, 2013 11:15 AM
>> To: [email protected]
>> Subject: st: match variable across two tables
>>
>> Dear STATALISTER,
>>
>> I have two tables:
>>
>> Table 1 has 3 variables  I-O number, I-O Name ,   Related 1997 NAICS codes.
>>
>> Table 2 has 1 variable 1997 NAICS codes.
>>
>> I want to link these two tables based on NAICS code. However, the
>> level of details on NAICS code does not match one-to-one because the
>> tables come from different data source. My goal is to know which NAICS
>> code correspond to which I-O number. I can’t use Table 1 only, because
>> TABLE 2 is produced from Wharton Research Database which has company
>> level financial data – I will use later on.
>>
>> By different details I mean : e.g.
>>
>> table 1:
>>
>> I-O number  I-O Name                  1997 NAICS codes
>>
>> 1110        Crop production
>>
>> 1111A0    Oilseed farming
>>          11111      11112
>>
>> 1111B0    Grain farming             11113      11114      11115
>> 11116      11119
>>
>> 111200    Vegetable and melon farming
>>                    1112
>> 111400 Greenhouse and nursery production
>>                 1114
>> 111910 Tobacco farming
>>                 11191
>> 111920 Cotton farming
>>                 11192
>> 1119A0 Sugarcane and sugar beet
>>       11193 111991
>> 1119B0 All other crop farming
>> 11194 111992 111998
>>
>> in the above example, I present industry 1110 and its subindustries
>> 1111A0, 1111B0, 111200, each of the subindustries correspond to a few
>> (or a single) NAICS code (north america industry classification
>> system).
>>
>> table 2:
>> NAICS CODE
>> 111
>> 1111
>> 111150
>> 111199
>> 111219
>> 111310
>> 111320
>> 111332
>> 111334
>> 111335
>> 111339
>> 1114
>> 111411
>> 111419
>> 111421
>> 111422
>> 111920
>> 111930
>> 111940
>> 111998
>>
>> if I enforce exact match, then table 2  111150 matches with table 1
>> 1111B0,    table 2 1112l9 may be matched with  111200 table 1 I-O.
>>
>> My question :
>> 1.could you give a sample code/function to do exact match? note, if
>> first 5digit match, and drop last 0 (naics), we consider that a match
>> 2. if I want to increase match, how could I change the program to do 4
>> digit match
>>
>> thanks a bunch,
>>
>> --
>> Best,
>> Rochelle
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> -Best,
> R
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index