Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: AW: st: RE: Merging by string variables


From   Eric Booth <[email protected]>
To   [email protected]
Subject   Re: AW: st: RE: Merging by string variables
Date   Tue, 1 Dec 2009 10:21:47 -0600

>
>

Thanks for the tip Martin.  
I had someone email me off-list about the -reclink- line in the example I created not working.  The problem was that I forgot to consider line wrapping in the email for this part of the code:

reclink var1 using "datasetusing.dta", idmaster(idmaster) idusing(idusing)
gen(_match) minscore(.75)

	 should be one line.  
	So, you could unwrap it in your do-file editor, or replace it with:

reclink var1 using "datasetusing.dta", idmaster(idmaster) /*  
*/  idusing(idusing) gen(_match) minscore(.75)

  	 to get it to work.  


~ Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754



On Dec 1, 2009, at 9:52 AM, Martin Weiss wrote:

> 
> <> 
> 
> 
> 
> *************
> replace var1 = ltrim(var1)
> replace var1 = rtrim(var1)
> *************
> 
> 
> could be telescoped into - replace var1 = trim(var1)-
> 
> 
> HTH
> Martin
> 
> -----Ursprüngliche Nachricht-----
> Von: [email protected]
> [mailto:[email protected]] Im Auftrag von Eric Booth
> Gesendet: Dienstag, 1. Dezember 2009 16:53
> An: [email protected]
> Betreff: Re: st: RE: Merging by string variables
> 
>> 
>> 
> 
> Also, I should add a note of warning that -reclink- may help with some
> approximate matching, but you really need to do some clean-up of the string
> variables, as Nick suggests.
> 
>> From the short snippet of your dataset, it looks like the strings you are
> using to merge are similar in the first word(s) and the first word(s) of
> each matched pair are distinct from other matched pairs.  So, you might be
> able to get -reclink- to save you some work if you first truncate the
> strings to the first 10 or so characters, trim spaces from your strings,
> and change all your characters to lowercase.  Depending on what the rest of
> your dataset looks like, this may not get you too close to 100% matching,
> but at least it might help flag the strings that you need clean up by hand:
> 
> 
> *****
> clear
> cap ssc install reclink
> **
> 
> input str44 var1     var2                                               
> "68th precinct youth council inc   "    1                      
> "action center for education and community development, inc "  2
> "amistad child day care and family center inc"      3          
> end
> 
> replace var1 = lower(var1)
> replace var1 = ltrim(var1)
> replace var1 = rtrim(var1)
> replace var1 = substr(var1, 1, 10)
> sort var1
> gen idusing = _n
> save "datasetusing.dta", replace
> 
> 
> clear
> input str44 var1 var3
> " 68th precinct youth council, inc."  3
> "action center for education and community development-ps 106"  5
> " amistad early childhood educational center inc"  5
> end
> 
> replace var1 = lower(var1)
> replace var1 = ltrim(var1)
> replace var1 = rtrim(var1)
> replace var1 = substr(var1, 1, 10)
> sort var1
> gen idmaster = _n
> save "datasetmaster.dta", replace
> 
> //merge with reclink//
> reclink var1 using "datasetusing.dta", idmaster(idmaster) idusing(idusing)
> gen(_match) minscore(.75)
> li var1 Uvar1 _match
> 
> *******
> 
> 
> Eric
> 
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> [email protected]
> Office: +979.845.6754
> 
> 
> 
> On Dec 1, 2009, at 9:15 AM, Nick Cox wrote:
> 
>> There's no easy and failsafe solution here. 
>> 
>> -merge- doesn't know about meanings or approximate matches. It's entirely
> literal. 
>> 
>> I can think of two strategies. 
>> 
>> 1. You just need to work on one or indeed both datasets to produce
> variables that will merge. There's detailed advice within 
>> 
>> SJ-8-3  dm0039  . . .  Stata tip 64: Cleaning up user-entered string
> variables
>>       . . . . . . . . . . . . . . . . . . . . . . . .  J. Herrin and E.
> Poen
>>       Q3/08   SJ 8(3):444--445                                 (no
> commands)
>>       tip on how to clean up user-entered string variables
>> 
>> 2. You could try soundex or similar tricks. Your example doesn't look
> encouraging for that strategy. 
>> 
>> Nick 
>> [email protected] 
>> 
>> Meryle Weinstein, Ph.D.
>> 
>> I have two datasets that Im trying to merge by the following string
>> variables:  agencyname sitename siteaddress.  There are slight differences
>> in the datasets, particularly in the agencyname and sitename variables so
>> I'm having trouble merging the two datasets. The problem seems to be that
>> the agencyname differs slightly in each of the datasets.  For example 
>> 
>> Dataset2                                                     dataset2
>> 68th precinct youth council inc                              68th precinct
>> youth council, inc.
>> action center for education and community development, inc   action center
>> for education and community development-ps 106
>> amistad child day care and family center inc                 amistad early
>> childhood educational center inc
>> 
>> Any suggestions on a way to merge by these three variables would be
>> appreciated.
>> 



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index