>
>
Thanks for the tip Martin.
I had someone email me off-list about the -reclink- line in the example I created not working. The problem was that I forgot to consider line wrapping in the email for this part of the code:
reclink var1 using "datasetusing.dta", idmaster(idmaster) idusing(idusing)
gen(_match) minscore(.75)
should be one line.
So, you could unwrap it in your do-file editor, or replace it with:
reclink var1 using "datasetusing.dta", idmaster(idmaster) /*
*/ idusing(idusing) gen(_match) minscore(.75)
to get it to work.
~ Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754
On Dec 1, 2009, at 9:52 AM, Martin Weiss wrote:
>
> <>
>
>
>
> *************
> replace var1 = ltrim(var1)
> replace var1 = rtrim(var1)
> *************
>
>
> could be telescoped into - replace var1 = trim(var1)-
>
>
> HTH
> Martin
>
> -----Ursprüngliche Nachricht-----
> Von: [email protected]
> [mailto:[email protected]] Im Auftrag von Eric Booth
> Gesendet: Dienstag, 1. Dezember 2009 16:53
> An: [email protected]
> Betreff: Re: st: RE: Merging by string variables
>
>>
>>
>
> Also, I should add a note of warning that -reclink- may help with some
> approximate matching, but you really need to do some clean-up of the string
> variables, as Nick suggests.
>
>> From the short snippet of your dataset, it looks like the strings you are
> using to merge are similar in the first word(s) and the first word(s) of
> each matched pair are distinct from other matched pairs. So, you might be
> able to get -reclink- to save you some work if you first truncate the
> strings to the first 10 or so characters, trim spaces from your strings,
> and change all your characters to lowercase. Depending on what the rest of
> your dataset looks like, this may not get you too close to 100% matching,
> but at least it might help flag the strings that you need clean up by hand:
>
>
> *****
> clear
> cap ssc install reclink
> **
>
> input str44 var1 var2
> "68th precinct youth council inc " 1
> "action center for education and community development, inc " 2
> "amistad child day care and family center inc" 3
> end
>
> replace var1 = lower(var1)
> replace var1 = ltrim(var1)
> replace var1 = rtrim(var1)
> replace var1 = substr(var1, 1, 10)
> sort var1
> gen idusing = _n
> save "datasetusing.dta", replace
>
>
> clear
> input str44 var1 var3
> " 68th precinct youth council, inc." 3
> "action center for education and community development-ps 106" 5
> " amistad early childhood educational center inc" 5
> end
>
> replace var1 = lower(var1)
> replace var1 = ltrim(var1)
> replace var1 = rtrim(var1)
> replace var1 = substr(var1, 1, 10)
> sort var1
> gen idmaster = _n
> save "datasetmaster.dta", replace
>
> //merge with reclink//
> reclink var1 using "datasetusing.dta", idmaster(idmaster) idusing(idusing)
> gen(_match) minscore(.75)
> li var1 Uvar1 _match
>
> *******
>
>
> Eric
>
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> [email protected]
> Office: +979.845.6754
>
>
>
> On Dec 1, 2009, at 9:15 AM, Nick Cox wrote:
>
>> There's no easy and failsafe solution here.
>>
>> -merge- doesn't know about meanings or approximate matches. It's entirely
> literal.
>>
>> I can think of two strategies.
>>
>> 1. You just need to work on one or indeed both datasets to produce
> variables that will merge. There's detailed advice within
>>
>> SJ-8-3 dm0039 . . . Stata tip 64: Cleaning up user-entered string
> variables
>> . . . . . . . . . . . . . . . . . . . . . . . . J. Herrin and E.
> Poen
>> Q3/08 SJ 8(3):444--445 (no
> commands)
>> tip on how to clean up user-entered string variables
>>
>> 2. You could try soundex or similar tricks. Your example doesn't look
> encouraging for that strategy.
>>
>> Nick
>> [email protected]
>>
>> Meryle Weinstein, Ph.D.
>>
>> I have two datasets that Im trying to merge by the following string
>> variables: agencyname sitename siteaddress. There are slight differences
>> in the datasets, particularly in the agencyname and sitename variables so
>> I'm having trouble merging the two datasets. The problem seems to be that
>> the agencyname differs slightly in each of the datasets. For example
>>
>> Dataset2 dataset2
>> 68th precinct youth council inc 68th precinct
>> youth council, inc.
>> action center for education and community development, inc action center
>> for education and community development-ps 106
>> amistad child day care and family center inc amistad early
>> childhood educational center inc
>>
>> Any suggestions on a way to merge by these three variables would be
>> appreciated.
>>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/