Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Eric Booth <ebooth@ppri.tamu.edu>
To: "<statalist@hsphsun2.harvard.edu>" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: encode results in false match - merge/joinby
Date: Thu, 10 Feb 2011 21:38:30 +0000
<>

BTW, I like using Ben Jann's -fre- (from SSC) to examine values and value labels together. Try:

*****
cap which fre
if _rc ssc install fre, replace
fre code1
*****

- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
Office: +979.845.6754

On Feb 10, 2011, at 3:32 PM, Eric Booth wrote:

> <>
>
> On Feb 10, 2011, at 3:07 PM, joe j wrote:
>> I wonder if this strange behavior of encoded variables
>> is limited only to 'join' or could it be an issue also in other
>> contexts (?). Thanks for any pointers.
>
> This is expected behavior. -encode- is creating a numeric version of your string variable with value labels equivalent to the strings in the oldvar.
> Your -joinby- results are unexpected (at least to you, not to Stata) only because you are looking at the value labels, not the values, and -merge-/-joinby-/etc. use the values, not the value labels, to combine data.
>
> When you encode a string variable, Stata assigns integer values starting at 1 to the distinct strings, in alphabetical order (unless you use -encode-'s label() option to change this). Because this is done separately in each dataset, the value 1 in one file and the value 1 in the other can stand for different strings.
> Take a look at the values underlying the labels for your code1 variable by typing:
>
> ta code1
> ta code1, nol
> *or*
> browse, nolabel
>
>
> See -help encode- for more detail on what -encode- is doing to your string variables.
>
> -Eric
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> ebooth@ppri.tamu.edu
> Office: +979.845.6754
>
>
> On Feb 10, 2011, at 3:07 PM, joe j wrote:
>
>> I just wanted to highlight something I encountered while merging two
>> data sets with encoded merge variables. The two tables in reality are
>> a perfect non-match. This is also the case when I use the matching
>> variable 'code' in the string format. But if I encode them and
>> generate a variable 'code1' and use that for merging there is a
>> perfect match. (Now, I don't remember why I encoded this
>> variable; there must have been a reason, but that was definitely not
>> aimed at merge.)
>>
>> Below is an example with two files being joined with string variable
>> 'code' and encoded variable 'code1'--the latter results in a false
>> perfect match. I wonder if this strange behavior of encoded variables
>> is limited only to 'join' or could it be an issue also in other
>> contexts (?). Thanks for any pointers.
>>
>> clear
>> input id str5 code
>> 1 "123J5"
>> 2 "68741"
>> 3 "297J5"
>> 4 "14856"
>> 5 "AB234"
>> 6 "25K45"
>> 7 "12535"
>> end
>> encode code, gen(code1)
>> sort code1
>> save file1.dta, replace
>>
>> clear
>> input id str5 code
>> 1 "243J5"
>> 2 "68348"
>> 3 "479H5"
>> 4 "467G5"
>> 5 "23TUB"
>> 6 "TU501"
>> 7 "32LK8"
>> end
>> encode code, gen(code1)
>>
>> joinby code1 using file1.dta, unmatched(both) /*perfect match*/
>> *joinby code using file1.dta, unmatched(both) /*perfect non-match*/
>>
>> ta _m

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
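
To make the values-versus-labels point in this thread concrete, here is a minimal sketch, not taken from the original posts. The three-row datasets and the tempfile name `first` are made up for illustration, and the code is meant to be run as a do-file. It lists the integers that -encode- assigns in each file, then joins on the string identifier itself rather than on the encoded copy.

```stata
* Sketch only: illustrative data, hypothetical tempfile name.
clear
input str5 code
"123J5"
"68741"
"AB234"
end
encode code, gen(code1)
* -encode- numbers the distinct strings 1, 2, 3 in alphabetical order
list code code1, nolabel
drop code1
tempfile first
save `first'

clear
input str5 code
"243J5"
"68348"
"TU501"
end
encode code, gen(code1)
* again 1, 2, 3, although none of these strings occur in the first file
list code code1, nolabel
drop code1

* join on the string identifier, so only identical strings can match
joinby code using `first', unmatched(both)
tab _merge
```

Since the two files share no strings, no observation should come out flagged as matched (_merge == 3). Joining on the two separately encoded code1 variables instead would pair up rows that merely share the same integer.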