Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: encode results in false match - merge/joinby
From: Eric Booth <[email protected]>
To: "<[email protected]>" <[email protected]>
Subject: Re: st: encode results in false match - merge/joinby
Date: Thu, 10 Feb 2011 21:32:22 +0000
On Feb 10, 2011, at 3:07 PM, joe j wrote:
> I wonder if this strange behavior of encoded variables
> is limited only to 'join' or could it be an issue also in other
> contexts (?). Thanks for any pointers.
This is expected behavior. -encode- is creating a numeric version of your string variable with value labels equivalent to the strings in the oldvar.
Your -joinby- results are unexpected (at least to you, not to Stata) only because you are looking at the value labels, not the values, and -merge-/-joinby-/etc. use the values, not the value labels, to combine data.
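When you encode a string variable, Stata sorts the distinct strings alphabetically and assigns 1 to the first, 2 to the second, and so on, separately in each dataset (unless you use -encode-'s label() option to point it at an existing value label). Each of your two files has seven distinct codes, so both end up with the values 1 through 7 no matter what the strings are, and those values match perfectly.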
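As a rough illustration, here is a minimal sketch using the first three observations from the first file in the original post. -encode- numbers the distinct strings in alphabetical order (see -help encode-), so the underlying values depend only on how each string ranks within its own dataset:

```stata
* Sketch only: first three observations of file 1 from the original post
clear
input id str5 code
1 "123J5"
2 "68741"
3 "297J5"
end
encode code, gen(code1)
* show each string next to the number stored in code1
list id code code1, nolabel
* show the value-to-label mapping that -encode- created
label list code1
```

With these three strings, "123J5" should be stored as 1, "297J5" as 2, and "68741" as 3. Any other dataset with three distinct codes would also get the values 1 to 3.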
Take a look at the values underlying the labels for your code1 variable by typing:
ta code1
ta code1, nol
*or*
browse, nolabel
See -help encode- for more detail on what -encode- is doing to your string variables.
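If the goal is simply to combine the two files correctly, one way is to match on the string variable itself. The sketch below assumes that file1.dta from the original post is in the working directory and that the second dataset, with its string variable code, is in memory:

```stata
* Sketch only: match on the strings, not on the codes -encode- assigned
* drop the encoded copy from the data in memory, if it exists
capture drop code1
joinby code using file1.dta, unmatched(both)
* with the example data, no observation should show up as matched
tabulate _merge
```

If a numeric key is really needed, another option is to encode both files against the same value label; see the label() option in -help encode-.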
-Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754
On Feb 10, 2011, at 3:07 PM, joe j wrote:
> I just wanted to highlight something I encountered while merging two
> data sets with encoded merge variables . The two tables in reality are
> a perfect non-match. This is also the case when I use the matching
> variable 'code' in the string format. But if I encode them and
> generate a variable 'code1' and use that for merging there is a
> perfect match. (Now, I don't remember why I encoded this
> variable-there must have been a reason but that was definitely not
> aimed at merge.)
>
> Below is an example with two files being joined with string variable
> 'code' and encoded variable 'code1'--the latter results in a false
> perfect match. I wonder if this strange behavior of encoded variables
> is limited only to 'join' or could it be an issue also in other
> contexts (?). Thanks for any pointers.
>
> clear
> input id str5 code
> 1 "123J5"
> 2 "68741"
> 3 "297J5"
> 4 "14856"
> 5 "AB234"
> 6 "25K45"
> 7 "12535"
> end
> encode code, gen(code1)
> sort code1
> save file1.dta, replace
>
> clear
> input id str5 code
> 1 "243J5"
> 2 "68348"
> 3 "479H5"
> 4 "467G5"
> 5 "23TUB"
> 6 "TU501"
> 7 "32LK8"
> end
> encode code, gen(code1)
>
> joinby code1 using file1.dta, unmatched(both) /*perfect match*/
> *joinby code using file1.dta, unmatched(both) /*perfect non-match*/
>
> ta _m
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/