Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Clyde Schechter" <clyde.schechter@einstein.yu.edu> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: Re: st: encode results in false match - merge/joinby |
Date | Fri, 11 Feb 2011 09:04:45 -0800 |
"I wonder if this strange behavior of encoded variables is limited only to 'join' or could it be an issue also in other contexts (?)."

The primary question about merge/join has been answered by others. The general observation that -encode- produces a numeric variable based on the levels of the string variable observed in the data set, labeled to look like the original string variable, leads to the following conclusion: using -encode- on the same variable in multiple data sets that will later be combined (by any operation, e.g. -append-) is dangerous.

Having been bitten by this many times, I have now developed some precautionary data management practices.

1. There are certain types of variables that recur frequently in my work. For many of these I have developed a standard encoding that I always use. The code to create these standard value labels is immortalized in some do-files that I routinely either -do-, -run- or -include- in my data-set creation do-files. (I've even thought of including them in my profile.do, but decided that was a bit much.) These value labels cover all the possible values these variables can take. Whenever I -encode- one of these variables, I always explicitly use the label() option with these labels.

2. In large projects that will involve multiple data sets with overlapping variables not part of my "standard" list, whenever I use -encode-, I routinely follow that up with a -label save- to immortalize that particular encoding. In later work with the same variable in other data sets, before I -encode-, I -do-, -run-, or -include- the corresponding labeling do-file, and then use the explicit label() option in the -encode- command. If -encode- finds new levels of the variable not already in the label, it adds them to the label. And I follow that up using -label save, replace- again so my labeler do-file remains up-to-date.

3. With regard to #2, so that I do not rely on my memory as to whether I have previously developed a labeling for a variable, my practice for these non-routine variables is to give the value label the same name as the variable, and name the labeler do-file varname_label.do. Then, when I want to -encode- such a variable, I precede the -encode- with -capture run varname_label.do-. (In fact, I have a little .ado file that is a wrapper for -encode- that handles all this for me.)

While these practices seem cumbersome, and can lead to a project directory being a bit cluttered with little do-files that just generate labels, adherence to them has saved me from some pretty nasty analysis errors that are hard to root out otherwise.

Clyde Schechter, MA MD
Associate Professor of Family & Social Medicine

Please note new e-mail address: clyde.schechter@einstein.yu.edu
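
The workflow in points 2 and 3 above might look roughly like the following do-file. This is only a sketch under stated assumptions, not the wrapper .ado mentioned in the post: the variable name site, the generated variable site_n, the file names part1.dta and site_label.do, and the data values are all made up for illustration, and the files are written to the current working directory.

```stata
* Sketch only: variable names, file names and data are hypothetical.
clear all

* --- First data set: levels "Boston" and "Chicago" ---
input str10 site
"Boston"
"Chicago"
end

* Load a previously saved labeling if one exists; -capture- hides the
* error when site_label.do has not been created yet.
capture run site_label.do

* Encode against the value label named after the variable.
encode site, generate(site_n) label(site)

* Immortalize (or update) the encoding for later data sets.
label save site using site_label.do, replace
save part1, replace

* --- Second data set: levels "Chicago" and "Denver" ---
* -clear- also drops value labels, so the labeling must be reloaded.
clear
input str10 site
"Chicago"
"Denver"
end

capture run site_label.do
encode site, generate(site_n) label(site)   // new levels are added to the label
label save site using site_label.do, replace

* Combine and inspect: each string should map to a single numeric code.
append using part1
tabulate site site_n
```

Without the -capture run- step, the second -encode- would number its levels from scratch, so "Chicago" would get a different code in each file and the combined site_n would be inconsistent even though the value labels look right within each file.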