Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Sergiy Radyakin <serjradyakin@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: merging variables with labels |
Date | Thu, 13 Jan 2011 01:15:36 -0500 |
Ben describes a situation where two variables are obtained as a result of the encoding corresponding string variables. Stata automatically generated two systems of values labeling, which both overlap in their codes, even though the labels are different. When values of either of these two variables are put into the third one, neither of the labeling system applies, and the situation really reminds the classical "apples and oranges" basket. I will proceed under the assumption that the mixture problem is ruled out and the resulting variable makes sense. Stata's language is used in the following code, Mata can be more convenient and efficient. Here is an example: . clear . input str30 xs str30 ys price xs ys price 1. "North" "Red" 50 2. "South" "Green" 30 3. "East" "Blue" 48 4. "West" "Yellow" 24 5. "North" "Green" 36 6. "North" "Blue" 85 7. "South" "Red" 75 8. end . encode xs, generate(x) . encode ys, generate(y) . generate zs = cond(price>45,"x","y") . generate z = cond(price>45,x,y) . list, clean xs ys price x y zs z 1. North Red 50 North Red x 2 2. South Green 30 South Green y 2 3. East Blue 48 East Blue x 1 4. West Yellow 24 West Yellow y 4 5. North Green 36 North Green y 2 6. North Blue 85 North Blue x 2 7. South Red 75 South Red x 3 Here zs shows where the value is coming from and z the value of the new variable itself. When the value is coming from variable x the label from the x's labeling system must be used, and similarly for y. The problem stems from the fact that the two labeling systems were created independently from one another, and thus may not be used in such an operation. If the original string data is still available, Ben may change the above program to make the two encodings dependent on each other, by forcing Stata to use the same labeling system for both of them. The encode command provides a convenient option -label- exactly for this purpose. . clear . input str30 xs str30 ys price xs ys price 1. "North" "Red" 50 2. "South" "Green" 30 3. "East" "Blue" 48 4. "West" "Yellow" 24 5. "North" "Green" 36 6. "North" "Blue" 85 7. "South" "Red" 75 8. end . encode xs, generate(x) . encode ys, generate(y) label(x) . generate zs = cond(price>45,"x","y") . generate z = cond(price>45,x,y) . label values z x . list, clean xs ys price x y zs z 1. North Red 50 North Red x North 2. South Green 30 South Green y Green 3. East Blue 48 East Blue x East 4. West Yellow 24 West Yellow y Yellow 5. North Green 36 North Green y Green 6. North Blue 85 North Blue x North 7. South Red 75 South Red x South See that in the z variable the directions names are used when the values are taken from the variable x and colors names are used when the values are taken from the variable y. We further confirm that labels are applicable to this variable: . label list x: 1 East 2 North 3 South 4 West 5 Blue 6 Green 7 Red 8 Yellow If the original string information is not available anymore, Ben can adjust the values and the labels of one variable in such a way that they do not overlap with values of the other variable. For values, often adding a constant, e.g. 10000 is sufficient. It's slightly more difficult with the labeling system. But here is an example: . clear . input str30 xs str30 ys price xs ys price 1. "North" "Red" 50 2. "South" "Green" 30 3. "East" "Blue" 48 4. "West" "Yellow" 24 5. "North" "Green" 36 6. "North" "Blue" 85 7. "South" "Red" 75 8. end . encode xs, generate(x) . encode ys, generate(y) . list, clean xs ys price x y 1. North Red 50 North Red 2. South Green 30 South Green 3. East Blue 48 East Blue 4. West Yellow 24 West Yellow 5. North Green 36 North Green 6. North Blue 85 North Blue 7. South Red 75 South Red . preserve . uselabel x y, clear . assert trunc==0 . sum value if lname=="x" Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- value | 4 2.5 1.290994 1 4 . local shift = r(max)+1 . replace value=value+`shift' if lname=="y" (4 real changes made) . drop lname trunc . forvalues i=1/`=_N' { 2. label define zlab `=value[`i']' `"`=label[`i']'"', modify 3. } . tempfile lbl . label save zlab using `lbl' file C:\Users\Username\AppData\Local\Temp\ST_0000000q.tmp saved . restore . replace y=y+`shift' (7 real changes made) . label drop y // old labels not applicable anymore . do `lbl' . label define zlab 1 `"East"', modify . label define zlab 2 `"North"', modify . label define zlab 3 `"South"', modify . label define zlab 4 `"West"', modify . label define zlab 6 `"Blue"', modify . label define zlab 7 `"Green"', modify . label define zlab 8 `"Red"', modify . label define zlab 9 `"Yellow"', modify end of do-file . generate zs = cond(price>45,"x","y") . generate z = cond(price>45,x,y) . label values z zlab . list, clean xs ys price x y zs z 1. North Red 50 North 8 x North 2. South Green 30 South 7 y Green 3. East Blue 48 East 6 x East 4. West Yellow 24 West 9 y Yellow 5. North Green 36 North 7 y Green 6. North Blue 85 North 6 x North 7. South Red 75 South 8 x South . label list zlab: 1 East 2 North 3 South 4 West 6 Blue 7 Green 8 Red 9 Yellow x: 1 East 2 North 3 South 4 West end of do-file Note that the resulting listing is the same as with the first case. As always with Stata, there are other (and more efficient) ways of getting the same result. Finally to answer your question regarding obtaining the value based on the label. If you know the code, the label is obtained instantly: . sysuse auto . di `"`:label origin 1'"' Foreign . di `"`:label (foreign) 1'"' Foreign See help for extended macro functions. If you need an inverse transformation, i.e. obtaining the values by labels, you may do: . sysuse auto . uselabel origin . sum value if label=="Foreign", meanonly . di r(mean) 1 If you don't want to destroy the data, then loop through all labeled values, and use `:label labelname labeledvalue' to check if it matches the label you are looking for. I would personally prefer Mata, which simplifies the matter a lot: .mata st_local("v",strofreal(st_vlsearch("origin", "Foreign"))) .display `v' 1 Hope this helps. Best regards, Sergiy Radyakin On Wed, Jan 12, 2011 at 8:24 PM, Ben Hoen <bhoen@lbl.gov> wrote: > I have two variables (e.g. x and y) with value labels (e.g. xlab and ylab). > Both x and y were created via .encode (they were derived from string > variables) so they overlap in their underlying values but the value labels > are unique between x and y and within x and y. > > I would like to create a new variable z which would take on the value labels > of x (xlab) for some cases and y (ylab) for other cases based on an > expression. I do not care what the underlying values are, only that the > value labels are correctly assigned? > > Suggestions? > > Somewhat relatedly, is there a quick way on the command line to display the > underlying value of a variable if one only has the label? I imagine > something like: .display y if ylab="something" > > Ben > > Ben Hoen > Principal Research Associate > Lawrence Berkeley National Laboratory > Office: 845-758-1896 > Cell: 718-812-7589 > bhoen@lbl.gov > > > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/