Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: merging variables with labels
From
Sergiy Radyakin <[email protected]>
To
[email protected]
Subject
Re: st: merging variables with labels
Date
Thu, 13 Jan 2011 01:15:36 -0500
Ben describes a situation where two variables are obtained as a result of
the encoding corresponding string variables. Stata automatically generated
two systems of values labeling, which both overlap in their codes, even
though the labels are different.
When values of either of these two variables are put into the third one,
neither of the labeling system applies, and the situation really reminds the
classical "apples and oranges" basket. I will proceed under the assumption
that the mixture problem is ruled out and the resulting variable makes sense.
Stata's language is used in the following code, Mata can be more convenient
and efficient.
Here is an example:
. clear
. input str30 xs str30 ys price
xs ys price
1. "North" "Red" 50
2. "South" "Green" 30
3. "East" "Blue" 48
4. "West" "Yellow" 24
5. "North" "Green" 36
6. "North" "Blue" 85
7. "South" "Red" 75
8. end
. encode xs, generate(x)
. encode ys, generate(y)
. generate zs = cond(price>45,"x","y")
. generate z = cond(price>45,x,y)
. list, clean
xs ys price x y zs z
1. North Red 50 North Red x 2
2. South Green 30 South Green y 2
3. East Blue 48 East Blue x 1
4. West Yellow 24 West Yellow y 4
5. North Green 36 North Green y 2
6. North Blue 85 North Blue x 2
7. South Red 75 South Red x 3
Here zs shows where the value is coming from and z the value of the
new variable itself. When the value is coming from variable x the label
from the x's labeling system must be used, and similarly for y.
The problem stems from the fact that the two labeling systems were
created independently from one another, and thus may not be used
in such an operation.
If the original string data is still available, Ben may change the above
program to make the two encodings dependent on each other, by
forcing Stata to use the same labeling system for both of them.
The encode command provides a convenient option -label- exactly
for this purpose.
. clear
. input str30 xs str30 ys price
xs ys price
1. "North" "Red" 50
2. "South" "Green" 30
3. "East" "Blue" 48
4. "West" "Yellow" 24
5. "North" "Green" 36
6. "North" "Blue" 85
7. "South" "Red" 75
8. end
. encode xs, generate(x)
. encode ys, generate(y) label(x)
. generate zs = cond(price>45,"x","y")
. generate z = cond(price>45,x,y)
. label values z x
. list, clean
xs ys price x y zs z
1. North Red 50 North Red x North
2. South Green 30 South Green y Green
3. East Blue 48 East Blue x East
4. West Yellow 24 West Yellow y Yellow
5. North Green 36 North Green y Green
6. North Blue 85 North Blue x North
7. South Red 75 South Red x South
See that in the z variable the directions names are used when the
values are taken from the variable x and colors names are used when
the values are taken from the variable y.
We further confirm that labels are applicable to this variable:
. label list
x:
1 East
2 North
3 South
4 West
5 Blue
6 Green
7 Red
8 Yellow
If the original string information is not available anymore, Ben can
adjust the values and the labels of one variable in such a way that they
do not overlap with values of the other variable.
For values, often adding a constant, e.g. 10000 is sufficient.
It's slightly more difficult with the labeling system. But here is an example:
. clear
. input str30 xs str30 ys price
xs ys price
1. "North" "Red" 50
2. "South" "Green" 30
3. "East" "Blue" 48
4. "West" "Yellow" 24
5. "North" "Green" 36
6. "North" "Blue" 85
7. "South" "Red" 75
8. end
. encode xs, generate(x)
. encode ys, generate(y)
. list, clean
xs ys price x y
1. North Red 50 North Red
2. South Green 30 South Green
3. East Blue 48 East Blue
4. West Yellow 24 West Yellow
5. North Green 36 North Green
6. North Blue 85 North Blue
7. South Red 75 South Red
. preserve
. uselabel x y, clear
. assert trunc==0
. sum value if lname=="x"
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
value | 4 2.5 1.290994 1 4
. local shift = r(max)+1
. replace value=value+`shift' if lname=="y"
(4 real changes made)
. drop lname trunc
. forvalues i=1/`=_N' {
2. label define zlab `=value[`i']' `"`=label[`i']'"', modify
3. }
. tempfile lbl
. label save zlab using `lbl'
file C:\Users\Username\AppData\Local\Temp\ST_0000000q.tmp saved
. restore
. replace y=y+`shift'
(7 real changes made)
. label drop y // old labels not applicable anymore
. do `lbl'
. label define zlab 1 `"East"', modify
. label define zlab 2 `"North"', modify
. label define zlab 3 `"South"', modify
. label define zlab 4 `"West"', modify
. label define zlab 6 `"Blue"', modify
. label define zlab 7 `"Green"', modify
. label define zlab 8 `"Red"', modify
. label define zlab 9 `"Yellow"', modify
end of do-file
. generate zs = cond(price>45,"x","y")
. generate z = cond(price>45,x,y)
. label values z zlab
. list, clean
xs ys price x y zs z
1. North Red 50 North 8 x North
2. South Green 30 South 7 y Green
3. East Blue 48 East 6 x East
4. West Yellow 24 West 9 y Yellow
5. North Green 36 North 7 y Green
6. North Blue 85 North 6 x North
7. South Red 75 South 8 x South
. label list
zlab:
1 East
2 North
3 South
4 West
6 Blue
7 Green
8 Red
9 Yellow
x:
1 East
2 North
3 South
4 West
end of do-file
Note that the resulting listing is the same as with the first case.
As always with Stata, there are other (and more efficient) ways of
getting the same result.
Finally to answer your question regarding obtaining the value based on
the label.
If you know the code, the label is obtained instantly:
. sysuse auto
. di `"`:label origin 1'"'
Foreign
. di `"`:label (foreign) 1'"'
Foreign
See help for extended macro functions.
If you need an inverse transformation, i.e. obtaining the values by
labels, you may do:
. sysuse auto
. uselabel origin
. sum value if label=="Foreign", meanonly
. di r(mean)
1
If you don't want to destroy the data, then loop through all labeled
values, and use
`:label labelname labeledvalue' to check if it matches the label you
are looking for.
I would personally prefer Mata, which simplifies the matter a lot:
.mata st_local("v",strofreal(st_vlsearch("origin", "Foreign")))
.display `v'
1
Hope this helps.
Best regards, Sergiy Radyakin
On Wed, Jan 12, 2011 at 8:24 PM, Ben Hoen <[email protected]> wrote:
> I have two variables (e.g. x and y) with value labels (e.g. xlab and ylab).
> Both x and y were created via .encode (they were derived from string
> variables) so they overlap in their underlying values but the value labels
> are unique between x and y and within x and y.
>
> I would like to create a new variable z which would take on the value labels
> of x (xlab) for some cases and y (ylab) for other cases based on an
> expression. I do not care what the underlying values are, only that the
> value labels are correctly assigned?
>
> Suggestions?
>
> Somewhat relatedly, is there a quick way on the command line to display the
> underlying value of a variable if one only has the label? I imagine
> something like: .display y if ylab="something"
>
> Ben
>
> Ben Hoen
> Principal Research Associate
> Lawrence Berkeley National Laboratory
> Office: 845-758-1896
> Cell: 718-812-7589
> [email protected]
>
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/