Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: merging variables with labels


From   Sergiy Radyakin <[email protected]>
To   [email protected]
Subject   Re: st: merging variables with labels
Date   Thu, 13 Jan 2011 01:15:36 -0500

Ben describes a situation where two variables are obtained as a result of
the encoding corresponding string variables. Stata automatically generated
two systems of values labeling, which both overlap in their codes, even
though the labels are different.

When values of either of these two variables are put into the third one,
neither of the labeling system applies, and the situation really reminds the
classical "apples and oranges" basket. I will proceed under the assumption
that the mixture problem is ruled out and the resulting variable makes sense.

Stata's language is used in the following code, Mata can be more convenient
and efficient.

Here is an example:

. clear
. input str30 xs str30 ys price
                                 xs                              ys      price
  1.         "North" "Red" 50
  2.         "South" "Green" 30
  3.         "East" "Blue" 48
  4.         "West" "Yellow" 24
  5.         "North" "Green" 36
  6.         "North" "Blue" 85
  7.         "South" "Red" 75
  8. end
. encode xs, generate(x)
. encode ys, generate(y)
. generate zs = cond(price>45,"x","y")
. generate z = cond(price>45,x,y)
. list, clean
          xs       ys   price       x        y   zs   z
  1.   North      Red      50   North      Red    x   2
  2.   South    Green      30   South    Green    y   2
  3.    East     Blue      48    East     Blue    x   1
  4.    West   Yellow      24    West   Yellow    y   4
  5.   North    Green      36   North    Green    y   2
  6.   North     Blue      85   North     Blue    x   2
  7.   South      Red      75   South      Red    x   3

Here zs shows where the value is coming from and z the value of the
new variable itself. When the value is coming from variable x the label
from the x's labeling system must be used, and similarly for y.

The problem stems from the fact that the two labeling systems were
created independently from one another, and thus may not be used
in such an operation.

If the original string data is still available, Ben may change the above
program to make the two encodings dependent on each other, by
forcing Stata to use the same labeling system for both of them.
The encode command provides a convenient option -label- exactly
for this purpose.

. clear
. input str30 xs str30 ys price
                                 xs                              ys      price
  1.         "North" "Red" 50
  2.         "South" "Green" 30
  3.         "East" "Blue" 48
  4.         "West" "Yellow" 24
  5.         "North" "Green" 36
  6.         "North" "Blue" 85
  7.         "South" "Red" 75
  8. end
. encode xs, generate(x)
. encode ys, generate(y) label(x)
. generate zs = cond(price>45,"x","y")
. generate z = cond(price>45,x,y)
. label values z x
. list, clean
          xs       ys   price       x        y   zs        z
  1.   North      Red      50   North      Red    x    North
  2.   South    Green      30   South    Green    y    Green
  3.    East     Blue      48    East     Blue    x     East
  4.    West   Yellow      24    West   Yellow    y   Yellow
  5.   North    Green      36   North    Green    y    Green
  6.   North     Blue      85   North     Blue    x    North
  7.   South      Red      75   South      Red    x    South


See that in the z variable the directions names are used when the
values are taken from the variable x and colors names are used when
the values are taken from the variable y.

We further confirm that labels are applicable to this variable:
. label list
x:
           1 East
           2 North
           3 South
           4 West
           5 Blue
           6 Green
           7 Red
           8 Yellow

If the original string information is not available anymore, Ben can
adjust the values and the labels of one variable in such a way that they
do not overlap with values of the other variable.

For values, often adding a constant, e.g. 10000 is sufficient.
It's slightly more difficult with the labeling system. But here is an example:

. clear
. input str30 xs str30 ys price
                                 xs                              ys      price
  1.         "North" "Red" 50
  2.         "South" "Green" 30
  3.         "East" "Blue" 48
  4.         "West" "Yellow" 24
  5.         "North" "Green" 36
  6.         "North" "Blue" 85
  7.         "South" "Red" 75
  8. end
. encode xs, generate(x)
. encode ys, generate(y)
. list, clean
          xs       ys   price       x        y
  1.   North      Red      50   North      Red
  2.   South    Green      30   South    Green
  3.    East     Blue      48    East     Blue
  4.    West   Yellow      24    West   Yellow
  5.   North    Green      36   North    Green
  6.   North     Blue      85   North     Blue
  7.   South      Red      75   South      Red
. preserve
. uselabel x y, clear
. assert trunc==0
. sum value if lname=="x"
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       value |         4         2.5    1.290994          1          4
. local shift = r(max)+1
. replace value=value+`shift' if lname=="y"
(4 real changes made)
. drop lname trunc
. forvalues i=1/`=_N' {
  2.   label define zlab `=value[`i']' `"`=label[`i']'"', modify
  3. }
. tempfile lbl
. label save zlab using `lbl'
file C:\Users\Username\AppData\Local\Temp\ST_0000000q.tmp saved
. restore
. replace y=y+`shift'
(7 real changes made)
. label drop y // old labels not applicable anymore
. do `lbl'
. label define zlab 1 `"East"', modify
. label define zlab 2 `"North"', modify
. label define zlab 3 `"South"', modify
. label define zlab 4 `"West"', modify
. label define zlab 6 `"Blue"', modify
. label define zlab 7 `"Green"', modify
. label define zlab 8 `"Red"', modify
. label define zlab 9 `"Yellow"', modify
end of do-file
. generate zs = cond(price>45,"x","y")
. generate z = cond(price>45,x,y)
. label values z zlab
. list, clean
          xs       ys   price       x   y   zs        z
  1.   North      Red      50   North   8    x    North
  2.   South    Green      30   South   7    y    Green
  3.    East     Blue      48    East   6    x     East
  4.    West   Yellow      24    West   9    y   Yellow
  5.   North    Green      36   North   7    y    Green
  6.   North     Blue      85   North   6    x    North
  7.   South      Red      75   South   8    x    South
. label list
zlab:
           1 East
           2 North
           3 South
           4 West
           6 Blue
           7 Green
           8 Red
           9 Yellow
x:
           1 East
           2 North
           3 South
           4 West
end of do-file

Note that the resulting listing is the same as with the first case.
As always with Stata, there are other (and more efficient) ways of
getting the same result.

Finally to answer your question regarding obtaining the value based on
the label.

If you know the code, the label is obtained instantly:
. sysuse auto
. di `"`:label origin 1'"'
Foreign
. di `"`:label (foreign) 1'"'
Foreign

See help for extended macro functions.

If you need an inverse transformation, i.e. obtaining the values by
labels, you may do:
. sysuse auto
. uselabel origin
. sum value if label=="Foreign", meanonly
. di r(mean)
1

If you don't want to destroy the data, then loop through all labeled
values, and use
`:label labelname labeledvalue' to check if it matches the label you
are looking for.

I would personally prefer Mata, which simplifies the matter a lot:

.mata st_local("v",strofreal(st_vlsearch("origin", "Foreign")))
.display `v'
1

Hope this helps.

Best regards, Sergiy Radyakin



On Wed, Jan 12, 2011 at 8:24 PM, Ben Hoen <[email protected]> wrote:
> I have two variables (e.g. x and y) with value labels (e.g. xlab and ylab).
> Both x and y were created via .encode (they were derived from string
> variables) so they overlap in their underlying values but the value labels
> are unique between x and y and within x and y.
>
> I would like to create a new variable z which would take on the value labels
> of x (xlab) for some cases and y (ylab) for other cases based on an
> expression.  I do not care what the underlying values are, only that the
> value labels are correctly assigned?
>
> Suggestions?
>
> Somewhat relatedly, is there a quick way on the command line to display the
> underlying value of a variable if one only has the label?  I imagine
> something like: .display y if ylab="something"
>
> Ben
>
> Ben Hoen
> Principal Research Associate
> Lawrence Berkeley National Laboratory
> Office: 845-758-1896
> Cell: 718-812-7589
> [email protected]
>
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index