It requires several extra steps, and encoding one database produces
different codes than encoding another. More generally, suppose I had several
variables sharing the same code set, as is the case here? Or suppose it were
the other way around, that is, no value labels for numeric variables? I think
folks would find that difficult to deal with. I would submit that for many
Stata users string variables are often analysis variables, so value labels
for them would be handy.
In SAS you set up a format and then, whenever you need to, apply that
format to any variable you choose. You can store the format, so you have
to create it only once. No merging, no encoding each variable, no sorting,
etc.
I have found an interim way to deal with it, as suggested. It's easiest to
just make a new variable holding the label string by merging a labeling
dataset with the big one. Of course I seldom really need all million records,
so I subset first and merge the label dataset afterwards. It's pretty easy to
set up in a macro.
Thanks for all the responses and discussion, and for the patience with a new
Stata user. This round certainly taught me a lot of things I needed to know.
Best regards,
Richard Hoskins
-----Original Message-----
From: [email protected]
[mailto:[email protected]]On Behalf Of baum
Sent: Sunday, October 27, 2002 7:52 AM
To: [email protected]
Subject: st: Re: value labels for string variables
--On Sunday, October 27, 2002 2:33 -0500 Richard wrote:
> SAS, SPSS, and S-Plus allow value labels for string variables. Also they
> allow the development of value labels independent of the database being
> value labeled. (Proc Format)
>
> Stata does not, at least not without some (considerable) rigmarole.
>
> Maybe Stata people will fix this.
>
> At present the solution (that is quickest) seems to be developing a new
> variable using the value label as a variable value. This is not
> database-wise efficient. And these labels are not easily reduced to short
> strings; subtle disease distinctions are difficult to reduce to a few
> characters.
I don't see the issue here. Say that you have a million records, and one
string variable recorded therein is str2 state, AK..WY [DC PR]. You do not
want to store the 'long name' of the state in the database, so you set up a
new dataset with 50, 51 or 52 cases, containing str2 state. You encode
state into int statename, and you define a value label containing 50, 51 or
52 values for that integer variable, containing the long names of states.
If you then merge this dataset with your million-record dataset on state,
statename will contain the 'value labels' of state, but will not store them
as strings; it will store the underlying integer value. The overhead
associated with this strategy is merely one integer per case (in the case
of states, I could use a byte data type; in general an int will suffice).
The following dataset appears to have long names for 'statename'; in
reality it is an integer with a value label.
      state   var2   statename
  1.  MA      222    Massachusetts
  2.  MA      999    Massachusetts
  3.  MA      111    Massachusetts
  4.  ME      888    Maine
  5.  ME      333    Maine
  6.  NH      444    New Hampshire
  7.  NH      777    New Hampshire
  8.  VT      666    Vermont
  9.  VT      555    Vermont
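The steps described above might be sketched in Stata roughly as follows.
This is an illustration only, not code from the original post: the file
names (statelabels, bigfile) and the label name (statelbl) are hypothetical,
and the merge m:1 syntax assumes Stata 11 or later (older releases would
sort both files and use the earlier merge syntax).

```stata
* Sketch under stated assumptions: statelabels, bigfile and statelbl
* are hypothetical names chosen for illustration.

* 1. Build the small lookup dataset, one row per state code.
clear
input str2 state
"MA"
"ME"
"NH"
"VT"
end

* encode numbers the distinct strings 1, 2, ... in alphabetical order
* (MA=1, ME=2, NH=3, VT=4).
encode state, generate(statename)

* Attach a value label holding the long names to those integers.
label define statelbl 1 "Massachusetts" 2 "Maine" 3 "New Hampshire" 4 "Vermont"
label values statename statelbl

* encode creates a long; shrink it to the smallest type that fits.
compress statename
save statelabels, replace

* 2. Merge the lookup onto the large dataset by state.
*    bigfile is assumed to contain state and var2.
use bigfile, clear
merge m:1 state using statelabels
list state var2 statename
```

Because the value label travels with the lookup dataset, the same small
file can be merged onto any other dataset that carries the two-letter
state code.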
It does not seem to me that this strategy is onerous, and it is only
limited by the existing limits on labels. I do not imagine that the
overhead involved can be improved upon in other packages' implementation of
this feature; if you want to associate one of ~32K or ~64K string values
with a case, you need an appropriately sized integer.
Kit
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/