--On Saturday, October 26, 2002 2:33 -0400 Richard wrote:
> Unfortunately all my data is loaded with string variables for codes for
> various diseases, hospital procedures, geographic codes, etc. and to put
> those labels as part of the database would significantly enlarge the
> database. I tried the encode route but this means that every database I
> have with the same set of codes has a different set of encoded values.

This would seem to be a relational database definitional issue (something
that occupies altogether too many of my brain cells these days). Say you
take one comprehensive set of disease codes (string) and encode them,
saving those two variables (the string and the arbitrary integer which has
been assigned) to a new dataset. Now make the value labels apply to that
integer. You will then have a dataset with two variables: the codes, in
string form, and the integer, whose value label carries the long name for
that string. This dataset may be merged, using the string as the key, onto any other
dataset, and you will end up with those two variables in any other dataset
which has a disease variable. Not quite the same as what you're requesting
(which sounds reasonable) but it gives the flavor, I think, of providing
the longer 'aliases' to your string codes. As Nick Cox said, this is really
an issue of having 'short' and 'long' versions of the same variable, sort
of like we could use 'NJC' or 'CFB' in one context and "Nicholas J. Cox" or
"Christopher F Baum" in another.

You can tabulate the integer variable, and it will display its value
label--which might be Waterhouse-Friderichsen syndrome, or whatever.

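To make the lookup idea concrete, here is a minimal sketch in Python, used purely as a language-neutral illustration (in Stata the corresponding steps would involve encode, label and merge). All names here (long_names, code_to_id, attach_ids, tabulate) are hypothetical, and the code A391 is only an illustrative entry; D820 is the example used later in this post. The point is that the integer is assigned once, from one master list, so every dataset merged against it gets the same integer for the same code.

```python
from collections import Counter

# Hypothetical master list: short string code -> long name.
long_names = {
    "D820": "Wiskott-Aldrich syndrome",
    "A391": "Waterhouse-Friderichsen syndrome",  # illustrative entry
}

# "Encode" once: one arbitrary but fixed integer per code (sorted order).
code_to_id = {code: i for i, code in enumerate(sorted(long_names), start=1)}

# "Value label": the long name is attached to the integer, not to each record.
id_to_label = {code_to_id[code]: name for code, name in long_names.items()}


def attach_ids(records, key="dx"):
    """Merge the lookup onto a dataset, using the string code as the key."""
    return [dict(r, dx_id=code_to_id.get(r[key])) for r in records]


def tabulate(records):
    """Count records per integer id, displayed by long name."""
    counts = Counter(id_to_label.get(r["dx_id"], "(unlabeled)") for r in records)
    return dict(counts)


hospital_a = attach_ids([{"dx": "D820"}, {"dx": "A391"}, {"dx": "D820"}])
hospital_b = attach_ids([{"dx": "A391"}])
print(tabulate(hospital_a))
```

Because both hospital_a and hospital_b are merged against the same lookup, a given code maps to the same integer in each, which is what running encode separately on each dataset does not guarantee.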
This is essentially Nick Winter's suggestion, I think (including his point
that this will ensure the unique definition). But I would like to promote
the understanding that thinking of these things as relational database
issues (even though Stata is not a RDBMS) is often useful. Furthermore, in
terms of your concern for storage space, you need not keep the merged
version of the dataset -- just merge the definitions file on when you need
to see the 'long names', or when you're producing tables that should have
those names. There may be other contexts where you're just doing data
manipulation or estimation and this added detail is unnecessary. If merging
the files on demand is less time-consuming than permanently adding a huge
amount to their size (which will cause them to be more slowly read in),
that would be a good idea.

Erik went on to say that
> This may be fine if possible, but for those of us who regularly work
> with millions of observations, it is often not feasible. That it is not
> possible to label string values is a shortcoming that should be fixed
> in future releases.

which gets to my point of this as an RDBMS issue. If there are 20,000
diseases, then there are 20,000 long "value label" forms which must be
stored. Whether you call them value labels or not does not matter. The
issue is whether those long-form 'labels' are to be permanently stored on
each of the million records. They **need not be** if they are defined as
the value label of the integer variable defined above. The overhead, as
Nick Winter indicated as well, is then the addition of one integer per case
-- or 4 bytes for each of the million records -- plus the space needed to
store up to 64K value labels for that variable. I don't see this as much
less convenient than having the value label directly attached to the
original string variable -- which in this scheme is just another way of
saying disease no. 1234, which now has a 'short name' like D820 and a 'long
name' like Wiskott-Aldrich syndrome.
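The storage argument can be put as rough arithmetic. The sketch below is in Python, purely for illustration. It uses the figures from this post (a million records, 20,000 codes, 4 bytes per integer) plus one assumption that is not in the post: an average long name of 40 bytes. It ignores any per-label bookkeeping overhead, so treat the numbers as orders of magnitude only.

```python
def storage_bytes(n_records, n_codes, avg_label_len, int_width=4):
    """Compare two ways of carrying long names (rough estimate only).

    Returns (per_record, lookup):
      per_record: the long name stored as a string on every record
      lookup:     one integer per record, plus each long name stored once
    """
    per_record = n_records * avg_label_len
    lookup = n_records * int_width + n_codes * avg_label_len
    return per_record, lookup


# avg_label_len=40 is an assumed figure, chosen only for illustration.
per_record, lookup = storage_bytes(1_000_000, 20_000, 40)
print(per_record, lookup)
```

Under these assumptions the per-record cost is the 4-byte integer rather than the full long name, and the names themselves are stored only once.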
Kit