Unless there is some information about how the final sample was
selected, brute force is the only way. It may be direct (cycle over
each value and check whether it is still there) or it could be more
involved, but with the same thing going on behind the scenes. One
thing to consider, however, is whether you have more deleted labels or
more labels that are kept. In some cases it might be more efficient to
cycle through the observations that are left than through all the
labels, especially if the observations are unique. Example: you have
observations, each representing an occupation; each occupation has a
label, and you want to keep only "dangerous" occupations (defined as
you like). There will likely be relatively few of them among all, so
go brute force by observations and keep the labels that they are
using.
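A minimal sketch of that observation-side brute force (the variable
name "occupation" and the label names "occlbl" and "newlbl" are
hypothetical; adjust to your data):

```stata
* Collect the distinct values that survive the selection,
* then rebuild a value label containing only those entries.
levelsof occupation, local(used)
foreach v of local used {
    local txt : label occlbl `v'
    label define newlbl `v' "`txt'", add
}
label values occupation newlbl
label drop occlbl
```

-levelsof- loops over distinct observed values only, so this is cheap
when few values remain, regardless of how many labels were originally
defined.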
You can also define your labels as a dataset with two fields: a
numeric code and a string label. After the selection in the data has
occurred, you can merge the two datasets to determine which labels
must be kept. But the overhead from carrying the unused labels should
not be very large.
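A sketch of the merge approach, assuming a hypothetical labels.dta
with one row per label and the two fields "code" (numeric) and "lab"
(string), and a main dataset whose labeled variable is also called
"code":

```stata
* After subsetting the main data, reduce it to its distinct codes
* and match against the label dataset; the matched rows are the
* labels that must be kept.
preserve
keep code
duplicates drop
merge 1:1 code using labels, keep(match) nogenerate
list code lab
restore
```

The -keep(match)- option discards label rows whose codes no longer
appear in the data, which is exactly the set you want to retain.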
Be advised also that by dropping labels you can significantly change
the meaning of the remaining ones. Here is a problem I ran into once:
Canada - 10
USA - 20
Russia - 0
China - 5
India - 20
Germany - 0
Rest of the World - 50
Let the numbers represent the frequencies, and suppose, as you said,
you want to drop the labels that are not used in the data. By removing
Russia and Germany from the list of labels, you get the following:
Canada - 10
USA - 20
China - 5
India - 20
Rest of the World - 50
Now if someone looks at this tabulation, it is natural to think that
Russia and Germany fall into the category "Rest of the World", since
they are not mentioned, and may theoretically have positive counts
contributing to those 50. This is probably not something you want to
happen, unless your data is going to be very, very well documented. In
the extreme case, you have just one label, "other", and you leave your
data user to guess, "other than what???"
Best regards, Sergiy Radyakin
On Feb 15, 2008 10:25 PM, David Elliott <[email protected]> wrote:
> In the same vein, I have been looking for a less than brute force way
> to drop individual value definitions for values which are no longer in
> my data. For example, I have a data set that has literally thousands
> of code descriptions that have been encoded. When I subsequently
> -keep- only a few values, I still have all the parent dataset's values
> defined which can represent a significant overhead in the resulting
> dataset. I'd like to reduce the value definitions to only those that
> still have corresponding values.
>
> As I stated above, I can use a brute force approach of cycling through
> individual label value definitions and dropping (-lab def mydef # "" ,
> modify- where # is a value no longer in the dataset), but suspect
> there is a more elegant way to do this.
>
> DCE
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>