
Re: st: keeping an inventory of dropped observations

From	n j cox <[email protected]>
To	[email protected]
Subject	Re: st: keeping an inventory of dropped observations
Date	Mon, 02 Apr 2007 12:34:34 +0100

Friedrich Huebler had an interesting suggestion. Here are some
further comments.

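There are two problems with your code. The simpler one is typos: reason_for_drop is created as numeric (= .) but then assigned a string, and the final -tabulate- tests reason_for_drop == 1 where mark_for_drop == 1 is meant. With those fixed: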

gen mark_for_drop = 0
gen reason_for_drop = ""
replace mark_for_drop = 1 if eodlymph == 99
replace reason_for_drop = "missing lymph" if eodlymph == 99
...
drop if eodlymph == 99
...
tab mark_for_drop reason_for_drop if mark_for_drop == 1

The more fundamental one is that you can't have it both ways. Once observations have been -drop-ped, they are not available to -tabulate- (or do anything else with). You are not going to see "missing lymph" in your table because you already -drop-ped those observations.

You are here combining two good ideas that don't really combine. The first is to document data management by a -log- or -cmdlog- file. It is perhaps worth underlining that you can annotate this as desired with comments:

* problem: missing lymph if eodlymph == 99
drop if eodlymph == 99

In general, you would need also to document the use of -keep-, -reshape-, -contract-, etc.
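
As a rough sketch of this first idea, assuming your dataset (with eodlymph) is already in memory and using a made-up log file name, cleaning, the annotated record might be produced like this:

```stata
* Sketch only: "cleaning" is a hypothetical log file name
log using cleaning, replace text

* problem: missing lymph if eodlymph == 99
* count first, so the number affected appears next to the reason in the log
count if eodlymph == 99
drop if eodlymph == 99

log close
```

The comment, the count and the report from -drop- then sit together in the log, so searching the log for "problem:" finds each step.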

The second is to record, within a dataset, which observations are not included in a particular analysis. The only easy way to do that is not to -drop- them but to mark them in some way, and you are most of the way there. The -mark- command offers one way to do this, but it is just as easy to do it by hand, borrowing a standard programmers' technique. See also, in due course, a Tip from Ben Jann in Stata Journal 7(2).

gen byte touse = 1
gen problem = ""
replace touse = 0 if eodlymph == 99
replace problem = problem + "missing lymph; " if eodlymph == 99
...
<analysis> if touse
...
tab problem if !touse

Note the use of

replace problem = problem + "<reason>; " if ...

which might be advisable if observations could be problematic for more than one
reason. Of course, you may need to use abbreviations, codes, etc.
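
For a version that can be run as is, here is a minimal sketch on the auto data shipped with Stata. The dataset, the two criteria and the regression are stand-ins chosen only for illustration, not anything from your project:

```stata
* Sketch only: auto data and both criteria are illustrative assumptions
sysuse auto, clear

gen byte touse = 1
gen problem = ""

* first reason for exclusion
replace touse = 0 if missing(rep78)
replace problem = problem + "missing rep78; " if missing(rep78)

* second reason; an observation failing both keeps both labels
replace touse = 0 if price > 12000
replace problem = problem + "price > 12000; " if price > 12000

* analysis uses only unflagged observations
regress mpg weight if touse

* inventory of excluded observations, by reason
tab problem if !touse
```

Nothing is -drop-ped here, so the inventory can be tabulated at any point, and the same touse variable can be reused across analyses.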

Another possibility is the use of -notes-, but I don't think that is what you are looking for really.

Nick
[email protected]

Michael McCulloch <[email protected]>

While cleaning a dataset, I'm periodically dropping observations that
meet certain criteria, for example:
drop if eodlymph==99

Since this occurs very often within a long do-file, I'd like to keep
an inventory of dropped observations & my reason for doing so. Aside
from manually searching through my log file, is there a more elegant
way than what I suggest below, to do this?

For example:
gen mark_for_drop=0
gen reason_for_drop=.
replace mark_for_drop=1 if eodlymph==99
replace reason_for_drop="missing lymph" if eodlymph==99
...
drop if eodlymph==99
...
tab mark_for_drop reason_for_drop if reason_for_drop==1
