Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: limitations of "generate" with missing data
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: limitations of "generate" with missing data
Date: Tue, 12 Apr 2011 09:54:26 +0100
If we generalise to
gen result = a > b
and focus on -a-, -b- numeric (the comparison makes sense for strings
too) then in a way it's reasonable to expect three possible answers,
say 0 for false, 1 for true and ? for "can't tell; at least one
argument is missing". I use ? for the sake of argument to detach the
argument slightly from what Stata does at present.
I've been at two users' meetings where talks have proposed three-way
logic of this kind. Such a talk is guaranteed to generate discussion
about as long as the talk itself and to split the audience three
ways, namely
1. Stata's two-way logic often bites -- most commonly perhaps in this
case -- but you get used to it, mostly; it is too late to "fix" it,
and no other solution is better.
2. There is a good case for a three-way logic but definitely not that
proposed by the speaker, which is quite illogical.
3. The speaker is right and Stata is fundamentally flawed and should
change its ways and documentation forthwith.
I think the crunch is that although a different rule may make sense
for at least some problems, the bigger difficulty is being consistent,
having as few rules as possible, and not introducing problems that
are worse and more difficult to understand. For example (and a very
long article could be written about this, although I don't intend to
write it):
* Once ? is allowed as a logical result, then the truth tables need to
be expanded for 0 & ?, 1 & ?, 0 | ?, etc., etc.
* Once ? is allowed as a logical result, you need a rule on where it
goes on sorting. (That need not be that ? is just numeric missing.)
* Once ? is allowed as a logical result, what about ? + a, ? - a, ...
log(?). Those are probably all easy but that won't stop users being
puzzled by the results.
* What about the extended missing values .a ... .z?
* Do you need new functions and operators?
* If you change Stata, quite what is allowed under version control?
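To make the first point concrete, here is a minimal sketch of what expanded truth tables might look like. It assumes Kleene-style three-valued rules and uses numeric missing to stand in for ?. Neither choice comes from any actual proposal, and the names p, q, and3, or3, and2 and or2 are invented for illustration.

```stata
* Hypothetical sketch: Kleene-style three-valued AND and OR,
* with numeric missing (.) standing in for "?"
clear
input p q
0 0
0 1
0 .
1 0
1 1
1 .
. 0
. 1
. .
end

* Three-valued AND: 0 if either argument is 0;
* otherwise ? if either is missing; otherwise 1
gen and3 = cond(p == 0 | q == 0, 0, cond(missing(p, q), ., 1))

* Three-valued OR: 1 if either argument is 1;
* otherwise ? if either is missing; otherwise 0
gen or3 = cond(p == 1 | q == 1, 1, cond(missing(p, q), ., 0))

* Stata's existing two-way operators, for comparison:
* missing is nonzero and so is treated as true
gen and2 = p & q
gen or2 = p | q

list, sep(3)
```

Even in this small case the two-way and three-way columns disagree on every row with a missing argument except 0 & ? and 1 | ?, where the known argument settles the answer under both sets of rules.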
I know that no-one is necessarily proposing _any_ of this: I am just
showing one way or the other how many threads are tangled together
when you start wanting something different.
Anyway, note that the interpretation of
gen result = a > b if !missing(a, b)
is that you don't know what the result should be if either argument is
missing, not that Stata can't tell. But you get missings from missings
either way.
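A small sketch of the difference, on made-up data (the values of a and b, and the variable name naive, are assumptions for illustration only):

```stata
* Toy data; the values are invented for illustration
clear
input a b
1 2
3 2
. 2
3 .
. .
end

* Numeric missing counts as larger than any nonmissing number,
* so . > 2 is true (1), while 3 > . and . > . are false (0)
gen naive = a > b

* Restricting to complete pairs leaves result missing
* wherever a or b is missing
gen result = a > b if !missing(a, b)

list
```

On the last three observations naive holds 1, 0 and 0, whereas result is missing in all three.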
Nick
On Tue, Apr 12, 2011 at 5:02 AM, Steven Samuels <[email protected]> wrote:
> Michael, lest you think this problem is unique to Stata, I would add that SAS sorts missing values before, not after, non-missing ones. SPSS will sort some missing values ("user-defined"), but not others ("system missing").
>
> Steve
> [email protected]
>
> On Apr 11, 2011, at 6:15 PM, Nick Cox wrote:
>
> The underlying problem can be illustrated by sorting. Suppose we
> -sort- a variable, which contains missings, in numeric order. Where do
> the missings go? We need a decision: either missing is regarded as
> larger than any non-missing, or smaller than any non-missing. Stata
> made the first decision.
>
> Anyway, here are some solutions:
>
> gen myvar1 = (gread_comp_score_pcnt > .79) if gread_comp_score_pcnt < .
>
> gen myvar2 = (gread_comp_score_pcnt > .79) if !missing(gread_comp_score_pcnt)
>
> gen myvar3 = cond(missing(gread_comp_score_pcnt), ., (gread_comp_score_pcnt > .79))
>
> gen myvar4 = (gread_comp_score_pcnt > .79) / (!missing(gread_comp_score_pcnt))
>
> (5. don't throw away information by turning a measure into an indicator!)
>
> Nick
>
> On Mon, Apr 11, 2011 at 11:01 PM, Michael Costello
> <[email protected]> wrote:
>> Statalisters,
>>
>> I recently ran into a problem with the following dataset:
>>
>> . tab gread_comp_score_pcnt, m
>> gread_comp_ |
>>  score_pcnt |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>           0 |        150        7.50        7.50
>>          .2 |         85        4.25       11.75
>>          .4 |         97        4.85       16.60
>>          .6 |         82        4.10       20.70
>>          .8 |         72        3.60       24.30
>>           1 |         15        0.75       25.05
>>           . |      1,499       74.95      100.00
>> ------------+-----------------------------------
>>       Total |      2,000      100.00
>>
>> The high number of "missing" is by design, a by-product of a
>> horizontally structured dataset that I have yet to rectify.
>>
>> When I run the command:
>> gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79)
>> I am left with
>>
>> . tab gread_comp_score_pcnt80, m
>> gread_comp_ |
>> score_pcnt8 |
>>           0 |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>           0 |        414       20.70       20.70
>>           1 |      1,586       79.30      100.00
>> ------------+-----------------------------------
>>       Total |      2,000      100.00
>>
>> As you can see, the 87 values above .79 were set to 1, but so were all
>> the missing values!! I have toyed with the code a bit, trying
>> variations such as
>> . gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79 & gread_comp_score_pcnt!=.)
>> but that converts all the missing to 0's, which is only marginally better.
>>
>> So the question is, is there some way to use a single, precise line of
>> code to create eighty-seven 1's, four hundred fourteen 0's and 1499
>> Missing values in one dummy variable? I know I can do it with several
>> lines of code, but I'm looking for something more concise, as it needs
>> to run many hundreds of times.