Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: limitations of "generate" with missing data
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: limitations of "generate" with missing data
Date
Mon, 11 Apr 2011 23:19:51 +0100
Correction: add a closing ) at the end of solution #3 (the -cond()- line).
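
As a rough check of the four one-line solutions quoted below, here is a minimal sketch on a made-up eight-row dataset. It assumes that the variable tested in the -if- and -missing()- conditions is meant to be the same gread_comp_score_pcnt as in the original question, and it includes the closing parenthesis for #3. The toy values are illustrative only, not the poster's data.

```stata
clear
* Toy data (hypothetical): six non-missing scores and two missings
input gread_comp_score_pcnt
0
.2
.4
.6
.8
1
.
.
end

* 1. restrict to non-missing using the fact that missing sorts highest
gen myvar1 = (gread_comp_score_pcnt > .79) if gread_comp_score_pcnt < .

* 2. same restriction, written with missing()
gen myvar2 = (gread_comp_score_pcnt > .79) if !missing(gread_comp_score_pcnt)

* 3. cond() returns missing explicitly when the score is missing
gen myvar3 = cond(missing(gread_comp_score_pcnt), ., (gread_comp_score_pcnt > .79))

* 4. division by zero yields missing, division by one leaves the indicator
gen myvar4 = (gread_comp_score_pcnt > .79) / (!missing(gread_comp_score_pcnt))

* All four versions should agree row by row, including on the missings
assert myvar1 == myvar2 & myvar2 == myvar3 & myvar3 == myvar4
tab myvar1, missing
```

Each version should leave the indicator missing wherever the score is missing, rather than coding those rows as 1 or 0.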
On Mon, Apr 11, 2011 at 11:15 PM, Nick Cox <[email protected]> wrote:
> The underlying problem can be illustrated by sorting. Suppose we
> -sort- a variable, which contains missings, in numeric order. Where do
> the missings go? We need a decision: either missing is regarded as
> larger than any non-missing, or smaller than any non-missing. Stata
> made the first decision.
>
> Anyway, here are some solutions:
>
> gen myvar1 = (gread_comp_score_pcnt>.79) if gread_comp_score_pcnt < .
>
> gen myvar2 = (gread_comp_score_pcnt>.79) if !missing(gread_comp_score_pcnt)
>
> gen myvar3 = cond(missing(gread_comp_score_pcnt), ., (gread_comp_score_pcnt > .79)
>
> gen myvar4 = (gread_comp_score_pcnt > .79) / (!missing(gread_comp_score_pcnt))
>
> (5. don't throw away information by turning a measure into an indicator!)
>
> Nick
>
> On Mon, Apr 11, 2011 at 11:01 PM, Michael Costello
> <[email protected]> wrote:
>> Statalisters,
>>
>> I recently ran into a problem with the following dataset:
>>
>> . tab gread_comp_score_pcnt, m
>> gread_comp_ |
>>  score_pcnt |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>           0 |        150        7.50        7.50
>>          .2 |         85        4.25       11.75
>>          .4 |         97        4.85       16.60
>>          .6 |         82        4.10       20.70
>>          .8 |         72        3.60       24.30
>>           1 |         15        0.75       25.05
>>           . |      1,499       74.95      100.00
>> ------------+-----------------------------------
>>       Total |      2,000      100.00
>>
>> The high number of "missing" is by design, a by-product of a
>> horizontally structured dataset that I have yet to rectify.
>>
>> When I run the command:
>> gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79)
>> I am left with
>>
>> . tab gread_comp_score_pcnt80, m
>> gread_comp_ |
>> score_pcnt8 |
>>           0 |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>           0 |        414       20.70       20.70
>>           1 |      1,586       79.30      100.00
>> ------------+-----------------------------------
>>       Total |      2,000      100.00
>>
>> As you can see, the 87 values above .79 were set to 1, but so were all
>> the missing values!! I have toyed with the code a bit, trying
>> variations such as
>> . gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79 & gread_comp_score_pcnt!=.)
>> but that converts all the missing to 0's, which is only marginally better.
>>
>> So the question is, is there some way to use a single, precise line of
>> code to create eighty-seven 1's, four hundred fourteen 0's and 1499
>> Missing values in one dummy variable? I know I can do it with several
>> lines of code, but I'm looking for something more concise, as it needs
>> to run many hundreds of times.
>>
>
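
The behaviour in the original question, where every missing value ended up coded 1, follows from Stata treating numeric missing as larger than any non-missing number. A minimal sketch on a made-up five-row dataset may make this concrete. The variable names x, naive and guarded are hypothetical and chosen only for illustration.

```stata
clear
* Toy data (hypothetical): four non-missing values and one missing
input x
0
.2
.8
1
.
end

* Missing (.) compares as larger than any non-missing number,
* so (x > .79) evaluates to 1 for the missing row as well
gen byte naive = (x > .79)

* -sort- places the missing value last for the same reason
sort x
list x naive, noobs

* Adding an -if- qualifier leaves the result missing where x is missing
gen byte guarded = (x > .79) if !missing(x)
list x naive guarded, noobs
```

The design choice in the second -generate- is to exclude missings through the -if- qualifier rather than inside the logical expression. Folding `& x != .` into the expression itself makes it evaluate to 0 for missings, which is the outcome the original poster wanted to avoid.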