
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: limitations of "generate" with missing data


From   Steven Samuels <[email protected]>
To   [email protected]
Subject   Re: st: limitations of "generate" with missing data
Date   Tue, 12 Apr 2011 00:02:53 -0400

Michael, lest you think this problem is unique to Stata, I would add that SAS sorts missing values before, not after, non-missing ones. SPSS will sort some missing values ("user-defined"), but not others ("system missing").

Steve
[email protected]





On Apr 11, 2011, at 6:15 PM, Nick Cox wrote:

The underlying problem can be illustrated by sorting. Suppose we
-sort- a variable, which contains missings, in numeric order. Where do
the missings go? We need a decision: either missing is regarded as
larger than any non-missing, or smaller than any non-missing. Stata
made the first decision.
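
A minimal sketch of that decision in action, using a made-up five-observation variable x (not the original data). Because missing counts as larger than any number, a test such as x > .79 is also true for missing values unless they are excluded explicitly:

```stata
* Toy data, invented for illustration only
clear
input x
.4
.
1
0
.8
end

* After sorting, the missing value is listed last
sort x
list x

* Missing is treated as larger than .79, so it is counted here
count if x > .79

* Excluding missing explicitly counts only .8 and 1
count if x > .79 & x < .
```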

Anyway, here are some solutions:

gen myvar1 =  (gread_comp_score_pcnt>.79) if gread_comp_score_pcnt < .

gen myvar2 =  (gread_comp_score_pcnt>.79) if !missing(gread_comp_score_pcnt)

gen myvar3 = cond(missing(gread_comp_score_pcnt), ., (gread_comp_score_pcnt > .79))

gen myvar4 = (gread_comp_score_pcnt > .79) / (!missing(gread_comp_score_pcnt))

(5. don't throw away information by turning a measure into an indicator!)
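
As a quick check that the four forms agree, here is a sketch on a made-up four-observation variable. The names score and ind1-ind4 are hypothetical, and the byte storage type is an addition that the original commands did not specify. The fourth form relies on Stata returning missing for division by zero.

```stata
* Toy data, invented for illustration only
clear
input score
.2
.8
1
.
end

gen byte ind1 = (score > .79) if score < .
gen byte ind2 = (score > .79) if !missing(score)
gen byte ind3 = cond(missing(score), ., score > .79)
gen byte ind4 = (score > .79) / (!missing(score))

* Each indicator should be 0, 1, 1, and missing, in that order
list score ind1-ind4

* Stata treats missing == missing as true, so this holds on every row
assert ind1 == ind2 & ind2 == ind3 & ind3 == ind4
```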

Nick

On Mon, Apr 11, 2011 at 11:01 PM, Michael Costello
<[email protected]> wrote:
> Statalisters,
> 
> I recently ran into a problem with the following dataset:
> 
> . tab  gread_comp_score_pcnt, m
> gread_comp_ |
>  score_pcnt |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>          0 |        150        7.50        7.50
>         .2 |         85        4.25       11.75
>         .4 |         97        4.85       16.60
>         .6 |         82        4.10       20.70
>         .8 |         72        3.60       24.30
>          1 |         15        0.75       25.05
>          . |      1,499       74.95      100.00
> ------------+-----------------------------------
>      Total |      2,000      100.00
> 
> The high number of "missing" is by design, a by-product of a
> horizontally structured dataset that I have yet to rectify.
> 
> When I run the command:
> gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79)
> I am left with
> 
> . tab  gread_comp_score_pcnt80, m
> gread_comp_ |
> score_pcnt8 |
>          0 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>          0 |        414       20.70       20.70
>          1 |      1,586       79.30      100.00
> ------------+-----------------------------------
>      Total |      2,000      100.00
> 
> As you can see, the 87 values above .79 were set to 1, but so were all
> the missing values!!  I have toyed with the code a bit, trying
> variations such as
> . gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79 &
> gread_comp_score_pcnt!=.)
> but that converts all the missing to 0's, which is only marginally better.
> 
> So the question is, is there some way to use a single, precise line of
> code to create eighty-seven 1's, four hundred fourteen  0's and 1499
> Missing values in one dummy variable?  I know I can do it with several
> lines of code, but I'm looking for something more concise, as it needs
> to run many hundreds of times.
> 
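
If "many hundreds of times" means many similar variables, one possible sketch is to wrap a single -generate- in a loop. This is an assumption about the use case; the variable names score_a and score_b and the _80 suffix are hypothetical.

```stata
* Toy data, invented for illustration only
clear
input score_a score_b
.2 .
.8 1
.  .4
end

* One indicator per score variable; missing stays missing
foreach v of varlist score_a score_b {
    gen byte `v'_80 = (`v' > .79) if !missing(`v')
}

list
```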

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



