Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Dropping 1% observations, but numbers do not match
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: Dropping 1% observations, but numbers do not match
Date: Wed, 10 Apr 2013 12:15:05 +0100
It's not a good idea to use code you don't understand!
I understand you as indicating that you are unclear about what [_N]
implies under -by:-. My numbered point #3 put it in words. I wrote a
tutorial which is easily accessible (there's a .pdf online, as below),
so I won't add to what I have written.
SJ-2-1 pr0004 . . . . . . . . . . Speaking Stata: How to move step by: step
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q1/02 SJ 2(1):86--102 (no commands)
explains the use of the by varlist : construct to tackle
a variety of problems with group structure, ranging from
simple calculations for each of several groups to more
advanced manipulations that use the built-in _n and _N
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
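A minimal sketch of the idea, as an illustration on the auto data rather than code taken from the tutorial: under -by:-, _N is the number of observations in the current group, so once the data are sorted within group, var[_N] is the last (largest) value in that group. The variable names ngroup and maxmpg are invented for this example.

```stata
sysuse auto, clear
* sort on mpg within each group of foreign
* _N is the group size; mpg[_N] is the last (largest) mpg in the group
* (missing values sort last, so a group with missings would give missing here)
bysort foreign (mpg): generate ngroup = _N
bysort foreign (mpg): generate maxmpg = mpg[_N]
* one row per group to inspect the result
tabulate foreign, summarize(maxmpg)
```

Every observation in a group gets the same value of maxmpg, which is why a condition on var[_N] applies to whole groups at once.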
Technique for what you are asking is exemplified by
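sysuse auto
* r(p1) holds the 1% point at full precision
su mpg, detail
scalar p1 = r(p1)
* check how many observations qualify before dropping them
count if mpg <= p1
drop if mpg <= p1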
but I can't write it down without flagging that I don't recommend
-drop-ping like this.
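To see why a -bysort- command of the form quoted below can remove fewer observations than expected, here is a hedged sketch contrasting the two forms on the auto data. rep78 stands in for the identifier purely for illustration, and alllow, n_uncond and n_cond are made-up names. The first count ignores groups; the second mimics the var[_N] form, which flags a group only if its largest value is at or below the threshold.

```stata
sysuse auto, clear
summarize mpg, detail
scalar p1 = r(p1)
* show the stored value in hexadecimal: what Stata holds, not what is displayed
display %21x scalar(p1)
* unconditional: every observation at or below the 1% point
count if mpg <= scalar(p1)
scalar n_uncond = r(N)
* conditional on the group: flag a group only when its largest mpg is <= p1
bysort rep78 (mpg): generate byte alllow = mpg[_N] <= scalar(p1)
count if alllow
scalar n_cond = r(N)
display "unconditional: " scalar(n_uncond) "   conditional on group: " scalar(n_cond)
```

The conditional count can never exceed the unconditional one, because every observation in a flagged group is itself at or below the threshold. Neither count need equal 1% of the number of observations when there are ties.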
Nick
[email protected]
On 10 April 2013 12:01, Miguel Angel Duran <[email protected]> wrote:
> Thank you very much, Nick, for your quick answer. Just one additional
> question, if you don't mind. How would I drop unconditionally on the
> identifier? And in relation to this (given your answer, just to be sure I
> got it right), when an expression like "var[_N]" is used, what does it
> exactly mean?
Nick Cox
> Numerous problems here, at least potentially.
>
> 0. Dropping outliers defined by an arbitrary threshold is not everyone's
> idea of good data analysis practice. If you want comments on what is
> "right", this needs defending.
>
> 1. Just because 0.0388193 is reported as the 1% point does not mean that
> exactly 1% of observations have that value or less, even in a situation
> where 1% of the number of observations is an integer. There could be ties.
>
> 2. Precision. 0.0388193 can't be held exactly as a binary number.
> Perhaps what is reported as that is really something else, e.g.
>
> . di %21x 0.0388193
> +1.3e01f8fe83ff0X-005
>
> . di %21x 0.03881931
> +1.3e01fe5ce7b79X-005
>
> . di %21x 0.03881929
> +1.3e01f3a020467X-005
>
> The number of decimal places you see does not correspond to what Stata holds
> in storage.
>
> 3. You are dropping if and only if _all_ values for each identifier are less
> than or equal to your threshold. But that would leave in the data any such
> values if there were a greater value for the same identifier. That is, you
> are dropping conditionally on the identifier, not unconditionally.
>
> Nick
> [email protected]
>
> On 10 April 2013 11:22, Miguel Angel Duran <[email protected]> wrote:
>
>> Will you please help me to know that what I am doing is right? To
>> eliminate outliers, I am trying to drop 1% of the observations with the
>> lowest values.
>> To do so I use 'bysort entity (rcon1410a): drop if rcon1410a[_N] <=
>> 0.0388193'. Note that 'entity' is id, 'rcon1410a' is the relevant
>> variable, and 1% of the observations has a value that is lower than
>> 0.0388193 (this value is obtained from 'sum rcon1410a, detail'). Since
>> I have 415,000 observations, I should be dropping 1%*415,000=4,150.
>> Nevertheless, Stata informs me that using the abovementioned command I
>> have dropped 400 observations. Is this all right?