| |
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
st: RE: RE: IQR
I spent a while updating -iqr- to -iqr8-.
This was unnecessary, because -iqr- works
fine under version control. (How many programs
would run without change in other software after 16 years?)
Nevertheless, few Stata users will now be accustomed to reading
or writing Stata like this:
---------------------------- iqr (L. Hamilton)
*Outliers and summary statistics for simple normality check.
program define iqr
version 2.1
capture drop tempvar
mac def _varlist "required existing max(1)"
mac def _if "optional"
mac def _in "optional"
parse "%_*"
quietly generate tempvar=%_1 %_if %_in
quietly summ tempvar, detail
macro define NN=_result(1)
display
#delimit ;
di in gr " mean= " in ye %6.0g _result(3) in gr _col(20)
" std.dev.= " in ye %6.0g sqrt(_result(4))
in gr _col(50) " (n= " in ye _result(1) in gr ")";
di in gr " median= " in ye %6.0g _result(10)
in gr _col(20) "pseudo std.dev.= "
in ye %6.0g (_result(11)-_result(9))/1.349
in gr _col(50) " (IQR= " in ye %6.0g _result(11)-_result(9)
in gr ")";
mac define _lowfen1=_result(9)-1.5*(_result(11)-_result(9));
mac define _hifen1=_result(11)+1.5*(_result(11)-_result(9));
mac define _lowfen2=_result(9)-3*(_result(11)-_result(9));
mac define _hifen2= _result(11)+3*(_result(11)-_result(9));
quietly summ tempvar if tempvar>_result(8) & tempvar<_result(12);
di in gr "10 trim= " in ye %6.0g _result(3);
di in gr _col(48) "low " _col(60) "high";
di in gr _col(48) "-------------------";
di in gr _col(28) " inner fences " in ye _col(48) %6.0g %_lowfen1
_col(60) %6.0g %_hifen1;
quietly count if tempvar<%_lowfen1 & tempvar>=%_lowfen2 & tempvar~=.;
macro define _loout1=_result(1);
quietly count if tempvar>%_hifen1 & tempvar<=%_hifen2 & tempvar~=.;
macro define _hiout1=_result(1);
di in gr _col(28) "# mild outliers " in ye _col(48) %_loout1
_col(60) %_hiout1;
di in gr _col(28) "% mild outliers "
in ye _col(48) %4.2f 100*%_loout1/%NN in gr "%"
in ye _col(60) %4.2f 100*%_hiout1/%NN in gr "%";
display;
di in gr _col(28) " outer fences " in ye _col(48) %6.0g %_lowfen2
_col(60) %6.0g %_hifen2;
quietly count if tempvar<%_lowfen2 & tempvar~=.;
macro define _loout2=_result(1);
quietly count if tempvar>%_hifen2 & tempvar~=.;
macro define _hiout2=_result(1);
di in gr _col(28) "# severe outliers " in ye _col(48) %_loout2
_col(60) %_hiout2;
di in gr _col(28) "% severe outliers "
in ye _col(48) %4.2f 100*%_loout2/%NN in gr "%"
in ye _col(60) %4.2f 100*%_hiout2/%NN in gr "%";
display;
capture drop tempvar;
#delimit cr
end
---------------------
This is not a literal translation. I changed some small
details as a matter of taste. In 16 years, there are also
some changes in how Stata programmers would typically do things.
--------------------- iqr8
*! NJC 1.0.0 6 June 2007
* Stata 8 rendering of -iqr- from STB-3 sed4 1991 for Stata 2.1
* Outliers and summary statistics for simple normality check
program iqr8
version 8
syntax varname(numeric) [if] [in]
marksample touse
local v "`varlist'"
quietly su `v' if `touse', detail
local N = r(N)
di _n as txt " mean = " as res %8.0g r(mean) ///
as txt "{col 29}std dev. = " as res %8.0g r(sd) ///
as txt "{col 53}n = " as res %8.0f r(N)
di as txt " median = " as res %8.0g r(p50) ///
as txt "{col 22}pseudo std dev. = " ///
as res %8.0g (r(p75) - r(p25)) / 1.349 ///
as txt "{col 51}IQR = " as res %8.0g r(p75) - r(p25)
tempname lowfen1 hifen1 lowfen2 hifen2
scalar `lowfen1' = r(p25) - 1.5 * (r(p75) - r(p25))
scalar `hifen1' = r(p75) + 1.5 * (r(p75) - r(p25))
scalar `lowfen2' = r(p25) - 3 * (r(p75) - r(p25))
scalar `hifen2' = r(p75) + 3 * (r(p75) - r(p25))
su `v' if `v' > r(p10) & `v' < r(p90) & `touse', meanonly
di as txt "10 trim = " as res %8.0g r(mean)
di as txt "{col 48}low" "{col 62}high"
di as txt "{col 48}{hline 22}"
di as txt "{col 33}inner fences " ///
as res "{col 48}" %8.0g `lowfen1' "{col 62}" %8.0g `hifen1'
quietly count if `v' < `lowfen1' & `v' >= `lowfen2' & `touse'
local loout1 = r(N)
quietly count if `v' > `hifen1' & `v' <= `hifen2' & `touse'
local hiout1 = r(N)
di as txt "{col 30}# mild outliers" ///
as res "{col 48}`loout1'" "{col 62}`hiout1'"
di as txt "{col 30}% mild outliers" ///
as res "{col 48}" %4.2f 100 * `loout1' / `N' ///
as res "{col 62}" %4.2f 100 * `hiout1' / `N'
di _n as txt "{col 33}outer fences" ///
as res "{col 48}" %8.0g `lowfen2' "{col 62}" %8.0g `hifen2'
quietly count if `v' < `lowfen2' & `touse'
local loout2 = r(N)
quietly count if `v' > `hifen2' & `touse'
local hiout2 = r(N)
di as txt "{col 28}# severe outliers" ///
as res "{col 48}`loout2'" "{col 62}`hiout2'"
di as txt "{col 28}% severe outliers" ///
as res "{col 48}" %4.2f 100 * `loout2' / `N' ///
as res "{col 62}" %4.2f 100 * `hiout2' / `N'
end
-----------------------
Nick
[email protected]
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Nick Cox
> Sent: 06 June 2007 23:19
> To: [email protected]
> Subject: st: RE: IQR
>
>
> I think you are referring to -iqr- from STB-3. The FAQ
> advises
>
> "Say what command(s) you are using. If they are not part of
> official Stata, say where they come from: the STB/SJ, SSC, or
> other archives."
>
> As you did not follow this advice, I had to puzzle out
> which command you were referring to. No doubt various
> other members of the list fell at the first fence.
>
> Now as to your question: I do not understand what you
> do not understand. The help for -iqr- looks very helpful
> to me. It includes these definitions:
>
> ============================================================
> IQR (Interquartile Range) = 75th percentile - 25th percentile
> Pseudo standard deviation = IQR/1.349
> 10% trim mean = Average of cases between 10th and
> 90th percentiles
> Inner fences = Q(25)-1.5IQR and Q(75)+1.5IQR
> Outer fences = Q(25)-3IQR and Q(75)+3IQR
> Mild outlier = Q(25)-3IQR <= x < Q(25)-1.5IQR or
> Q(75)+1.5IQR < x <= Q(75)+3IQR
> Severe outlier = x < Q(25)-3IQR or x > Q(75)+3IQR
> =============================================================
>
> Thus a "severe outlier" lies more than 3 IQR away from the nearer
> quartile and a "mild outlier" lies more than 1.5 (but not
> more than 3)
> IQR away from the nearer quartile.
>
> These definitions go back to J.W. Tukey. 1977. Exploratory data
> analysis. Reading, MA: Addison-Wesley, except that the definitions
> of quartiles Stata uses are documented at [R] summarize.
>
> These are arbitrary limits. Their main interest is that they are
> sometimes used in boxplots to determine which data points should
> be shown individually.
>
> That said, "getting rid" of severe outliers is, in my view, not
> usually a good idea unless there is independent evidence that
> the data are wholly untrustworthy (e.g. a laboratory record that
> the experiment was grossly disturbed). Dropping values more than
> 3 IQR away from the nearer quartile will in most instances throw
> out important information. It would throw away most major cities
> compared with cities in their country.
>
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/