> Hi stata users,
>
> I am trying to run a stata program to detecet outlier in my data set. I found 2 grubbs programs written in stata. Programs are here:
>
> Program # 1.
> _______________________________program begins_____________________________________________
**************************************
* This is grubbs.ado beta version
* Date: Jan, 20,2007
* Version: 1.1
*
* Questions, comments and bug reports :
* [email protected]
*
* Version history :
* v1.1: - Bug correction (odd behavior when -if- is specified).
* - Changes in the default variable names (grubbs_xx2, ...)
* v1.0: Initial release.
* Initial code by A.-C. Disdier and K. Head (available at http://strategy.sauder.ubc.ca/head/grubbs.ado)
**************************************
set more off
cap prog drop grubbs
program grubbs
version 8.0
syntax [varlist(default=none)] [if] [in], [GENerate(string)] [DRop] [LOg] [ITer(integer 16000)] [LEVel(real 95.0)]
********************
* Verifying syntax
********************
if "`varlist'"=="" {
di as error "varlist required"
exit 198
}
if `level'>100 | `level'<1 {
di as error "level() specifies the confidence level, as a percentage. It must be between 1.0 and 100.0"
exit
}
scalar conf=(100-`level')/100
if `iter'<0 {
di as error "iter() must be an integer above 0"
exit
}
if "`drop'"!="" & "`generate'"!="" {
di as error "drop skipped because of generate()"
local drop=""
}
marksample touse
********************
* Grubbs procedure
********************
scalar nbvar=wordcount("`varlist'")
scalar nbnewvar=wordcount("`generate'")
if nbnewvar!=nbvar & nbnewvar!=0 {
di as error "Number of variable names in generate() not equal to number of var, skip to default names"
local generate=""
}
tokenize `"`generate'"'
foreach var of local varlist {
di as result "Variable: `var' " _continue
tempvar centred varmq
local varname="grubbs_`var'"
if ("`generate'"!="") local varname="`1'"
capture confirm new var `varname'
local tempvarname="`varname'"
local i=1
while _rc==110 {
local varname="`tempvarname'`i'"
local i=`i'+1
capture confirm new var `varname'
}
di "(0/1 variable recording which observations are outliers: `varname')."
gen byte `varname'=0
local i = 1
gen byte `varmq' =(`var'==. | `touse'!=1)
scalar cutoff = 10
scalar G = cutoff +1
while G > cutoff & `i'<= `iter' {
qui sum `var' if `varname' == 0 & `touse'
gen `centred' = (abs(`var' -r(mean)))/r(sd)
gsort -`varmq' -`varname' `centred'
scalar cutoff = (r(N)-1)*sqrt(invttail(r(N)-2,conf/(2*r(N)))^2/(r(N)*(r(N)-2+invttail(r(N)-2,conf/(2*r(N)))^2)))
scalar G = `centred'[_N]
if ("`log'"!="" & G > cutoff) {
di as txt "Iteration = " `i' ". T-value: " %5.4f G " so " `var'[_N] " is an outlier"
}
qui replace `varname'= 1 if `centred' == G & G > cutoff
local i = `i'+1
drop `centred'
}
local j=`i'-2
if (`i'<=`iter') di as result "`j' outliers. No more outliers"
else di as error "more than `j' outliers. Increase number of iterations"
drop `varmq'
mac shift
}
if "`drop'"!="" {
tempvar del
egen `del'=rowtotal(grubbs_*)
drop if `del'!=0
drop `del'
capture drop grubbs_*
}
end
> ____________________________________end of program__________________________________
>
>
> Program #2.
> __________________________________________program begins___________________________
> program define grubbs
> * this is a revised version of the original command, it no longer deletes missing obs or outlier
> * instead, it sets "tag_grubbs" =1 if it believes the obs to be an outlier
> * usage: "grubbs myvar .05 10"
> version 8.0
> *arguments:
> * 1= Name of variable
> * 2= Confidence interval (0.05 or 0.01)
> * 3= Max number of iterations
> args xvar conf maxit
> tempvar dev missx
> gen byte tag_grubbs = 0>
> local i = 1
> di "deleting missing values"
> gen byte missx = `xvar'==.
> * initial guess for critical value
> scalar Gcrit = 10
> * start with G > Gcrit (otherwise loop will not begin)
> scalar G = Gcrit +1
> di "maxit = " `maxit'
> di "G= " G
> di "Gcrit = " Gcrit
> while G > Gcrit & `i'<= `maxit' {
> sum `xvar' if tag_grubbs == 0
> local nobs = r(N)
> gen `dev' = (abs(`xvar' -r(mean)))/r(sd)
> gsort -`missx' -tag_grubbs `dev'
> scalar G = `dev'[_N]
> local ct = `conf'/(2*`nobs')
> local ts = invttail(`nobs'-2,`ct')
> scalar Gcrit = (`nobs'-1)*sqrt(`ts'^2/(`nobs'*(`nobs'-2+`ts'^2)))
> di "Iteration = " `i' " Critical G = " Gcrit " Current G = " G
> if (G > Gcrit) di `xvar'[_N] " is an outlier, so tag_grubbs = 1"
> replace tag_grubbs = 1 if `dev' == G & G > Gcrit
> local i = `i'+1
> drop `dev'
> }
>
> if (`i'<=`maxit') di "Grubbs procedure terminated: no more outliers"
> else di "Maximum iterations exceeded: Use larger maxit"
> end
> ___________________________________end of program__________________________________________
>
> Could anyone suggest me which program is better to use. I will appreciate if you please use auto data for variable price as an example to run these programs.
>
> Thanks.
>
> Badri Prasad
> Policy, Reporting and Data Development
> Labour Standards and Workplace Equity
> National Labour Operations Directorate
> HRSDC
> (819) 956 - 8146
>
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/