st: Strange results with cluster option
From
"Bottan, Nicolas Luis" <[email protected]>
To
"[email protected]" <[email protected]>
Date
Mon, 27 Sep 2010 10:22:58 -0400
Hi everyone,
I'm getting strange results when using the cluster option with OLS: the standard error increases as cluster size increases (there is large heterogeneity in cluster size).
I am attaching a simple Monte Carlo simulation in Stata to check whether the cluster option is working fine.
I construct a simple example where an outcome Y is the sum of a school random variable and a student random variable. Both have mean 0 and standard deviation 1.
In each simulation I test the null hypothesis that the mean of Y is zero. Because the null hypothesis is true, it should be rejected only about 5% of the time. Using the cluster option in Stata, it is rejected around 35% of the time. Alternatively, when I collapse the data to the school level and then regress Y on a constant (giving the same weight to all schools), the null is rejected 4% of the time.
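As a rough check on what the two point estimates look like under this design, their variances can be worked out directly from the model as stated (unit variances, 99 schools with one student each and one school with 101). This is only a sketch under those assumptions; the labels "pooled" and "equal" are introduced here for illustration and do not appear in the do-file.

```latex
% Model from the post: Y_{ij} = V_j + U_{ij}, Var(V_j) = Var(U_{ij}) = 1, V and U independent.
% J = 100 schools, n_1 = ... = n_{99} = 1, n_{100} = 101, so N = 200 students.
\begin{align*}
\bar{Y}_{\text{pooled}}
  &= \frac{1}{N}\sum_{j=1}^{J}\sum_{i=1}^{n_j} Y_{ij}
   = \frac{1}{N}\Big(\sum_{j=1}^{J} n_j V_j + \sum_{j=1}^{J}\sum_{i=1}^{n_j} U_{ij}\Big), \\
\operatorname{Var}\big(\bar{Y}_{\text{pooled}}\big)
  &= \frac{\sum_{j} n_j^{2} + N}{N^{2}}
   = \frac{(99 + 101^{2}) + 200}{200^{2}}
   = \frac{10500}{40000} = 0.2625
   \quad (\text{s.d.} \approx 0.512), \\
\bar{Y}_{\text{equal}}
  &= \frac{1}{J}\sum_{j=1}^{J} \bar{Y}_{j},
  \qquad \operatorname{Var}\big(\bar{Y}_{j}\big) = 1 + \frac{1}{n_j}, \\
\operatorname{Var}\big(\bar{Y}_{\text{equal}}\big)
  &= \frac{1}{J^{2}}\sum_{j=1}^{J}\Big(1 + \frac{1}{n_j}\Big)
   = \frac{99 \cdot 2 + 1 + \tfrac{1}{101}}{100^{2}}
   \approx 0.0199
   \quad (\text{s.d.} \approx 0.141).
\end{align*}
```

In the pooled mean the large school carries weight 101/200, so a single draw of V accounts for most of the sampling variance; in the equally weighted mean every school carries weight 1/100.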
Any thoughts?
Thanks!
Here is the code:
* THIS DO FILE RUNS A MONTE CARLO SIMULATION TO CHECK WHETHER THE CLUSTER OPTION OF THE REG COMMAND IN STATA IS
* WORKING WELL. IT ALSO CHECKS TWO ALTERNATIVE WAYS TO ESTIMATE STANDARD ERRORS WHEN OBSERVATIONS ARE CLUSTERED.
* TO THAT END, IT ASSUMES THAT:
* Yij=Vj+Uij
* where Y is some outcome variable defined at the student level, Vj is a school effect and Uij is a student effect
* V and U are independent and normally distributed with mean 0 and standard deviation 1.
* In the data, there are 100 schools. In 99 schools there is only one observation of a student. In one school there are
* observations of 101 students
* We test the null hypothesis that the mean of Y is zero. By construction this null is true. Then, we run 500 simulations
* and we record in how many cases we reject the null under three different estimation strategies. In the first one we
* use the cluster option in the regression command. In the second one we collapse the data at the school level (averaging Y)
* and then run a regression of Y on a constant weighting observations by the number of students in the school. In the third
* one we do the same procedure as in the second one but we give the same weight to all 100 schools
* As we run 500 simulations, each estimation strategy, if it is working well, should reject the
* null approximately 25 times at the 5% level
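set seed 111111
* Number of simulations
local nsims=500
* Rejection counters for the three estimation strategies
local ctarech1=0
local ctarech2=0
local ctarech3=0
foreach it of numlist 1/`nsims' {
    qui {
        clear
        * 200 students: schools 1-99 have one student each, school 100 has 101
        set obs 200
        gen j=_n
        replace j=100 if j>100
        bysort j: gen i=_n
        gen v=rnormal()
        gen u=rnormal()
        * Make v constant within school: set all but the first student's draw to -10,
        * then take the school maximum, which is the first student's draw
        replace v=-10 if i>1
        egen aux=max(v),by(j)
        gen v2=aux
        replace v=v2
        drop aux v2
        gen y=v+u
        * Strategy 1: student-level regression with clustered standard errors
        reg y,cluster(j)
        local a=abs(_b[_cons]/_se[_cons])
        if `a'>1.96 {
            local ctarech1=`ctarech1'+1
        }
        * Strategy 2: collapse to school means, weight by number of students
        gen count=1
        collapse y (sum) count,by(j)
        reg y [pw=count]
        local a=abs(_b[_cons]/_se[_cons])
        if `a'>1.96 {
            local ctarech2=`ctarech2'+1
        }
        * Strategy 3: school means, equal weight for all 100 schools
        reg y
        local a=abs(_b[_cons]/_se[_cons])
        if `a'>1.96 {
            local ctarech3=`ctarech3'+1
        }
    }
}
* The loop macro `it' is not defined outside the loop, so report the simulation count instead
display "it=`nsims'"
display "ctarech1=`ctarech1'"
display "ctarech2=`ctarech2'"
display "ctarech3=`ctarech3'"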