Eva Poen <[email protected]> writes,
> [...] I am working on an analysis of some panel data. [...] the
> observations in each period [are not] independent [...] Observations
> (called "subjects") are organised in groups (of 3 or 4 people), which are
> constant over time. Subjects within the groups are dependent (because they
> strategically interact), but groups are independent.
>
> What I plan to do now is some sort of 'bootstrapping'. As I obviously cannot
> include all observations at a time in regressions, I want to draw random
> samples from my data in which only one subject of each group appears at a
> time, and do my analyses on theses independent observations. The goal is to
> repeat this over and over and then to compare results.
>
> Now two questions on this:
>
> 1) After reading manuals and FAQ's and trying a bit around I could not
> find a possibility to do this with Stata's bootstrap capabilities. I
> would be very happy if someone knew a solution to this very special
> kind of bootstrapping programming issue.
>
> 2) I talked to several econometricians on this subject, but they could not
> really tell if this procedure gives valid results on point estimates
> and confidence intervals. Any comments on those statistical issues are
> very much appreciated.
I believe what Eva proposes is statistically valid.
1. The logic of the bootstrap
------------------------------
The mechanics of the bootstrap work like this:
M1. With some dataset D, we estimate a model to obtain estimates b.
M2. To obtain standard errors for b, we repeatedly draw samples
with replacement from D of the same sample size, reestimate the model
to obtain estimates b_i for the i-th resampling, and we calculate the
standard deviations of b_i. We use those standard deviations as the
standard errors for b.
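As a rough illustration of steps M1 and M2, here is a minimal do-file sketch.
It assumes the auto dataset shipped with Stata stands in for D, and that the
model is a simple regression; the file name bsres.dta, the seed, and the 200
replications are arbitrary choices made for the example.

```stata
* Sketch of steps M1 and M2, with the shipped auto dataset playing the role of D
sysuse auto, clear
regress mpg weight                    // M1: estimates b from D

set seed 12345                        // arbitrary seed, for reproducibility only
tempfile D
save `D'
postfile bsres b_weight using bsres.dta, replace
forvalues i = 1/200 {                 // 200 replications, an arbitrary choice
        qui use `D', clear
        qui bsample                   // draw _N observations with replacement
        qui regress mpg weight
        post bsres (_b[weight])       // b_i for the i-th resampling
}
postclose bsres
use bsres, clear
summarize b_weight                    // its Std. Dev. is the bootstrap standard error
```

The standard deviation reported by -summarize- is what M2 calls the standard
error for b.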
The justification of bootstrap step M2 is
J1. If we had access to the population P from which D was drawn, clearly
we could repeatedly draw samples of size N from that, calculate
standard deviations, and use those as standard errors. They would
be the standard errors if we did that an infinite number of times.
We draw samples of that size because we want to evaluate the
variance function at that sample size -- the sample size we used
in the production of b.
J2. We do not really have to do (J1) an infinite number of times;
a large number of times will yield approximate results that are
very good.
J3. We could use D as a proxy for P under the assumption that D is large
enough.
Justification J3 is important to appreciate. D had better be large, because
what we really need for step M2 is the population P and we are pretending that
P==D.
Note that, in performing the bootstrap, there is no accounting for how well
D approximates P. That step is all handwaving. The bootstrap produces
correct standard errors under the assumption that D==P.
Exercise 1
----------
We have dataset D1 of sample size N1 and we carry out steps M1 and M2.
Someone comes to us later and says they have a new dataset D2 (drawn from
the same population). It has N2>N1 observations. Can we reperform
M2 using D2? Should we?
Answer: Yes to both, but in reperforming step M2 with D2, we must be careful
to draw N1 observations, not N2. We want to evaluate the variance of the
estimator at a sample size of N1 because that is what we used to obtain b
in step M1. We should do this because N2 being larger than N1 means that
we can expect D2 to better reflect the population P than D1 did.
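Drawing N1 rather than N2 observations from D2 is a one-line change in Stata.
The sketch below assumes the auto dataset (74 observations) stands in for D2
and that N1 is 50; both are illustrative choices only.

```stata
* Sketch: resample at size N1 from a larger dataset D2
sysuse auto, clear                    // stands in for D2 (N2 = 74)
local N1 = 50                         // assumed size of the original sample D1
set seed 12345                        // arbitrary seed
bsample `N1'                          // draw N1, not _N, observations with replacement
count                                 // the resample has N1 observations
```

The size given to -bsample- must not exceed the number of observations in
memory, which is never a problem here because N1<N2.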
In fact, we can do something even better. We could combine the two datasets
and reperform step M2 drawing samples of N1 observations from N1+N2. D1+D2
should be an even better proxy for the population.
In fact, we can do something even better. We could go back and reperform
steps M1 and M2 using all N2 observations. We would estimate on N2 observations
and evaluate the variance function (step M2) at N2 observations.
In fact, we can do something even better. We could combine the two datasets
and reperform steps M1 and M2 using all N1+N2 observations.
But if for some reason we could not reperform M1, using a better proxy for
P can only make results more accurate.
Eva's problem
-------------
Eva has a dataset D with M independent groups and (say) 3 observations per
group, for a total dataset size of N=3*M. I will assume a fixed 3 observations
per group to make notation simpler, but nothing below hinges on the fact
that I have fixed the number of observations per group.
Eva is concerned about within-group correlation. There are estimators that
would perhaps "handle" the problem, but they require assumptions, and Eva
is so concerned about within-group correlation that she is willing to give
up efficiency to rid herself of the problem. She says: I will sample
one subject from each group and estimate my model on N/3 observations.
Fine. Let D1 be the dataset drawn from D on which Eva performs her
estimation; N1=N/3.
Eva now wants to calculate the bootstrap variance estimate for the estimate of
b that she obtains. The standard bootstrap way to do this would be to
repeatedly resample N/3 observations from D1. That will yield fine results
under the assumption that D1 is large enough to reflect the population of
groups.
Whether we meet that assumption is an interesting question. Even if D1 had an
infinite number of observations it would still not equal P because Eva has
told us that P has multiple observations per group.
However, let's consider two extremes: the correlation within group is (+/-)1
and the correlation within group is 0. In the first case, there is no
extra information in adding observations within group and so a sample of
one observation per group is sufficient. In the second case, observations
within group are independent and so, taking the limit as N->infinity,
a sample of N/3 is as good a proxy for P as a sample of N.
Eva in fact has a larger dataset D from which she could draw N/3 observations.
To substitute D for D1 in step M2, the right way to proceed is (1) draw a
sample of N/3 groups from D1 and (2) for each group selected, draw one of the
three observations available in D for the group. This turns out to be
equivalent to simply drawing N/3 observations from D because of the fixed
number of subjects per group that I assumed. If the number of subjects per
group varies, we must use the two-step sampling scheme.
Programming technique
---------------------
Eva cannot use -bs- or -bstrap-. Eva will have to build her own bootstrap
estimator. Eva formed D1 by selecting one subject per group, so we know
that D1 and D have the same groups. Thus, to form our bootstrap sample,
we can start with D, cluster sample the groups, and then select one subject
from each group. That is, with D in memory, we will
        bsample, cluster(group) idcluster(newgroup)
        gen u = runiform()
        sort newgroup u
        by newgroup: keep if _n==1
The rest of the program is the "standard stuff" to loop over replications,
perform the estimates, and post the results:
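        program define myboot
                args nreps
                // open the results file; one variable per saved coefficient
                postfile myres b1 b2 b3 ... using myres.dta, replace
                forvalues i=1(1)`nreps' {
                        qui use D, clear
                        // resample whole groups with replacement
                        qui bsample, cluster(group) idcluster(newgroup)
                        // keep one randomly chosen subject per resampled group
                        qui gen u = runiform()
                        sort newgroup u
                        qui by newgroup: keep if _n==1
                        qui <perform estimation>
                        post myres (_b[v1]) (_b[v2]) (_b[v3]) ...
                }
                // close the results file, then summarize the replications
                postclose myres
                use myres, clear
                summarize
        end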
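To try the approach end to end, here is a self-contained do-file sketch on
made-up data. Everything specific in it is an assumption for illustration:
100 groups of 3 subjects, a shared group effect g that makes subjects within
a group correlated, a linear model of y on x, 200 replications, and the file
names D.dta and myres.dta written to the current directory.

```stata
* Hypothetical data: 100 groups of 3 subjects with a shared group effect
clear
set seed 2003                         // arbitrary seed
set obs 100
gen group = _n
gen g = rnormal()                     // group effect, induces within-group correlation
expand 3                              // 3 subjects per group
gen x = rnormal()
gen y = 1 + 2*x + g + rnormal()
save D, replace

postfile myres b_x b_cons using myres.dta, replace
forvalues i = 1/200 {
        qui use D, clear
        qui bsample, cluster(group) idcluster(newgroup)   // resample groups
        qui gen u = runiform()
        sort newgroup u
        qui by newgroup: keep if _n==1                    // one subject per group
        qui regress y x
        post myres (_b[x]) (_b[_cons])
}
postclose myres
use myres, clear
summarize                             // Std. Dev. column gives the bootstrap standard errors
```

Because -idcluster()- gives each resampled group its own identifier, a group
drawn twice contributes two independently chosen subjects.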
-- Bill
[email protected]