Re: st: Re: Bootstrapping with unbalanced panel
From: Benjamin M Miller <[email protected]>
To: [email protected]
Subject: Re: st: Re: Bootstrapping with unbalanced panel
Date: Fri, 9 Aug 2013 20:16:51 -0700
I'm new to Statalist, so hopefully the below is appropriately documented.
I just finished dealing with a similar issue; there are related
problems in several command files. The -bsample- command may not
correctly specify -idcluster()-, but users of the -bootstrap- command
will continue to have problems even if that is resolved.
There have been many related threads on problems with the -cluster-
and -idcluster- options of the -bootstrap- command when using panel
data (e.g., http://www.stata.com/statalist/archive/2010-06/msg01295.html,
http://www.stata.com/statalist/archive/2006-05/msg00188.html,
http://www.stata.com/statalist/archive/2011-04/msg01348.html,
http://www.stata.com/statalist/archive/2010-12/msg00654.html).
Hopefully the below explanation sheds some light on these issues.
In a nutshell, the problem is this: When bootstrapping declared panel
data, each resampling requires the panel structure of the data to be
re-declared appropriately. -bootstrap- calls -_loop_bs- to loop over
this sampling process, which in turn calls -bsample- to do the actual
sampling. Even if -bsample- creates the correct -idcluster()-
variable, -_loop_bs- declares the panel structure with the original
panel variable and not the new variable created by -bsample-. Any
cluster drawn more than once then appears as a single panel with
duplicated years, so the result is, of course, "repeated time values
within panel - the most likely cause for this error is misspecifying
the cluster(), idcluster(), or group() option".
Here's some documentation:
I created a test dataset with two random variables X and Y
(distributed U(0,1), but that's unimportant). For a panel structure,
there are ten individuals (ID 1-10), each with ten observations (Year
2000 - 2009). Hence this dataset has 100 observations and looks like this:
Year   ID   X          Y
2000    1   0.984563   0.9534
2001    1   0.596068   0.67932
...
2009    1   0.363387   0.483985
2000    2   0.636904   0.89323
...
2008   10   0.41201    0.38558
2009   10   0.077976   0.231712
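For anyone who wants to follow along, a dataset with the same layout can be built in a few lines. This is only a sketch: the seed is an arbitrary assumption, so the X and Y values (and every estimate below) will not match the numbers shown in this post.

```stata
* Sketch of a 10 x 10 test panel; the seed is arbitrary, so values differ from the post
clear
set seed 12345
set obs 100
* ten individuals, ten consecutive rows each
generate ID = ceil(_n/10)
* years 2000-2009 within each individual
generate Year = 2000 + mod(_n - 1, 10)
* two independent U(0,1) variables
generate X = runiform()
generate Y = runiform()
```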
Now we start running some bootstrap commands. I've set the number of
repetitions to 2 because more is unnecessary for this point. These
three versions of -bootstrap- work just fine (I'll only show output for
the last one):
. bootstrap, reps(2): reg Y X
. bootstrap, reps(2) cluster(ID) idcluster(newID): reg Y X
. xtset ID Year
panel variable: ID (strongly balanced)
time variable: Year, 2000 to 2009
delta: 1 unit
. bootstrap, reps(2): reg Y X
(running regress on estimation sample)
Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..
Linear regression Number of obs = 100
Replications = 2
Wald chi2(1) = 0.00
Prob > chi2 = 0.9473
R-squared = 0.0001
Adj R-squared = -0.0101
Root MSE = 0.2874
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
Y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
X | -.0100076 .1514095 -0.07 0.947 -.3067647 .2867495
_cons | .5055374 .0887203 5.70 0.000 .3316488 .679426
------------------------------------------------------------------------------
Now let's keep the panel structure, but also cluster at the panel
variable level. Because we will inevitably draw some clusters more
than once, we use -idcluster(newID)- to declare that a new panel
variable should be created for each resample, and that it will be
called "newID". This variable should assign duplicate clusters unique
values. However, we find:
. xtset ID Year
panel variable: ID (strongly balanced)
time variable: Year, 2000 to 2009
delta: 1 unit
. bootstrap, reps(2) cluster(ID) idcluster(newID): reg Y X
(running regress on estimation sample)
Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
repeated time values within panel
the most likely cause for this error is misspecifying the cluster(),
idcluster(), or group() option
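The same failure can be reproduced by hand, outside of -bootstrap-, which may make the mechanism easier to see. The sketch below rebuilds a test panel (arbitrary seed, so it is not my actual data), draws one cluster resample with -bsample-, and then tries to declare the panel both ways. It replaces the data in memory with the resample.

```stata
* Rebuild a 10 x 10 test panel (arbitrary seed; not the data from the post)
clear
set seed 12345
set obs 100
generate ID = ceil(_n/10)
generate Year = 2000 + mod(_n - 1, 10)
generate X = runiform()
generate Y = runiform()

* One resample of whole clusters; newID gives every drawn copy its own identifier
bsample, cluster(ID) idcluster(newID)

* Declaring the panel on the ORIGINAL ID is expected to fail whenever some
* cluster was drawn twice, which is nearly certain with 10 draws from 10
capture noisily xtset ID Year

* Declaring the panel on newID should succeed
xtset newID Year
```

The first -xtset- mirrors what the stock loop does; the second mirrors what it would need to do.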
Here's a good question: Why didn't we get complaints of repeated time
values in case three (the one with declared panel data but without
clusters)? We still had declared panel data, and we should still
have had repeated time values within panel. The answer is as follows:
-_loop_bs- does declare panel data using the original panel variable
and not what you told it to in -idcluster()-. However, -bootstrap-
only passes the names of the time and panel variables to
-_loop_bs- when the -cluster()- option is declared. When -cluster()-
is not declared, the sampling routine doesn't know it is working with
panel data. Hence it doesn't complain about repeated time values
because it never declares the re-sample to be panel data. This means
you can't use things like lag operators, even on declared panel data:
. xtset ID Year
panel variable: ID (strongly balanced)
time variable: Year, 2000 to 2009
delta: 1 unit
. bootstrap, reps(2): reg Y X L.X
time-series operators are not allowed with bootstrap without panels, see tsset
I fixed this by creating -mybootstrap-, which always passes panel
information to -my_loop_bs-. (How does/should one share new or edited
.ado files? I assume most users don't want to replicate this
editing.) -my_loop_bs- then sets the variable specified in
-idcluster()- to uniquely identify duplicate clusters and uses that as
the panel variable for each re-sampling. Now the -idcluster()- option
is required for all panel data, and this seems to work.
. xtset ID Year
panel variable: ID (strongly balanced)
time variable: Year, 2000 to 2009
delta: 1 unit
. mybootstrap, reps(2) cluster(ID) idcluster(newID): reg Y X
(running regress on estimation sample)
Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..
Linear regression Number of obs = 100
Replications = 2
Wald chi2(1) = 0.33
Prob > chi2 = 0.5660
R-squared = 0.0001
Adj R-squared = -0.0101
Root MSE = 0.2874
(Replications based on 10 clusters in ID)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
Y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
X | -.0100076 .0174344 -0.57 0.566 -.0441783 .0241631
_cons | .5055374 .020016 25.26 0.000 .4663068 .544768
------------------------------------------------------------------------------
. mybootstrap, reps(2) idcluster(newID): reg Y X L.X
(running regress on estimation sample)
Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..
Linear regression Number of obs = 90
Replications = 2
Wald chi2(1) = .
Prob > chi2 = .
R-squared = 0.0034
Adj R-squared = -0.0195
Root MSE = 0.2903
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
Y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
X |
--. | .0555275 .0914356 0.61 0.544 -.1236831 .234738
L1. | -.0152741 .3603821 -0.04 0.966 -.72161 .6910618
|
_cons | .4729284 .265143 1.78 0.074 -.0467423 .9925992
------------------------------------------------------------------------------
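For readers who would rather not edit any .ado files, the same idea (re-declare the panel on the -idcluster()- variable inside every resample) can be approximated with a small wrapper program and -simulate-. To be clear, this is not -mybootstrap-; it is a rough sketch under my own assumptions: the program name panelboot, the seed, and the 50 replications are all made up for illustration, and it resamples whole clusters rather than individual observations.

```stata
* Rebuild a 10 x 10 test panel (arbitrary seed; not the data from the post)
clear
set seed 12345
set obs 100
generate ID = ceil(_n/10)
generate Year = 2000 + mod(_n - 1, 10)
generate X = runiform()
generate Y = runiform()

* Hypothetical wrapper: one cluster resample, panel re-declared on newID
capture program drop panelboot
program define panelboot, rclass
    preserve
    bsample, cluster(ID) idcluster(newID)
    xtset newID Year
    regress Y X L.X
    return scalar b_X  = _b[X]
    return scalar b_LX = _b[L.X]
    restore
end

* Collect the coefficients over 50 resamples; the standard deviations
* reported by -summarize- are the bootstrap standard errors
simulate b_X=r(b_X) b_LX=r(b_LX), reps(50) seed(2013): panelboot
summarize b_X b_LX
```

Note that -simulate- replaces the data in memory with one row per replication, so save the original data first if you need it afterwards.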
This solution sweeps a couple of more complex questions under the rug.
First, if we use an -idcluster()- approach on a sample that was not
selected at the cluster level (such as the lag example), we'd be
turning a balanced panel into an unbalanced panel, or an unbalanced
panel into a "less" balanced panel. My intuition says that because the
lags are missing at random, the resulting standard errors should be
fine. But I haven't thought about it deeply.
Second, even after all these fixes, you will still get error
messages when your regression includes panel-level fixed effects or
any other set of variables in which at least one variable is bound to
have no observations whenever some observations are not sampled.
For good reason, -bootstrap- does not return standard errors when the
set of independent variables has changed. You *can* still get
asymptotically accurate bootstrapped standard errors in this case, but
the edits to the .ado files are more complex. If there is demand for
that, I can write something up (I have a clunky but working version,
because that scenario is exactly what made me dig through all those
.ado files).
Hope that helps someone,
Ben