Title | Bootstrap with panel data
Author | Gustavo Sanchez, StataCorp
In statistics, the bootstrap is a resampling method used to approximate standard errors, confidence intervals, and p-values for test statistics from the sample data. It is especially helpful when the theoretical distribution of the test statistic is unknown. In Stata, you can use the bootstrap command or the vce(bootstrap) option (available for many estimation commands) to bootstrap the standard errors of the parameter estimates. We recommend using the vce() option whenever possible because it already accounts for the specific characteristics of the data. This matters most for panel data, where the bootstrap samples must be drawn by panel rather than by individual record.
In the vce() option, we can include all the specifications we would regularly include in the bootstrap command. For example, if we need to perform a test on a linear combination of some of the coefficients of the regression model, we can directly incorporate the linear combination expression into vce(). The example below bootstraps the standard error of the difference between the coefficients on age and wks_work in a fixed-effects regression of ln_wage:
. webuse nlswork
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

. xtset idcode
Panel variable: idcode (unbalanced)

. xtreg ln_wage wks_work age tenure ttl_exp, fe
> vce(bootstrap (_b[age] - _b[wks_work]),rep(10) seed(123))
(running xtreg on estimation sample)

Bootstrap replications (10): .........10 done

Bootstrap results                                Number of obs = 27,408
                                                 Replications  =     10

      Command: xtreg ln_wage wks_work age tenure ttl_exp, fe
        _bs_1: _b[age] - _b[wks_work]

                        (Replications based on 4,674 clusters in idcode)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _bs_1 |  -.0056473   .0011328    -4.99   0.000    -.0078675    -.003427
------------------------------------------------------------------------------
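The Normal-based columns of this table follow directly from the observed coefficient and the bootstrap standard error. As a rough check outside Stata, the Python sketch below recomputes them from the two reported numbers. The function name normal_based_summary is hypothetical, and the sketch assumes a standard Wald z statistic with a symmetric interval; it is an illustration, not Stata's implementation.

```python
from statistics import NormalDist


def normal_based_summary(coef, se, level=0.95):
    """Return (z, p, lower, upper) for a Normal-based test and interval."""
    std_normal = NormalDist()
    z = coef / se
    # Two-sided p-value for the null hypothesis that the parameter is zero
    p = 2 * (1 - std_normal.cdf(abs(z)))
    # Critical value of the standard normal (about 1.96 when level is 0.95)
    crit = std_normal.inv_cdf(0.5 + level / 2)
    return z, p, coef - crit * se, coef + crit * se


# Observed coefficient and bootstrap std. err. reported for _bs_1
z, p, lower, upper = normal_based_summary(-0.0056473, 0.0011328)
print(f"z = {z:.2f}, p = {p:.3f}, interval = [{lower:.7f}, {upper:.7f}]")
```

Up to rounding of the displayed inputs, the values agree with the z, P>|z|, and confidence-interval columns reported for _bs_1.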
As we mentioned above, we can get the same results with the bootstrap command. However, by using the vce() option, we do not have to explicitly specify the panel-data characteristics of our dataset.
With community-contributed commands or with non-estimation commands, we need to use bootstrap because there is no equivalent to the vce() option. The example below shows the bootstrap results for the ratio of the means of the first differences of two variables (ttl_exp and hours). We need to let the command know we are dealing with panel data and, therefore, each random selection must correspond to a panel. Moreover, repeated selections of the same panel within one bootstrapped sample should be internally treated as different panels.
Let’s first write a program that computes the ratio of the means of the first differences of two variables:
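program my_xtboot, rclass
    // mean of the first difference of the first variable
    summarize d.`1', meanonly
    scalar mean`1' = r(mean)
    // mean of the first difference of the second variable
    summarize d.`2', meanonly
    scalar mean`2' = r(mean)
    // return the ratio of the two means in r(ratio)
    return scalar ratio = scalar(mean`1')/scalar(mean`2')
end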
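To make the statistic concrete, here is a minimal Python sketch of the same calculation on a toy panel. It assumes the data are a list of dictionaries with hypothetical keys id, year, x, and y, that periods are integers one unit apart, and that there are no missing values. As with a time-series difference on data with gaps, a difference is formed only when the previous period is present in the same panel.

```python
from collections import defaultdict


def first_differences(records, var):
    """Within-panel first differences of var, for consecutive periods only."""
    by_panel = defaultdict(dict)
    for rec in records:
        by_panel[rec["id"]][rec["year"]] = rec[var]
    diffs = []
    for series in by_panel.values():
        for year, value in series.items():
            # A difference exists only if the previous period was observed
            if year - 1 in series:
                diffs.append(value - series[year - 1])
    return diffs


def ratio_of_mean_diffs(records, num, den):
    """Ratio of the mean first difference of num to that of den."""
    d_num = first_differences(records, num)
    d_den = first_differences(records, den)
    return (sum(d_num) / len(d_num)) / (sum(d_den) / len(d_den))


# Toy data: panel 1 has three consecutive years, panel 2 has a gap
records = [
    {"id": 1, "year": 70, "x": 1.0, "y": 10.0},
    {"id": 1, "year": 71, "x": 2.0, "y": 14.0},
    {"id": 1, "year": 72, "x": 4.0, "y": 16.0},
    {"id": 2, "year": 70, "x": 5.0, "y": 20.0},
    {"id": 2, "year": 72, "x": 9.0, "y": 30.0},
]
print(ratio_of_mean_diffs(records, "x", "y"))
```

Panel 2 contributes no differences because year 71 is missing, so only the two differences from panel 1 enter each mean.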
Next, let’s create the cluster identifier variable for the bootstrapped panels and declare the panel structure with it. We then mark the sample and keep only the observations that have no missing values for the variables of interest.
. generate newid = idcode

. tsset newid year
Panel variable: newid (unbalanced)
 Time variable: year, 68 to 88, but with gaps
         Delta: 1 unit

. generate sample=1-missing(ttl_exp,hours)

. keep if sample
(67 observations deleted)
Finally, we perform the simulation, specifying the panel characteristics of the dataset:
. bootstrap ratio=r(ratio),rep(10) seed(123)
> cluster(idcode) idcluster(newid) nowarn:my_xtboot ttl_exp hours
(running my_xtboot on estimation sample)

Bootstrap replications (10): .........10 done

Bootstrap results                                Number of obs = 28,467
                                                 Replications  =     10

      Command: my_xtboot ttl_exp hours
        ratio: r(ratio)

                        (Replications based on 4,710 clusters in idcode)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       ratio |   2.830833   1.542854     1.83   0.067    -.1931047    5.854771
------------------------------------------------------------------------------
There are two cluster options in the bootstrap command line. The first option, cluster(idcode), identifies the original panel variable in the dataset, whereas the second, idcluster(newid), creates a unique identifier for each of the selected clusters (panels in this case). Thus, if some panels were selected more than once, the variable newid would assign a different ID number to each resampled panel. Because the data were tsset on newid, the first differences in my_xtboot are then computed within each resampled panel. If the two cluster options are omitted, bootstrap will not take into account the panel structure of the data; rather, it will construct the simulated samples by randomly selecting individual observations from the pooled data.
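The resampling logic described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Stata's implementation: the data are a list of dictionaries with hypothetical keys id, year, and wage, the helper names resample_panels, bootstrap_se, and mean_wage are made up for the example, and the bootstrap standard error is taken as the sample standard deviation of the replicate statistics.

```python
import random
from collections import defaultdict
from statistics import stdev


def resample_panels(records, rng):
    """Draw whole panels with replacement; label every draw with a new id."""
    by_panel = defaultdict(list)
    for rec in records:
        by_panel[rec["id"]].append(rec)
    panel_ids = sorted(by_panel)
    sample = []
    for new_id in range(1, len(panel_ids) + 1):
        chosen = rng.choice(panel_ids)
        for rec in by_panel[chosen]:
            # The original id is kept; newid separates repeated draws
            sample.append({**rec, "newid": new_id})
    return sample


def bootstrap_se(records, statistic, reps, seed):
    """Standard deviation of the statistic across panel-level resamples."""
    rng = random.Random(seed)
    estimates = [statistic(resample_panels(records, rng)) for _ in range(reps)]
    return stdev(estimates)


def mean_wage(sample):
    """Toy statistic: the pooled mean of wage."""
    return sum(rec["wage"] for rec in sample) / len(sample)


records = [
    {"id": 1, "year": 70, "wage": 1.2},
    {"id": 1, "year": 71, "wage": 1.4},
    {"id": 2, "year": 70, "wage": 2.0},
    {"id": 2, "year": 71, "wage": 2.1},
    {"id": 3, "year": 71, "wage": 0.9},
]
print(bootstrap_se(records, mean_wage, reps=200, seed=123))
```

If a panel is drawn twice, its rows appear under two different newid values, so each year occurs at most once within a relabeled panel. Reusing the original id instead would give repeated years within one panel, which is why panel operations such as first differences need the new identifier.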