Highlights
Bayesian bootstrap for official or community-contributed commands
Continuous importance weights instead of traditional frequency weights
Priors for sampling observations
Better small-sample performance than classic bootstrap
See more resampling features
Use the new bayesboot prefix to perform Bayesian bootstrap to obtain more precise parameter estimates in small samples and incorporate prior information when sampling observations. Use it with official commands or community-contributed commands.
The Bayesian bootstrap, pioneered by Rubin (1981), offers an alternative to traditional bootstrap methods by leveraging Bayesian principles. Instead of sampling with replacement, in which each observation is either included or excluded, the Bayesian bootstrap assigns each observation a continuous importance weight drawn from a Dirichlet distribution. This approach directly models uncertainty about how representative each data point is of the underlying population.
The Bayesian bootstrap allows us to interpret the representativeness of each observation as a posterior distribution of importance weights. Within this Bayesian framework, researchers can incorporate prior knowledge when assigning weights to observations. Additionally, the distribution from which the weights are drawn is smooth, whereas traditional bootstrap methods make discrete inclusion and exclusion decisions. This smoothness makes the Bayesian bootstrap immune to certain problems that arise with the traditional bootstrap, such as replicates with perfect collinearity or replicates in which entire categories are unrepresented.
The bayesboot command performs Bayesian bootstrap by generating importance replication weights for each observation from a Dirichlet distribution and using them when estimating parameters and statistics. By default, each observation has the same probability of being selected, but you can customize this to include more informative priors for observations by using the priorpowers() option. bayesboot works seamlessly with official and community-contributed commands, similarly to the existing bootstrap prefix.
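The core resampling step is simple: each replication draws one vector of Dirichlet weights and recomputes the statistic with those weights. Here is a minimal Python sketch of that idea (illustrative only, not Stata's implementation; the toy sample and replication count are made up):

```python
import numpy as np

rng = np.random.default_rng(111)
y = rng.normal(loc=21, scale=5, size=30)   # toy sample of 30 observations
reps = 1000

stats = np.empty(reps)
for r in range(reps):
    # Flat Dirichlet(1, ..., 1) importance weights: they sum to 1 and are
    # all strictly positive, so every observation enters every replicate.
    w = rng.dirichlet(np.ones(y.size))
    stats[r] = np.sum(w * y)               # weighted mean for this replicate

point = y.mean()                           # observed statistic
se = stats.std(ddof=1)                     # Bayesian bootstrap standard error
print(point, se)
```

With an improper (flat) prior, every observation has the same expected weight, which mirrors bayesboot's default before any priorpowers() customization.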
-> Bayesian bootstrap and traditional bootstrap
-> Incorporating prior information
-> bayesboot as a wrapper
-> The impact of custom priors
Let's compare the Bayesian bootstrap with the traditional bootstrap by applying both to the coefficients of a linear regression. Using the auto dataset, we analyze how vehicle price (price) and repair record (rep78) affect fuel efficiency (mpg).
We first perform traditional bootstrap by using the existing bootstrap prefix and then Bayesian bootstrap by using the new bayesboot prefix. We specify the rseed(111) option with both for reproducibility.
. sysuse auto
(1978 automobile data)

. drop if rep78 == .
(5 observations deleted)

. bootstrap, rseed(111): regress mpg price i.rep78
(running regress on estimation sample)

Bootstrap replications (50): .x.......xx........x.........30.........40..x......
> 50 done
x: Error occurred when bootstrap executed regress.

Linear regression                               Number of obs     =         69
                                                Replications      =         45
                                                Wald chi2(5)      =      30.44
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.4241
                                                Adj R-squared     =     0.3784
                                                Root MSE          =     4.6251

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
         mpg | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0008829   .0002191    -4.03   0.000    -.0013124   -.0004535
             |
       rep78 |
          2  |  -.6361411   2.359325    -0.27   0.787    -5.260334    3.988051
          3  |   .0797594   2.063937     0.04   0.969    -3.965483    4.125002
          4  |    1.99724   2.207023     0.90   0.365    -2.328446    6.322925
          5  |   7.554265   3.228696     2.34   0.019     1.226137    13.88239
             |
       _cons |   25.03013    2.02501    12.36   0.000     21.06118    28.99907
------------------------------------------------------------------------------
Now let's perform the same analysis using Bayesian bootstrap. We also specify bayesboot's generate() option to save the generated importance weights in the new variables iw1 through iw50 for later comparison.
. bayesboot, rseed(111) generate(iw): regress mpg price i.rep78
(running regress on estimation sample)

Bayesian bootstrap replications (50): .........10.........20.........30.........
> 40.........50 done

Bayesian bootstrap                              Observation prior:    Improper
Linear regression                               Number of obs     =         69
                                                Replications      =         50
                                                Wald chi2(5)      =      44.10
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.4241
                                                Adj R-squared     =     0.3784
                                                Root MSE          =     4.6251

------------------------------------------------------------------------------
             |              Bayesian
             |   Observed   bootstrap                         Normal-based
         mpg | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0008829   .0001901    -4.64   0.000    -.0012556   -.0005103
             |
       rep78 |
          2  |  -.6361411   1.876978    -0.34   0.735     -4.31495    3.042667
          3  |   .0797594   1.652173     0.05   0.961    -3.158441    3.317959
          4  |    1.99724   1.926711     1.04   0.300    -1.779045    5.773525
          5  |   7.554265   2.427152     3.11   0.002     2.797136    12.31139
             |
       _cons |   25.03013   1.980008    12.64   0.000     21.14938    28.91087
------------------------------------------------------------------------------
Although both methods lead to similar overall conclusions, an advantage of the Bayesian bootstrap can be seen from the replication output. Notice the “x” markers in the traditional bootstrap results. Each marker indicates a replication that could not be computed, leaving missing values for the coefficient estimates in that replication. This can happen when a resample is perfectly collinear or when some of rep78's categories draw no observations, so their coefficients cannot be computed. In contrast, bayesboot completes all 50 replications without errors.
This improved stability stems from the use of continuous weights by Bayesian bootstrap, as opposed to the discrete resampling of traditional bootstrap. The continuous weighting approach maintains greater numerical stability by avoiding the perfect collinearity that sometimes occurs with discrete resampling.
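The difference is easy to see in a quick simulation (a Python sketch, not Stata code; the sample size mirrors the 69 observations above):

```python
import numpy as np

rng = np.random.default_rng(111)
n = 69

# Traditional bootstrap: multinomial frequency weights. Observations with
# count 0 are excluded from the replicate, so a sparse rep78 category can
# disappear entirely, or the remaining rows can end up collinear.
freq = rng.multinomial(n, np.ones(n) / n)

# Bayesian bootstrap: Dirichlet importance weights. Every observation keeps
# a strictly positive weight, so no category can vanish from a replicate.
imp = rng.dirichlet(np.ones(n))

print("excluded by traditional bootstrap:", int((freq == 0).sum()))
print("excluded by Bayesian bootstrap:   ", int((imp == 0).sum()))
```

In a typical frequency-weight draw, roughly a third of the observations receive count 0 (the expected fraction is about 1/e), while the Dirichlet draw excludes none.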
One of Bayesian bootstrap's key advantages is the ability to incorporate domain knowledge by specifying priors for observations when you have information about the relative importance or reliability of observations.
Below, we explore how different prior values affect estimation precision and statistical significance by using the priorpowers() option to modify the default prior.
. generate priorvar = rbeta(2,7)+2

. bayesboot, priorpowers(priorvar) rseed(111): regress mpg price i.rep78
(running regress on estimation sample)

Bayesian bootstrap replications (50): .........10.........20.........30.........
> 40.........50 done

Bayesian bootstrap                              Observation prior:    priorvar
Linear regression                               Number of obs     =         69
                                                Replications      =         50
                                                Wald chi2(5)      =     124.40
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.4241
                                                Adj R-squared     =     0.3784
                                                Root MSE          =     4.6251

------------------------------------------------------------------------------
             |              Bayesian
             |   Observed   bootstrap                         Normal-based
         mpg | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0008829   .0001067    -8.28   0.000     -.001092   -.0006738
             |
       rep78 |
          2  |  -.6361411   1.002831    -0.63   0.526    -2.601653    1.329371
          3  |   .0797594   .9792511     0.08   0.935    -1.839538    1.999056
          4  |    1.99724   .9489378     2.10   0.035     .1373558    3.857124
          5  |   7.554265   1.481269     5.10   0.000     4.651031     10.4575
             |
       _cons |   25.03013   1.026821    24.38   0.000     23.01759    27.04266
------------------------------------------------------------------------------
Looking at the coefficient for 4.rep78, we see that its confidence interval includes 0 under the default priors but excludes 0 under our custom priors. This occurs because higher prior values represent stronger belief in the representativeness of the observations, concentrating the importance weights and producing narrower confidence intervals.
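If, as the priorpowers() name suggests, the prior values act like Dirichlet concentration parameters (an assumption here; see [R] bayesboot for the exact definition), then larger powers shrink the variability of the weights. A quick Python sketch compares a flat power of 1 with a power of 2.22, roughly the mean of the rbeta(2,7)+2 draws above:

```python
import numpy as np

rng = np.random.default_rng(111)
n, reps = 69, 5000

# Dirichlet(a, ..., a): larger concentration a pulls every weight toward
# the common mean 1/n, shrinking replicate-to-replicate variability.
flat = rng.dirichlet(np.ones(n), size=reps)           # flat prior power 1
strong = rng.dirichlet(np.full(n, 2.22), size=reps)   # higher prior powers

print("flat prior weight std:  ", flat.std())
print("strong prior weight std:", strong.std())       # smaller
```

Both sets of weights average exactly 1/n, so the priors change only how far individual weights stray from that mean, which is what drives the width of the confidence intervals.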
The bayesboot command is a convenience wrapper that combines two features: the new rwgen command, which generates random replication weights, and bootstrap's iweights() option, which applies those importance weights during estimation. We can replicate the bayesboot results from the previous example by running the two commands separately.
. rwgen bayes myiw, priorpowers(priorvar) rseed(111)

. bootstrap, iweights(myiw1-myiw50): regress mpg price i.rep78
(running regress on estimation sample)

Bootstrap replications (50): .........10.........20.........30.........40......
> ...50 done

Linear regression                               Number of obs     =         69
                                                Replications      =         50
                                                Wald chi2(5)      =     124.40
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.4241
                                                Adj R-squared     =     0.3784
                                                Root MSE          =     4.6251

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
         mpg | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0008829   .0001067    -8.28   0.000     -.001092   -.0006738
             |
       rep78 |
          2  |  -.6361411   1.002831    -0.63   0.526    -2.601653    1.329371
          3  |   .0797594   .9792511     0.08   0.935    -1.839538    1.999056
          4  |    1.99724   .9489378     2.10   0.035     .1373558    3.857124
          5  |   7.554265   1.481269     5.10   0.000     4.651031     10.4575
             |
       _cons |   25.03013   1.026821    24.38   0.000     23.01759    27.04266
------------------------------------------------------------------------------
To understand how custom priors affect our analysis, let's compare the distributions of the default and custom weights for the first replicate:
. summarize iw1 myiw1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         iw1 |         69    .0144928    .0157269   .0001861   .0751048
       myiw1 |         69    .0144928    .0077043   .0016568   .0385901
The summary statistics reveal important differences in the distributions of weights. Although both sets maintain the same mean (1/69 = 0.0144928), the custom weights based on our higher prior values show substantially lower variability. This difference in variability has a direct impact on our regression results, as we saw earlier.
Rubin, D. B. 1981. The Bayesian bootstrap. Annals of Statistics 9: 130–134. https://doi.org/10.1214/aos/1176345338.
Read more about Bayesian bootstrap in [R] bayesboot and the rwgen command in [R] rwgen in the Stata Base Reference Manual.
Learn more about Stata's resampling features.
View all the new features in Stata 19 and, in particular, what's new in resampling.