Bayesian variable selection for linear regression


Highlights

  • Model choice and inference

  • Variable selection

  • Sparse modeling

  • Flexible Bayesian approach

  • Global–local shrinkage priors, including the horseshoe prior

  • Spike-and-slab priors, including the spike-and-slab lasso

  • Inclusion coefficients and probabilities

  • Efficient Gibbs sampling

  • Bayesian predictions and diagnostics

  • See more Bayesian analysis features

With the new bayesselect command, you can perform Bayesian variable selection for linear regression. Account for model uncertainty and perform Bayesian inference. This command is part of StataNow™.

A frequent problem in regression is identifying the subset of predictors that are most relevant to the outcome when you have many potential predictors. Variable selection, also known as sparse regression, improves model interpretability and provides more stable inference.

Stata's Bayesian suite now includes a new command, bayesselect, that implements Bayesian variable selection for the linear model. bayesselect complements existing Stata commands related to variable selection, such as lasso and bmaregress.

bayesselect provides a flexible Bayesian approach to variable selection by using a variety of specially designed priors for coefficients, such as global–local shrinkage and spike-and-slab priors. bayesselect is fully integrated in Stata's Bayesian suite and works seamlessly with all Bayesian postestimation routines.

Let's see it work

Consider the diabetes dataset, which contains records of disease progression in 442 patients along with control factors such as age, gender, body mass index, blood pressure, and measurements on their blood serum (Efron et al. 2004).

. webuse diabetes
(2004 Diabetes progression data)

Following a common procedure in variable-selection methodologies, all variables are standardized so that they have mean 0 and standard deviation of 1. The outcome variable of interest is diabetes, which we regress on the other 10 variables. We assume that not all covariates are of equal importance and that by performing variable selection, we can achieve more efficient inference and improved prediction.
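The variables in this dataset are already standardized. If you needed to standardize a variable yourself before running bayesselect, one way is egen's std() function; here zbmi is a hypothetical name for the standardized copy of bmi:

. egen double zbmi = std(bmi)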

Performing Bayesian variable selection with bayesselect is as simple as fitting any other regression in Stata. We use the default specification of bayesselect; the only option we add is rseed() for reproducibility. When fitting the model, we exclude the last observation, the 442nd, to use as a test case.

. bayesselect diabetes age sex bmi bp serum1-serum6 in 1/441, rseed(19)

Burn-in ...
Simulation ...

Model summary

Likelihood:
  diabetes ~ normal(xb_diabetes,{sigma2})

Priors:
  {diabetes:age ... serum6} ~ glshrinkage(1,{tau},{lambdas})               (1)
           {diabetes:_cons} ~ normal(0,10000)                              (1)
                   {sigma2} ~ jeffreys

Hyperprior:
  {tau lambdas} ~ halfcauchy(0,1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_diabetes.

Bayesian variable selection                      MCMC iterations  =    12,500
Metropolis–Hastings and Gibbs sampling           Burn-in          =     2,500
                                                 MCMC sample size =    10,000
Global–local shrinkage coefficient prior:        Number of obs    =       441
  Horseshoe(1)                                   Acceptance rate  =     .8633
                                                 Efficiency:  min =     .1516
                                                              avg =       .44
Log marginal-likelihood = -475.56099                          max =         1

------------------------------------------------------------------------------
             |                                 Equal-tailed         Inclusion
    diabetes |      Mean   Std. dev.      MCSE  [95% cred. interval]     coef.
-------------+----------------------------------------------------------------
      serum5 |  .3366779   .0671548  .0013409   .2182886    .4861579      0.73
         bmi |  .3284608   .0414561  .0004504    .247525    .4087023      0.72
          bp |  .1900507   .0411339  .0004625   .1096927    .2718276      0.59
         sex | -.1279497   .0398828  .0005348  -.2048249   -.0483072      0.50
      serum1 |  -.125876    .127556  .0032762  -.4417064    .0527829      0.44
      serum3 | -.0929577   .0768004   .001911  -.2375419     .046901      0.42
      serum4 |   .048891   .0733476  .0017154  -.0749584    .2131595      0.32
      serum2 |  .0188206   .1037191  .0023902  -.1479433    .2834479      0.32
      serum6 |  .0300323    .036144  .0006945  -.0308045    .1095077      0.26
         age |  -.001932   .0282981   .000283  -.0620468    .0574765      0.20
------------------------------------------------------------------------------

------------------------------------------------------------------------------
             |                                              Equal-tailed
             |      Mean   Std. dev.      MCSE    Median  [95% cred. interval]
-------------+----------------------------------------------------------------
diabetes     |
       _cons | -.0000518   .0337677   .000342  .0000614  -.0658819   .0662833
      sigma2 |  .4978079   .0338884   .000721  .4965312   .4346842   .5678974
         tau |  .2162604   .1237476   .004792  .1864929   .0637095   .5403306
------------------------------------------------------------------------------

The default variable selection prior used by bayesselect is the horseshoe prior (Carvalho et al. 2009). It is a special case of the so-called global–local shrinkage priors that include local shrinkage factors lambdas, one for each coefficient. The form of this prior is described in the model summary of the command.
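In one common parameterization (matching the glshrinkage() and halfcauchy() terms in the model summary above), the horseshoe prior can be sketched as

  beta_j | lambda_j, tau ~ Normal(0, lambda_j^2 tau^2)
  lambda_j ~ Half-Cauchy(0, 1)
  tau ~ Half-Cauchy(0, 1)

A small local scale lambda_j shrinks beta_j toward zero, while the heavy tails of the Half-Cauchy allow genuinely important coefficients to escape shrinkage.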

The shrinkage factors are transformed into inclusion coefficients, which are summarized in the last column of the output of bayesselect. The predictor variables in the output are ordered by their estimated inclusion coefficients, from largest to smallest. The top three predictors, which have inclusion coefficients greater than 0.5, are serum5, bmi (body mass index), and bp (blood pressure). All three of these predictors have positive effects on the outcome—the posterior mean estimates for their coefficients are 0.34, 0.33, and 0.19, respectively.

In a second output table, below the table with coefficients, bayesselect reports posterior summaries for the constant term, the variance term sigma2, and the global shrinkage parameter tau.

Because we want predictions, we first need to save the simulation results from bayesselect.

. bayesselect, saving(model1sim)
note: file model1sim.dta saved.

We can now use the bayespredict command to predict disease progression for the last patient in the study, observation 442.

. bayespredict double pmean1 in 442, mean

The computed posterior predictive mean is saved in a new variable, pmean1. We will look at this prediction later.
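Because bayesselect supports the standard Bayesian postestimation routines, you could also inspect MCMC convergence before trusting predictions. For example, bayesgraph can display trace, autocorrelation, histogram, and density plots for a model parameter; here is a sketch for the bmi coefficient:

. bayesgraph diagnostics {diabetes:bmi}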

Another popular variable-selection model is the spike-and-slab lasso model (Ročková and George 2018). We request this model by specifying the sslaplace option in bayesselect.
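In the spike-and-slab lasso, each coefficient is drawn from a mixture of two Laplace distributions—a narrow "spike" and a wide "slab"—with a latent indicator choosing between them. Schematically, using the default scales that appear in the model summary below:

  beta_j | gamma_j ~ (1 - gamma_j) Laplace(0, 0.01) + gamma_j Laplace(0, 1)
  gamma_j ~ Bernoulli(theta)
  theta ~ Beta(1, 1)

The posterior mean of gamma_j is the inclusion probability reported for each predictor.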

. bayesselect diabetes age sex bmi bp serum1-serum6 in 1/441, sslaplace rseed(19)

Burn-in ...
Simulation ...

Model summary

Likelihood:
  diabetes ~ normal(xb_diabetes,{sigma2})

Priors:
  {diabetes:age ... serum6} ~ mixlaplace(1,.01,1,{gammas})                 (1)
           {diabetes:_cons} ~ normal(0,10000)                              (1)
                   {sigma2} ~ jeffreys

Hyperpriors:
  {gammas} ~ bernoulli({theta})
   {theta} ~ beta(1,1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_diabetes.

Bayesian variable selection                      MCMC iterations  =    12,500
Metropolis–Hastings and Gibbs sampling           Burn-in          =     2,500
                                                 MCMC sample size =    10,000
Spike-and-slab coefficient prior:                Number of obs    =       441
  Laplace mixture: L(0,.01) and L(0,1)           Acceptance rate  =     .8669
  Beta(1,1) for {theta}                          Efficiency:  min =    .07772
                                                              avg =     .5488
Log marginal-likelihood = -498.88187                          max =         1

------------------------------------------------------------------------------
             |                                 Equal-tailed         Inclusion
    diabetes |      Mean   Std. dev.      MCSE  [95% cred. interval]     prob.
-------------+----------------------------------------------------------------
         bmi |  .3232745    .043208  .0004401   .2389561    .4068552      1.00
      serum5 |  .4009522   .1009956  .0011669   .2128139    .6081692      1.00
          bp |  .1974724    .042964  .0004411   .1120864    .2813779      1.00
         sex | -.1323147   .0542777   .001947  -.2211694   -.0013984      0.90
      serum1 |  -.196339   .2424597   .008559  -.7612903    .0352727      0.57
      serum3 | -.0200597   .0878669  .0009696   -.213405    .1878612      0.49
      serum4 |  .0434984   .0739697  .0012912  -.0682323     .23624       0.40
      serum2 |  .0606413   .1446439  .0037786  -.1015047    .4714863      0.36
      serum6 |  .0197704   .0286588  .0004696   -.018216    .0963732      0.23
         age |  .0014778   .0174662  .0001747  -.0360894    .0380146      0.10
------------------------------------------------------------------------------

------------------------------------------------------------------------------
             |                                              Equal-tailed
             |      Mean   Std. dev.      MCSE    Median  [95% cred. interval]
-------------+----------------------------------------------------------------
diabetes     |
       _cons | -.0007783   .0353999   .000355 -.0005396  -.0705115   .0695376
      sigma2 |  .5436156   .0620606   .001684  .5342468   .4475337   .6941913
       theta |  .5890643   .1756445   .004652  .5931756   .2441296   .9121046
------------------------------------------------------------------------------

Instead of the inclusion coefficients of the horseshoe prior model, the output of the spike-and-slab lasso reports inclusion probabilities, which are easier to interpret. The predictors bmi, serum5, and bp all have inclusion probabilities of 1; in other words, there is essentially no uncertainty about the importance of these three predictors. Their coefficient estimates are similar to those from the horseshoe model. Overall, the inclusion probabilities are more spread out, ranging from 0.10 for age to 1.00, than the inclusion coefficients of the horseshoe model, which range from 0.20 to 0.73.

Let's save the last simulation results and make a prediction for the last patient in the study.

. bayesselect, saving(model2sim)
note: file model2sim.dta saved.

. bayespredict double pmean2 in 442, mean

To compare the prediction results of the two variable-selection models, we list the record of observation 442.

. list in 442

     +--------------------------------------------------------------------+
442. |  diabetes        age        sex        bmi         bp      serum1  |
     | -1.234009   -.954922  -.9374744  -1.533636  -1.709689    1.758542  |
     |--------------------------------------------------------------------|
     |    serum2     serum3     serum4     serum5     serum6      pmean1  |
     |  .5839875   3.650131   -.829361  -.0886171   .0643526  -1.3141091  |
     |--------------------------------------------------------------------|
     |    pmean2                                                          |
     | -1.1757732                                                         |
     +--------------------------------------------------------------------+

The prediction of the spike-and-slab model (-1.18) is closer to the true value (-1.23) than that of the horseshoe model (-1.31). Because the outcome is standardized, negative values indicate below-average disease progression, so both models correctly predict slower-than-average progression for this patient.
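To compare the two models more formally, you could store each model's estimation results with estimates store immediately after fitting (the simulation results must first be saved, as we did with the saving() option) and then use bayestest model, which computes posterior model probabilities from the log marginal-likelihoods. A sketch, where m1 and m2 are hypothetical names and we assume bayesselect supports estimates store like Stata's other Bayesian estimation commands:

. bayesselect diabetes age sex bmi bp serum1-serum6 in 1/441, rseed(19)
. bayesselect, saving(model1sim)
. estimates store m1
. bayesselect diabetes age sex bmi bp serum1-serum6 in 1/441, sslaplace rseed(19)
. bayesselect, saving(model2sim)
. estimates store m2
. bayestest model m1 m2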

References

Carvalho, C. M., N. G. Polson, and J. G. Scott. 2009. Handling sparsity via the horseshoe. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, ed. D. van Dyk and M. Welling, vol. 5 of Proceedings of Machine Learning Research, 73–80. Clearwater Beach, FL.

Efron, B., T. J. Hastie, I. Johnstone, and R. J. Tibshirani. 2004. Least angle regression. Annals of Statistics 32: 407–499.

Ročková, V., and E. I. George. 2018. The spike-and-slab lasso. Journal of the American Statistical Association 113: 431–444.

Tell me more

Read more about Bayesian analysis in the Stata Bayesian Analysis Reference Manual; see [BAYES] bayesselect.

View all the new features in Stata 18.

Made for data science.

Get started today.