Bayesian variable selection for linear regression


Highlights

  • Model choice and inference

  • Variable selection

  • Sparse modeling

  • Flexible Bayesian approach

  • Global–local shrinkage priors, including the horseshoe prior

  • Spike-and-slab priors, including the spike-and-slab lasso

  • Inclusion coefficients and probabilities

  • Efficient Gibbs sampling

  • Bayesian predictions and diagnostics

  • See more Bayesian analysis features

With the new bayesselect command, you can perform Bayesian variable selection for linear regression. Account for model uncertainty and perform Bayesian inference. This command is part of StataNow™.

A frequent problem in regression is identifying the subset of predictors that are most relevant to the outcome when you have many potential predictors. Variable selection, also known as sparse regression, improves model interpretability and provides more stable inference.

Stata's Bayesian suite now includes a new command, bayesselect, that implements Bayesian variable selection for the linear model. bayesselect complements existing Stata commands related to variable selection, such as lasso and bmaregress.

bayesselect provides a flexible Bayesian approach to variable selection by using a variety of specially designed priors for coefficients, such as global–local shrinkage and spike-and-slab priors. bayesselect is fully integrated in Stata's Bayesian suite and works seamlessly with all Bayesian postestimation routines.

Let's see it work

Consider the diabetes dataset, which contains records of disease progression in 442 patients along with control factors such as age, gender, body mass index, blood pressure, and measurements on their blood serum (Efron et al. 2004).

. webuse diabetes
(2004 Diabetes progression data)

Following a common procedure in variable-selection methodologies, all variables are standardized so that they have mean 0 and standard deviation of 1. The outcome variable of interest is diabetes, which we regress on the other 10 variables. We assume that not all covariates are of equal importance and that by performing variable selection, we can achieve more efficient inference and improved prediction.
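The variables in this dataset are already standardized. If you needed to standardize a variable yourself before running bayesselect, one way is egen's std() function; here zbmi is a hypothetical name for the standardized copy of bmi:

. egen double zbmi = std(bmi)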

Performing Bayesian variable selection with bayesselect is as simple as fitting any other regression in Stata. We use the default specification of bayesselect; the only option we add is rseed() for reproducibility. When fitting the model, we exclude the last observation, the 442nd, to use as a test case.

. bayesselect diabetes age sex bmi bp serum1-serum6 in 1/441, rseed(19)

Burn-in ...
Simulation ...

Model summary

Likelihood:
  diabetes ~ normal(xb_diabetes,{sigma2})

Priors:
  {diabetes:age ... serum6} ~ glshrinkage(1,{tau},{lambdas})               (1)
           {diabetes:_cons} ~ normal(0,10000)                              (1)
                   {sigma2} ~ jeffreys

Hyperprior:
  {tau lambdas} ~ halfcauchy(0,1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_diabetes.

Bayesian variable selection                      MCMC iterations  =    12,500
Metropolis–Hastings and Gibbs sampling           Burn-in          =     2,500
                                                 MCMC sample size =    10,000
Global–local shrinkage coefficient prior:        Number of obs    =       441
  Horseshoe(1)                                   Acceptance rate  =     .8633
                                                 Efficiency:  min =     .1516
                                                              avg =       .44
Log marginal-likelihood = -475.56099                          max =         1

------------------------------------------------------------------------------
             |                                 Equal-tailed         Inclusion
    diabetes |      Mean   Std. dev.      MCSE  [95% cred. interval]     coef.
-------------+----------------------------------------------------------------
      serum5 |  .3366779   .0671548  .0013409   .2182886    .4861579      0.73
         bmi |  .3284608   .0414561  .0004504    .247525    .4087023      0.72
          bp |  .1900507   .0411339  .0004625   .1096927    .2718276      0.59
         sex | -.1279497   .0398828  .0005348  -.2048249   -.0483072      0.50
      serum1 |  -.125876    .127556  .0032762  -.4417064    .0527829      0.44
      serum3 | -.0929577   .0768004   .001911  -.2375419     .046901      0.42
      serum4 |   .048891   .0733476  .0017154  -.0749584    .2131595      0.32
      serum2 |  .0188206   .1037191  .0023902  -.1479433    .2834479      0.32
      serum6 |  .0300323    .036144  .0006945  -.0308045    .1095077      0.26
         age |  -.001932   .0282981   .000283  -.0620468    .0574765      0.20
------------------------------------------------------------------------------

------------------------------------------------------------------------------
             |                                              Equal-tailed
             |      Mean   Std. dev.      MCSE    Median  [95% cred. interval]
-------------+----------------------------------------------------------------
diabetes     |
       _cons | -.0000518   .0337677   .000342  .0000614  -.0658819   .0662833
      sigma2 |  .4978079   .0338884   .000721  .4965312   .4346842   .5678974
         tau |  .2162604   .1237476   .004792  .1864929   .0637095   .5403306
------------------------------------------------------------------------------

The default variable selection prior used by bayesselect is the horseshoe prior (Carvalho et al. 2009). It is a special case of the so-called global–local shrinkage priors that include local shrinkage factors lambdas, one for each coefficient. The form of this prior is described in the model summary of the command.
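In one common parameterization (matching the glshrinkage() and halfcauchy() terms in the model summary above), the horseshoe prior can be sketched as

  beta_j | lambda_j, tau ~ Normal(0, lambda_j^2 tau^2)
  lambda_j ~ Half-Cauchy(0, 1)
  tau ~ Half-Cauchy(0, 1)

A small local scale lambda_j shrinks beta_j toward zero, while the heavy tails of the Half-Cauchy allow genuinely important coefficients to escape shrinkage.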

The shrinkage factors are transformed into inclusion coefficients, which are summarized in the last column of the output of bayesselect. The predictor variables in the output are ordered by their estimated inclusion coefficients, from largest to smallest. The top three predictors, which have inclusion coefficients greater than 0.5, are serum5, bmi (body mass index), and bp (blood pressure). All three of these predictors have positive effects on the outcome—the posterior mean estimates for their coefficients are 0.34, 0.33, and 0.19, respectively.

In a second output table, below the table with coefficients, bayesselect reports posterior summaries for the constant term, the variance term sigma2, and the global shrinkage parameter tau.

Because we want predictions, we first need to save the simulation results from bayesselect.

. bayesselect, saving(model1sim)
note: file model1sim.dta saved.

We can now use the bayespredict command to predict disease progression for the last patient in the study, observation 442.

. bayespredict double pmean1 in 442, mean

The computed posterior predictive mean is saved in a new variable, pmean1. We will look at this prediction later.
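Because bayesselect supports the standard Bayesian postestimation routines, you could also inspect MCMC convergence before trusting predictions. For example, bayesgraph can display trace, autocorrelation, histogram, and density plots for a model parameter; here is a sketch for the bmi coefficient:

. bayesgraph diagnostics {diabetes:bmi}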

Another popular variable-selection model is the spike-and-slab lasso model (Ročková and George 2018). We request this model by specifying the sslaplace option in bayesselect.
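In the spike-and-slab lasso, each coefficient is drawn from a mixture of two Laplace distributions—a narrow "spike" and a wide "slab"—with a latent indicator choosing between them. Schematically, using the default scales that appear in the model summary below:

  beta_j | gamma_j ~ (1 - gamma_j) Laplace(0, 0.01) + gamma_j Laplace(0, 1)
  gamma_j ~ Bernoulli(theta)
  theta ~ Beta(1, 1)

The posterior mean of gamma_j is the inclusion probability reported for each predictor.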

. bayesselect diabetes age sex bmi bp serum1-serum6 in 1/441, sslaplace rseed(19)

Burn-in ...
Simulation ...

Model summary

Likelihood:
  diabetes ~ normal(xb_diabetes,{sigma2})

Priors:
  {diabetes:age ... serum6} ~ mixlaplace(1,.01,1,{gammas})                 (1)
           {diabetes:_cons} ~ normal(0,10000)                              (1)
                   {sigma2} ~ jeffreys

Hyperpriors:
  {gammas} ~ bernoulli({theta})
   {theta} ~ beta(1,1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_diabetes.

Bayesian variable selection                      MCMC iterations  =    12,500
Metropolis–Hastings and Gibbs sampling           Burn-in          =     2,500
                                                 MCMC sample size =    10,000
Spike-and-slab coefficient prior:                Number of obs    =       441
  Laplace mixture: L(0,.01) and L(0,1)           Acceptance rate  =     .8669
  Beta(1,1) for {theta}                          Efficiency:  min =    .07772
                                                              avg =     .5488
Log marginal-likelihood = -498.88187                          max =         1

------------------------------------------------------------------------------
             |                                 Equal-tailed         Inclusion
    diabetes |      Mean   Std. dev.      MCSE  [95% cred. interval]     prob.
-------------+----------------------------------------------------------------
         bmi |  .3232745    .043208  .0004401   .2389561    .4068552      1.00
      serum5 |  .4009522   .1009956  .0011669   .2128139    .6081692      1.00
          bp |  .1974724    .042964  .0004411   .1120864    .2813779      1.00
         sex | -.1323147   .0542777   .001947  -.2211694   -.0013984      0.90
      serum1 |  -.196339   .2424597   .008559  -.7612903    .0352727      0.57
      serum3 | -.0200597   .0878669  .0009696   -.213405    .1878612      0.49
      serum4 |  .0434984   .0739697  .0012912  -.0682323     .23624       0.40
      serum2 |  .0606413   .1446439  .0037786  -.1015047    .4714863      0.36
      serum6 |  .0197704   .0286588  .0004696   -.018216    .0963732      0.23
         age |  .0014778   .0174662  .0001747  -.0360894    .0380146      0.10
------------------------------------------------------------------------------

------------------------------------------------------------------------------
             |                                              Equal-tailed
             |      Mean   Std. dev.      MCSE    Median  [95% cred. interval]
-------------+----------------------------------------------------------------
diabetes     |
       _cons | -.0007783   .0353999   .000355 -.0005396  -.0705115   .0695376
      sigma2 |  .5436156   .0620606   .001684  .5342468   .4475337   .6941913
       theta |  .5890643   .1756445   .004652  .5931756   .2441296   .9121046
------------------------------------------------------------------------------

Instead of the inclusion coefficients of the horseshoe prior model, the output of the spike-and-slab lasso reports inclusion probabilities, which are easier to interpret. The predictors bmi, serum5, and bp all have inclusion probabilities of 1; in other words, there is essentially no uncertainty about the importance of these three predictors. Their coefficient estimates are similar to those from the horseshoe model. Overall, the inclusion probabilities are more spread out, ranging from 0.10 for age to 1.00, than the inclusion coefficients of the horseshoe model, which range from 0.20 to 0.73.

Let's save the last simulation results and make a prediction for the last patient in the study.

. bayesselect, saving(model2sim)
note: file model2sim.dta saved.

. bayespredict double pmean2 in 442, mean

To compare the prediction results of the two variable-selection models, we list the record of observation 442.

. list in 442

     +--------------------------------------------------------------------+
442. |  diabetes        age        sex        bmi         bp      serum1  |
     | -1.234009   -.954922  -.9374744  -1.533636  -1.709689    1.758542  |
     |--------------------------------------------------------------------|
     |    serum2     serum3     serum4     serum5     serum6      pmean1  |
     |  .5839875   3.650131   -.829361  -.0886171   .0643526  -1.3141091  |
     |--------------------------------------------------------------------|
     |    pmean2                                                          |
     | -1.1757732                                                         |
     +--------------------------------------------------------------------+

The prediction of the spike-and-slab model (-1.18) is closer to the true value (-1.23) than that of the horseshoe model (-1.31). Because the outcome is standardized, negative values indicate below-average disease progression, so both models correctly predict slower-than-average progression for this patient.
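To compare the two models more formally, you could store each model's estimation results with estimates store immediately after fitting (the simulation results must first be saved, as we did with the saving() option) and then use bayestest model, which computes posterior model probabilities from the log marginal-likelihoods. A sketch, where m1 and m2 are hypothetical names and we assume bayesselect supports estimates store like Stata's other Bayesian estimation commands:

. bayesselect diabetes age sex bmi bp serum1-serum6 in 1/441, rseed(19)
. bayesselect, saving(model1sim)
. estimates store m1
. bayesselect diabetes age sex bmi bp serum1-serum6 in 1/441, sslaplace rseed(19)
. bayesselect, saving(model2sim)
. estimates store m2
. bayestest model m1 m2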

References

Carvalho, C. M., N. G. Polson, and J. G. Scott. 2009. Handling sparsity via the horseshoe. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, ed. D. van Dyk and M. Welling, vol. 5 of Proceedings of Machine Learning Research, 73–80. Clearwater Beach, FL.

Efron, B., T. J. Hastie, I. Johnstone, and R. J. Tibshirani. 2004. Least angle regression. Annals of Statistics 32: 407–499.

Ročková, V., and E. I. George. 2018. The spike-and-slab lasso. Journal of the American Statistical Association 113: 431–444.

Tell me more

Read more about Bayesian analysis in the Stata Bayesian Analysis Reference Manual; see [BAYES] bayesselect.

View all the new features in Stata 18.

Made for data science.

Get started today.