Laplace prior for specifying L1 penalty for coefficients
Gamma prior for the penalty hyperparameter
Bayesian criteria for selecting important predictors
Stata has a suite of commands for performing lasso-based prediction and inference. But did you know that you can fit Bayesian lasso linear models by using bayes: regress? And that inference can be performed in the usual Bayesian way, just as with any other model?
Lasso uses L1-constrained least squares to estimate its coefficients. This involves selecting a value for the penalty parameter, which is often done using cross-validation.
The Bayesian approach is different. Constraints are imposed through prior distributions. The Laplace prior provides constraints that have the same analytic form as the L1 penalty used in lasso. The Laplace prior introduces the penalty parameter as one more model parameter that needs to be estimated from the data. One advantage of the Bayesian approach is that the reported credible intervals can be used to obtain proper inference for the parameters of interest. Within the frequentist approach, special methods are needed to obtain proper inference; see Lasso for inference.
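To make this concrete, compare the lasso criterion with the conditional Laplace prior used by Park and Casella (2008). In the sketch below, lambda denotes the penalty parameter and sigma^2 the error variance:

    \hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\; (y - X\beta)'(y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j|

    \pi(\beta \mid \sigma^2) = \prod_{j=1}^{p} \frac{\lambda}{2\sqrt{\sigma^2}} \exp\!\left(-\frac{\lambda\,|\beta_j|}{\sqrt{\sigma^2}}\right)

The exponent of the prior carries the same L1 term that appears in the lasso criterion, so the posterior mode under this prior is a lasso-type estimate (with the penalty rescaled by the error scale). This is also why, in the commands below, the Laplace scale is specified as sqrt({sigma2}/{lam2}), where {lam2} is the squared penalty parameter.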
To fit a lasso-style model, use the bayes: prefix or the bayesmh command with Laplace priors. No variable selection is performed automatically, but Bayesian analysis offers various ways to select variables. One is to use bayesstats summary to display a table of the posterior probabilities that each coefficient is different from 0 and select variables based on them.
Let's look at an example. Consider the diabetes data (Efron et al. 2004) on 442 diabetes patients, which record a measure of disease progression one year after baseline as the outcome y and 10 baseline covariates: age, sex, body mass index, mean arterial pressure, and six blood serum measurements. We use the version of these data in which each covariate is standardized to have a mean of zero and a sum of squares of one across all observations.
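If you start from the raw covariates, this standardization can be reproduced with a short loop. The snippet below is only an illustrative sketch and assumes the unstandardized diabetes covariates are already in memory under the same names:

foreach v of varlist age sex bmi map tc ldl hdl tch ltg glu {
    quietly summarize `v'
    // center at zero and rescale so that the sum of squared values is 1;
    // the sum of squared deviations equals (N-1) times the sample variance
    quietly replace `v' = (`v' - r(mean))/sqrt(r(Var)*(r(N) - 1))
}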
We would like to fit a linear regression model to y and determine which variables are important for predicting y. Efron et al. (2004) applied the least-angle-regression variable-selection method to these data. Park and Casella (2008) introduced Bayesian lasso and used it to analyze these same data. Below, we follow their approach to demonstrate how this can be done using bayes: regress.
For comparison, let's first use the traditional lasso command to select the important predictors.
. lasso linear y age sex bmi map tc ldl hdl tch ltg glu, nolog rseed(16)

Lasso linear model                          No. of obs        =       442
                                            No. of covariates =        10
Selection: Cross-validation                 No. of CV folds   =        10

--------------------------------------------------------------------------
         |                                  No. of   Out-of-      CV mean
         |                                 nonzero    sample   prediction
      ID |      Description      lambda      coef.  R-squared        error
---------+----------------------------------------------------------------
       1 |     first lambda    45.16003          0     0.0008     5934.909
      47 |    lambda before    .6254151          8     0.4896     3026.453
    * 48 |  selected lambda    .5698549          8     0.4896      3026.42
      49 |     lambda after    .5192306          8     0.4896     3026.567
      83 |      last lambda    .0219595         10     0.4891     3029.735
--------------------------------------------------------------------------
The estimated penalty parameter lambda for the selected model is 0.57. We can see the selected coefficients and their penalized estimates by typing
. lassocoef, display(coef, penalized)
------------------------
         |     active
---------+--------------
     sex |  -213.4102
     bmi |   524.8137
     map |   306.6641
      tc |  -154.2327
     hdl |  -184.5127
     tch |    58.66197
     ltg |   523.1158
     glu |    60.12939
   _cons |   152.1335
------------------------
We now use bayes: regress to fit a Bayesian linear lasso model as described by Park and Casella (2008).
Let's look at the command specification first.
. bayes, prior({y:age sex bmi map tc ldl hdl tch ltg glu},
>         laplace(0, (sqrt({sigma2}/{lam2}))))
>         prior({sigma2}, jeffreys)
>         prior({y:_cons}, normal(0, 1e6))
>         prior({lam2=1}, gamma(1, 1/1.78))
>         block({y:} {sigma2} {lam2}, split) rseed(16) dots
>       : regress y age sex bmi map tc ldl hdl tch ltg glu
For the coefficients, we use a Laplace prior with mean zero and a scale parameter that depends on the variance {sigma2} and the squared penalization hyperparameter {lam2}. Following Park and Casella (2008), we use a gamma prior with shape 1 and rate 1.78 (that is, scale 1/1.78) for {lam2}. For the intercept {y:_cons} and the variance, we use vague priors: normal(0, 1e6) and jeffreys, respectively. To improve sampling efficiency for these data, we sample all parameters in separate blocks.
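As mentioned earlier, the same model can also be specified directly with bayesmh. Here is a minimal sketch under the same prior and blocking assumptions (not run here; an initial value for {lam2}, like the {lam2=1} used with the bayes prefix, could also be supplied through bayesmh's initial() option):

. bayesmh y age sex bmi map tc ldl hdl tch ltg glu, likelihood(normal({sigma2}))
>         prior({y:age sex bmi map tc ldl hdl tch ltg glu},
>               laplace(0, (sqrt({sigma2}/{lam2}))))
>         prior({y:_cons}, normal(0, 1e6))
>         prior({sigma2}, jeffreys)
>         prior({lam2}, gamma(1, 1/1.78))
>         block({y:} {sigma2} {lam2}, split) rseed(16) dots

In what follows, we continue with the bayes: regress specification.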
Let's now run the model.
. bayes, prior({y:age sex bmi map tc ldl hdl tch ltg glu},
>         laplace(0, (sqrt({sigma2}/{lam2}))))
>         prior({sigma2}, jeffreys)
>         prior({y:_cons}, normal(0, 1e6))
>         prior({lam2=1}, gamma(1, 1/1.78))
>         block({y:} {sigma2} {lam2}, split) rseed(16) dots
>       : regress y age sex bmi map tc ldl hdl tch ltg glu

Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 10000 .........1000.........2000.........3000.........4000.........5
> 000.........6000.........7000.........8000.........9000.........10000 done

Model summary
  (output omitted)

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
y            |
         age | -2.478525   52.97851   1.26623  -2.593401  -108.3303   104.2442
         sex | -209.4461   61.21979   1.70006  -211.1479  -330.2515  -85.52568
         bmi |  522.1367   66.76557    1.8115   520.6348   393.4224   656.4993
         map |  304.1617   65.26244   1.77912   306.1749   175.0365   432.3554
          tc | -172.2847   157.5097   12.7739  -159.4816  -523.0447    110.226
         ldl |  1.304382   128.3598   9.96343  -7.796492  -251.2571   298.4382
         hdl | -158.8146   112.6562   6.82563  -158.1347  -378.4126   48.93263
         tch |  91.27437   111.8483   6.06667   86.32462   -114.675   319.0824
         ltg |  515.5167   94.06607   5.83902   509.9952   342.9893    715.739
         glu |  67.94583   62.86024   1.69235   66.11433   -51.1174   197.7894
       _cons |  152.0964   2.545592   .053095   152.0963   146.9166   157.1342
-------------+----------------------------------------------------------------
      sigma2 |  2961.246   207.0183   4.79372   2949.282   2587.023   3389.206
        lam2 |  .0889046    .055257   .001899   .0769573    .020454    .229755
------------------------------------------------------------------------------
The coefficient estimates above are somewhat similar to the penalized estimates of the variables selected by lasso.
We can also compare the estimates of the penalty parameter. The Bayesian model parameterizes it as {lam2}, the squared penalty, so we summarize sqrt({lam2}):
. bayesstats summary (lambda:sqrt({lam2}))

Posterior summary statistics                      MCMC sample size =    10,000

      lambda : sqrt({lam2})

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
      lambda |   .285446   .0861738   .003292   .2774119   .1430174   .4793277
------------------------------------------------------------------------------
Bayesian lasso estimated lambda to be 0.29, which is smaller than lasso's 0.57. The two estimates are not directly comparable, however, because lasso standardizes the covariates to have a scale of 1 during computation.
Although Bayesian lasso does not select variables automatically, we can use some of the Bayesian postestimation tools to help us select them. For instance, we can use bayesstats summary to compute the posterior probability that each coefficient is less than 0. If that probability is close to 0.5, the coefficient's posterior is centered around 0, and the corresponding variable should not be selected in the model.
. bayesstats summary (age:{y:age}<0) (sex:{y:sex}<0) (bmi:{y:bmi}<0)
>        (map:{y:map}<0) (tc:{y:tc}<0) (ldl:{y:ldl}<0) (hdl:{y:hdl}<0)
>        (tch:{y:tch}<0) (ltg:{y:ltg}<0) (glu:{y:glu}<0)

Posterior summary statistics                      MCMC sample size =    10,000

         age : {y:age}<0
         sex : {y:sex}<0
         bmi : {y:bmi}<0
         map : {y:map}<0
          tc : {y:tc}<0
         ldl : {y:ldl}<0
         hdl : {y:hdl}<0
         tch : {y:tch}<0
         ltg : {y:ltg}<0
         glu : {y:glu}<0

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
         age |     .5277   .4992571   .011353          1          0          1
         sex |     .9997   .0173188   .000224          1          1          1
         bmi |         0          0         0          0          0          0
         map |         0          0         0          0          0          0
          tc |     .8815   .3232154   .018971          1          0          1
         ldl |     .5301   .4991181   .029122          1          0          1
         hdl |     .9226   .2672384    .01198          1          0          1
         tch |     .1992   .3994187   .016507          0          0          1
         ltg |     .1992   .3994187   .016507          0          0          1
         glu |     .1417   .3487596   .008577          0          0          1
------------------------------------------------------------------------------
There are two variables, age and ldl, for which the probability that their coefficients are less than 0 is close to 0.5. So, we would drop these two variables from the model. Thus, we arrive at the same set of variables as selected earlier by lasso.
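For example, we could then refit the model with only the selected covariates, keeping the same priors and blocking (a sketch only; results not shown):

. bayes, prior({y:sex bmi map tc hdl tch ltg glu},
>         laplace(0, (sqrt({sigma2}/{lam2}))))
>         prior({sigma2}, jeffreys)
>         prior({y:_cons}, normal(0, 1e6))
>         prior({lam2=1}, gamma(1, 1/1.78))
>         block({y:} {sigma2} {lam2}, split) rseed(16)
>       : regress y sex bmi map tc hdl tch ltg glu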
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani. 2004. Least angle regression. The Annals of Statistics 32: 407–499.
Park, T., and G. Casella. 2008. The Bayesian lasso. Journal of the American Statistical Association 103: 681–686.
See all Bayesian features and the Stata Bayesian Analysis Reference Manual.