You can now account for clustered data in your lasso analysis. Ignoring clustering can lead to incorrect results when observations within the same cluster are correlated. With Stata's lasso commands, both those for prediction and those for inference, you can obtain results that account for this correlation.
With lasso commands for prediction, you simply add the cluster() option. For instance, type
. lasso linear y x1-x5000, cluster(idcode)
to account for possible correlation between observations with the same idcode during model selection. You can do this with lasso models other than linear, such as logit or Poisson, and with variable-selection methods other than lasso, such as elastic net and square-root lasso.
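For instance, commands along these lines would do the same for a logit lasso and for elastic net with a linear outcome; here y and x1-x5000 are the same placeholders as above, and z stands in for a hypothetical binary outcome:
. lasso logit z x1-x5000, cluster(idcode)
. elasticnet linear y x1-x5000, cluster(idcode)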
With lasso commands for inference, you add the vce(cluster) option. For instance, type
. poregress y x1, controls(x2-x5000) vce(cluster idcode)
to produce cluster–robust standard errors that account for clustering in idcode using partialing-out lasso for linear outcomes. The vce(cluster) option is supported with all inferential lasso commands, including the new command for treatment-effects estimation with lasso.
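The option works the same way across these estimators. As a sketch with placeholder variable names, the cross-fit partialing-out command for a linear outcome and the double-selection command for a binary outcome z would be
. xporegress y x1, controls(x2-x5000) vce(cluster idcode)
. dslogit z x1, controls(x2-x5000) vce(cluster idcode)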
We want to fit a linear-lasso model for the log of wages (ln_wage) using a set of variables and their second-order interactions. We have data on the individual's age, age; work experience, experience; job tenure, tenure; whether the individual lives in a rural area, rural; if they live in the South, south; and if they have no college education, nocollege. We want a good prediction of the log of wages given these controls.
We have repeated observations of individuals over time; we have clustered data. Each individual is identified by idcode.
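The listings that follow assume such a panel is already in memory. As one possible setup, not part of the original example, you could start from the NLS working women extract and derive the remaining variables; the derivations below are our assumptions about how experience, rural, and nocollege were constructed:
. webuse nlswork, clear
. generate experience = ttl_exp     // assumed: total work experience in years
. generate rural = not_smsa         // assumed: lives outside an SMSA
. generate nocollege = !collgrad    // assumed: no college degree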
We define the global macro $vars to hold the base variables and $controls to hold the full list of control variables, including all second-order interactions.
. global vars c.(age tenure experience) i.(rural south nocollege)
. global controls ($vars)##($vars)
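If you want to verify the individual terms that the ## operator produces, one way (not shown in the original example) is to let Stata expand the factor-variable list defined above:
. fvexpand $controls
. display "`r(varlist)'"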
We fit the lasso model and specify option cluster(idcode) to account for clustering, and we specify option rseed(1234) to make the results reproducible.
. lasso linear ln_wage $controls, cluster(idcode) rseed(1234)

10-fold cross-validation with 100 lambdas ...
(output omitted)
... change in the deviance stopping tolerance reached ... last lambda selected

Lasso linear model                          No. of obs        =     28,093
                                            No. of covariates =         45
Cluster: idcode                             No. of clusters   =      4,699
Selection: Cross-validation                 No. of CV folds   =         10

---------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+-----------------------------------------------------------------
       1 |    first lambda    .2261424         0       0.0010     .2526964
      81 |   lambda before    .0001325        23       0.3088     .1748403
    * 82 | selected lambda    .0001207        23       0.3088     .1748393
      83 |    lambda after      .00011        23       0.3088     .1748406
      92 |     last lambda    .0000476        24       0.3087     .1748552
---------------------------------------------------------------------------
* lambda selected by cross-validation.
There are 4,699 clusters. Behind the scenes, the cross-validation procedure forms its folds by idcode, keeping all observations from the same individual together, when searching for the optimal lambda.
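To inspect the cross-validation results further, postestimation commands such as cvplot and lassocoef can be used (a sketch of optional follow-up steps, not shown in the original output); cvplot graphs the CV function over lambda, and lassocoef lists the selected covariates:
. cvplot
. lassocoef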
We could now use the predict command to get predictions of ln_wage.
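For instance, a minimal sketch (the names of the new variables are ours): by default, predict uses the penalized coefficients, and the postselection option instead uses unpenalized coefficients refit on the selected covariates.
. predict ln_wage_hat
. predict ln_wage_ps, postselection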
Suppose we are not interested solely in prediction. Say we want to know the effect of job tenure (tenure) on log wages (ln_wage). All the other variables are treated as potential controls, which lasso may include in or exclude from the model. Lasso for inference lets us estimate the effect of tenure along with its standard error. Because observations on the same individual are correlated over time, we want cluster–robust standard errors at the idcode level. We fit a linear model with the double-selection lasso method by using the dsregress command.
First, we define the global macro $vars2 as the uninteracted control variables, and we interact them to form the complete set of controls in $controls2.
. global vars2 c.(age experience) i.(rural south nocollege)
. global controls2 ($vars2)##($vars2)
To fit the model and estimate cluster–robust standard errors, we use dsregress and specify the option vce(cluster idcode).
. dsregress ln_wage tenure, controls($controls2) vce(cluster idcode)

Estimating lasso for ln_wage using plugin
Estimating lasso for tenure using plugin

Double-selection linear model           Number of obs               = 28,093
                                        Number of controls          =     35
                                        Number of selected controls =     10
                                        Wald chi2(1)                = 195.87
                                        Prob > chi2                 = 0.0000

                              (Std. err. adjusted for 4,699 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      tenure |   .0228961    .001636    14.00   0.000     .0196897    .0261025
------------------------------------------------------------------------------
The 0.02 point estimate means that one additional year of job tenure increases the log of wages by 0.02, roughly a 2% increase in wages. The standard-error estimate is robust to correlation among observations within the same cluster.
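If we also want to see which of the 35 potential controls the two underlying lassos selected, one option (a sketch of a follow-up step, not part of the original output) is lassocoef with the for() option, which refers to each lasso by its dependent variable:
. lassocoef (., for(ln_wage)) (., for(tenure))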
Learn more about Stata's lasso features.
Stata Lasso Reference Manual
[LASSO] inference intro