We are increasingly faced with more data and with harder questions.
The lasso and some other machine learning techniques are reshaping the dialog about how we perform inference. They let us focus on our questions of interest and be less concerned about the unimportant parts of our model. The remainder of our model can be adequately captured by sifting through hundreds or even thousands of potential covariates or a highly nonlinear expansion of potential covariates.
Focus on what interests you and let lasso discover the features that adequately represent the rest of your model.
Stata's lasso for inference commands report coefficients, standard errors, and the like for specified variables of interest and use lasso to select, from the potential control variables you specify, the other covariates (controls) that need to appear in the model.
The inference methods are robust to model-selection mistakes that lasso might make.
Lasso is intended for prediction. It selects covariates that are jointly correlated with the variables that belong in the best-approximating model. Said differently, lasso estimates which variables belong in the model. Like all estimates, this one is subject to error.
However you put it, the inference methods are robust to these errors if the true variables are among the potential control variables that you specify.
We will show you three examples.
We are about to use double selection, but the example below applies to all the methods. Rather than dsregress, you could instead use poregress or xporegress.
We have data on 4,642 birthweights and 22 variables about the baby's mother and father. We want to know whether the mother's smoking and education affect birthweight. The variables of interest are
    i.msmoke     how much the mother smokes (categorical)
    medu         mother's education (years of schooling)
i. is how categorical variables are written in Stata.
We are going to specify the control variables as follows:
    continuous:
        mage         mother's age
        fedu         father's education
        monthslb     months since mother last gave birth

    categorical:
        i.foreign    whether mother is foreign born (0/1)
        i.alcohol    whether mother drinks during pregnancy (0/1)
        i.prenatal1  whether first prenatal visit was in first trimester (0/1)
        i.mmarried   whether mother is married to father (0/1)
        i.order      birth order of infant (0, 1, or 2)
We worry that interactions might also be important, so we are going to fit the model of bweight on i.msmoke and medu and
    i.foreign
    i.alcohol##i.prenatal1
    i.mmarried#(c.mage##c.mage)
    i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu)
That is a total of 104 covariates. Yet we do not worry about overfitting the model, because these are merely potential control variables: lasso will select the relevant ones.
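If you are unsure how these operators expand, remember that ## includes the main effects along with the interaction, # includes the interaction alone, and c. marks a variable as continuous so that it can be squared or interacted. You can preview any expansion without fitting a model by using fvexpand; a quick sketch, assuming the dataset is in memory:

. fvexpand i.alcohol##i.prenatal1      // lists main effects and the interaction term
. fvexpand c.mage##c.mage              // lists mage and the mage#mage (squared) term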
The command dsregress will select the controls and present the results for the covariates of interest:
. dsregress bweight i.msmoke medu,                                    ///
        controls(i.foreign i.alcohol##i.prenatal1                     ///
                 i.mmarried#(c.mage##c.mage)                          ///
                 i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

Estimating lasso for bweight using plugin
Estimating lasso for 1bn.msmoke using plugin
Estimating lasso for 2bn.msmoke using plugin
Estimating lasso for 3bn.msmoke using plugin
Estimating lasso for medu using plugin

Double-selection linear model          Number of obs               =      4,642
                                       Number of controls          =        104
                                       Number of selected controls =         15
                                       Wald chi2(4)                =      94.48
                                       Prob > chi2                 =     0.0000
------------------------------------------------------------------------------
             |               Robust
     bweight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      msmoke |
   1-5 daily |  -157.5933   36.54639    -4.31   0.000    -229.223   -85.96374
  6-10 daily |  -215.8084   34.53717    -6.25   0.000      -283.5   -148.1168
   11+ daily |  -260.0144   34.41246    -7.56   0.000   -327.4616   -192.5672
        medu |   3.306897   4.321033     0.77   0.444   -5.162172    11.77597
------------------------------------------------------------------------------
We find that smoking reduces birthweight: babies of mothers who smoke 1-5 cigarettes a day weigh about 158 grams less on average, and babies of mothers who smoke 11 or more a day weigh about 260 grams less. The mother's education has no statistically significant effect.
Note that the output reports that we specified 104 control variables, and lasso selected 15 of them.
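If you are curious which 15, the postestimation commands lassoinfo and lassocoef will tell you. A minimal sketch of their use after dsregress (there is one lasso for bweight and one for each variable of interest):

. lassoinfo                                    // one row per lasso: its dependent variable and number selected
. lassocoef (., for(bweight)) (., for(medu))   // list the controls selected by the bweight and medu lassos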
In the literature, the concern is often about low-birthweight babies, which weigh less than 2,500 grams.
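Our dataset includes an indicator variable, lbweight, for that event. If yours had only bweight, the indicator would be one generate away; for example:

. generate lbweight = bweight < 2500 if !missing(bweight)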
Let's fit the equivalent low-birthweight model. We will specify the same potential control variables but fit the model using dslogit instead of dsregress. (Had we wanted partialing out or cross-fit partialing out, we would instead have used pologit or xpologit.)
Here is the result.
. dslogit lbweight i.msmoke medu,                                     ///
        controls(i.foreign i.alcohol##i.prenatal1                     ///
                 i.mmarried#(c.mage##c.mage)                          ///
                 i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

Estimating lasso for lbweight using plugin
Estimating lasso for 1bn.msmoke using plugin
Estimating lasso for 2bn.msmoke using plugin
Estimating lasso for 3bn.msmoke using plugin
Estimating lasso for medu using plugin

Double-selection logit model           Number of obs               =      4,636
                                       Number of controls          =        104
                                       Number of selected controls =         18
                                       Wald chi2(4)                =      33.06
                                       Prob > chi2                 =     0.0000
------------------------------------------------------------------------------
             |               Robust
    lbweight | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      msmoke |
   1-5 daily |   .9083797   .3036388    -0.29   0.774    .4717819    1.749015
  6-10 daily |   2.518055   .4837748     4.81   0.000    1.727947    3.669443
   11+ daily |   2.042259   .4154557     3.51   0.000    1.370728    3.042778
        medu |   .9538414   .0300264    -1.50   0.133    .8967696    1.014545
------------------------------------------------------------------------------
Reported are odds ratios. We find that smoking 6-10 cigarettes a day multiplies the odds of a low-birthweight baby by about 2.5 and that smoking 11 or more multiplies them by about 2.0; both effects are statistically significant. Smoking 1-5 a day and the mother's education show no statistically significant effects.
We found no statistically significant effect of the mother's education when we fit models for birthweight and for low birthweight. The mother's education, however, is presumably endogenous. We will specify the same model and add to it: we are going to declare medu to be endogenous and specify the potential instrumental variables that wash out that endogeneity.
To fit the linear model, we previously typed
. dsregress bweight i.msmoke medu,                                    ///
        controls(i.foreign i.alcohol##i.prenatal1                     ///
                 i.mmarried#(c.mage##c.mage)                          ///
                 i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))
Where we specified medu, we will substitute
(medu = potential instruments)
In particular, we will substitute
(medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried))
There is an additional change we have to make. We fit the original model using double-selection dsregress. Double selection cannot handle instrumental variables, but partialing out and cross-fit partialing out can. We need to change dsregress to poivregress or xpoivregress. We will fit the model using cross-fit partialing out:
. xpoivregress bweight i.msmoke                                       ///
        (medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)), ///
        controls(i.foreign i.alcohol##i.prenatal1                     ///
                 i.mmarried#(c.mage##c.mage)                          ///
                 i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

Cross-fit fold 1 of 10 ...
Estimating lasso for bweight using plugin
  (output omitted)

Cross-fit partialing-out               Number of obs                  =  4,642
IV linear model                        Number of controls             =    104
                                       Number of instruments          =     42
                                       Number of selected controls    =     22
                                       Number of selected instruments =      4
                                       Number of folds in cross-fit   =     10
                                       Number of resamples            =      1
                                       Wald chi2(4)                   =  93.87
                                       Prob > chi2                    = 0.0000
------------------------------------------------------------------------------
             |               Robust
     bweight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        medu |  -5.994852   49.05562    -0.12   0.903   -102.1421     90.1524
      msmoke |
   1-5 daily |  -158.1356   38.39086    -4.12   0.000   -233.3804   -82.89094
  6-10 daily |  -213.5149    38.3374    -5.57   0.000   -288.6548   -138.3749
   11+ daily |  -259.3824   38.68729    -6.70   0.000   -335.2081   -183.5567
------------------------------------------------------------------------------
The mother's education is still not statistically significant. Notice that lasso selected 4 of the 42 instruments we specified and 22 of the 104 potential controls.
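By the way, cross-fitting is not required to handle the endogenous medu. A sketch of the partialing-out equivalent, using poivregress with the same specification:

. poivregress bweight i.msmoke                                        ///
        (medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)), ///
        controls(i.foreign i.alcohol##i.prenatal1                     ///
                 i.mmarried#(c.mage##c.mage)                          ///
                 i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))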
Don't you wish that the inference command could be shorter? The last command we fit was
. xpoivregress bweight i.msmoke                                       ///
        (medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)), ///
        controls(i.foreign i.alcohol##i.prenatal1                     ///
                 i.mmarried#(c.mage##c.mage)                          ///
                 i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))
They can be. We could have fit the same model by typing
. xpoivregress bweight i.msmoke (medu = `instr'), controls(`controls')
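Here, instr and controls are local macros that hold the wordy parts of the specification. They could be defined once at the top of a do-file; for example:

. local instr c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)
. local controls i.foreign i.alcohol##i.prenatal1 i.mmarried#(c.mage##c.mage) ///
        i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu)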
Stata's new vl command makes it easy to construct lists of variables. See [D] vl. We demonstrate the use of vl there.
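As a taste of vl, here is a minimal sketch of the workflow; the list names are our own, and the classification vl produces depends on your data:

. vl set                                        // partition variables into vlcontinuous, vlcategorical, etc.
. vl create cats = vlcategorical - (msmoke)     // custom list: all categoricals except a variable of interest
. vl create conts = vlcontinuous - (bweight medu)
. vl substitute myconstruct = i.cats##c.conts   // apply factor-variable operators to whole lists
. dsregress bweight i.msmoke medu, controls($myconstruct)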
Read more about Stata's lasso for inference commands in the Stata Lasso Reference Manual; see [LASSO] Lasso inference intro and [LASSO] Inference examples.
See Lasso for Prediction for Stata's other lasso capabilities.
See Nonparametric series regression, which can handle situations in which you know the control variables but not the functional form in which they appear in the true model.
Also see Bayesian lasso.