Methods
Double selection
Partialing out
Cross-fit partialing out
Models
Linear regression
Instrumental variables
Logistic (logit) regression
Poisson regression
Postestimation
Inference statistics for specified variables of interest
Joint hypotheses
Save estimation results to disk, including underlying lassos
Examine underlying lassos
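For example, after fitting any of the models below, a minimal postestimation sketch might look like this (the filename is hypothetical):

. estimates save bwresults      // save results to disk, including the underlying lassos
. estimates use bwresults       // reload them in a later session
. lassoinfo                     // list the lassos that were fit and how many variables each selected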
We are increasingly faced with more and more data and with harder and harder questions.
Need to sort relevant from irrelevant variables? Try lasso.
Unsure how control variables affect your outcome? Try lasso.
Concerned about nonlinearities and interactions? Try lasso.
The lasso and some other machine learning techniques are reshaping the dialog about how we perform inference. They let us focus on our questions of interest and be less concerned about the unimportant parts of our model. The remainder of our model can be adequately captured by sifting through hundreds or even thousands of potential covariates or a highly nonlinear expansion of potential covariates.
Focus on what interests you and let lasso discover the features that adequately represent the rest of your model.
Stata's lasso for inference commands report coefficients, standard errors, etc., for the specified variables of interest, and they use lasso to select, from the potential control variables you specify, the other covariates (controls) that need to appear in the model.
The inference methods are robust to model-selection mistakes that lasso might make.
Lasso is intended for prediction and selects covariates that are jointly correlated with the variables that belong in the best-approximating model. Said differently, lasso estimates which variables belong in the model. Like all estimation, this is subject to error.
However you put it, the inference methods are robust to these errors if the true variables are among the potential control variables that you specify.
We will show you three examples.
We are about to use double selection, but the example below applies to all three methods: rather than dsregress, you could use poregress or xporegress.
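The three commands share a common syntax. Schematically, with a placeholder outcome y, variable of interest d, and potential controls x1 through x100:

. dsregress  y d, controls(x1-x100)      // double selection
. poregress  y d, controls(x1-x100)      // partialing out
. xporegress y d, controls(x1-x100)      // cross-fit partialing out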
We have data on 4,642 birthweights and 22 variables about the baby's mother and father. We want to know whether the mother's smoking and education affect birthweight. The variables of interest are:
    i.msmoke     how much the mother smokes (categorical)
    medu         mother's education (years of schooling)
The i. prefix is how categorical variables are written in Stata.
We are going to specify the control variables as follows:
continuous:
    mage          mother's age
    fedu          father's education
    monthslb      months since mother last gave birth

categorical:
    i.foreign     if mother is foreign born (0/1)
    i.alcohol     if mother drinks during pregnancy (0/1)
    i.prenatal1   if first prenatal visit was in first trimester (0/1)
    i.mmarried    if mother is married to father (0/1)
    i.order       birth order of infant (1st, 2nd, ...)
We worry that interactions might also be important, so we are going to fit the model of bweight on i.msmoke, medu, and the following controls:

    i.foreign
    i.alcohol##i.prenatal1
    i.mmarried#(c.mage##c.mage)
    i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu)
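As a reminder of how Stata's factor-variable operators expand (this is just notation, not new model terms):

// i.alcohol##i.prenatal1  expands to  i.alcohol i.prenatal1 i.alcohol#i.prenatal1
// c.mage##c.mage          expands to  mage and mage#mage (that is, mage squared)
// c.mage#c.fedu           is the interaction alone, without the main effects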
That is a total of 104 covariates. Yet we do not worry about overfitting the model, because the control variables that we specify are potential control variables. Lasso will select the relevant ones.
The command dsregress will select the covariates and present the results for the covariates of interest:
. dsregress bweight i.msmoke medu,
>     controls(i.foreign i.alcohol##i.prenatal1
>              i.mmarried#(c.mage##c.mage)
>              i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

Estimating lasso for bweight using plugin
Estimating lasso for 1bn.msmoke using plugin
Estimating lasso for 2bn.msmoke using plugin
Estimating lasso for 3bn.msmoke using plugin
Estimating lasso for medu using plugin

Double-selection linear model           Number of obs               =      4,642
                                        Number of controls          =        104
                                        Number of selected controls =         15
                                        Wald chi2(4)                =      94.48
                                        Prob > chi2                 =     0.0000
------------------------------------------------------------------------------
             |               Robust
     bweight | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      msmoke |
   1-5 daily |  -157.5933   36.54639    -4.31   0.000     -229.223   -85.96374
  6-10 daily |  -215.8084   34.53717    -6.25   0.000       -283.5   -148.1168
   11+ daily |  -260.0144   34.41246    -7.56   0.000    -327.4616   -192.5672
             |
        medu |   3.306897   4.321033     0.77   0.444    -5.162172    11.77597
------------------------------------------------------------------------------
We find
the more the mother smokes, the less the baby weighs.
the mother's education has only a trivial effect on birthweight (about 3 grams per year of schooling), and it is not significant.
Note that the output reports that we specified 104 control variables, and lasso selected 15 of them.
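If you want to see exactly which controls were selected, lassocoef reports them lasso by lasso; a quick sketch:

. lassocoef (., for(bweight))     // controls selected by the lasso for bweight
. lassocoef (., for(medu))        // controls selected by the lasso for medu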
In the literature, the concern is often about low-birthweight babies, which weigh less than 2,500 grams.
Let's fit the equivalent low-birthweight model. We will specify the same potential control variables but fit the model using dslogit instead of dsregress. (If we preferred partialing out or cross-fit partialing out, we could instead use pologit or xpologit.)
Here is the result.
. dslogit lbweight i.msmoke medu,
>     controls(i.foreign i.alcohol##i.prenatal1
>              i.mmarried#(c.mage##c.mage)
>              i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

Estimating lasso for lbweight using plugin
Estimating lasso for 1bn.msmoke using plugin
Estimating lasso for 2bn.msmoke using plugin
Estimating lasso for 3bn.msmoke using plugin
Estimating lasso for medu using plugin

Double-selection logit model            Number of obs               =      4,636
                                        Number of controls          =        104
                                        Number of selected controls =         18
                                        Wald chi2(4)                =      33.06
                                        Prob > chi2                 =     0.0000
------------------------------------------------------------------------------
             |               Robust
    lbweight | Odds ratio   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      msmoke |
   1-5 daily |   .9083797   .3036388    -0.29   0.774     .4717819    1.749015
  6-10 daily |   2.518055   .4837748     4.81   0.000     1.727947    3.669443
   11+ daily |   2.042259   .4154557     3.51   0.000     1.370728    3.042778
             |
        medu |   .9538414   .0300264    -1.50   0.133     .8967696    1.014545
------------------------------------------------------------------------------
Reported are odds ratios. We find
smoking 1-5 cigarettes per day lowers the odds that the baby is born with a low birthweight (the odds ratio is less than 1), but the effect is not significant. Smoking more than five cigarettes per day, however, significantly increases the odds that the baby will weigh less than 2,500 grams.
the mother's education is still not significant.
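The smoking effects can also be tested jointly; standard postestimation commands such as test and testparm work after these estimators. For example:

. testparm i.msmoke     // joint Wald test that all smoking effects are zero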
We found no statistically significant effect of the mother's education when we fit models for birthweight and low birthweight. The mother's education, however, is presumably endogenous. We will specify the same model and add to it: we are going to treat medu as endogenous and specify the potential instruments that wash out that endogeneity.
To fit the linear model, we previously typed
. dsregress bweight i.msmoke medu,
>     controls(i.foreign i.alcohol##i.prenatal1
>              i.mmarried#(c.mage##c.mage)
>              i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))
Where we specified medu, we will substitute
(medu = potential instruments)
In particular, we will substitute
(medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried))
There is an additional change we have to make. We fit the original model using double-selection dsregress. Double selection cannot handle instrumental variables, but partialing out and cross-fit partialing out can. We need to change dsregress to poivregress or xpoivregress. We will fit the model using cross-fit partialing out:
. xpoivregress bweight i.msmoke
>     (medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)),
>     controls(i.foreign i.alcohol##i.prenatal1
>              i.mmarried#(c.mage##c.mage)
>              i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

Cross-fit fold 1 of 10 ...
Estimating lasso for bweight using plugin
  (output omitted)

Cross-fit partialing-out                Number of obs                  =  4,642
IV linear model                         Number of controls             =    104
                                        Number of instruments          =     42
                                        Number of selected controls    =     27
                                        Number of selected instruments =      5
                                        Number of folds in cross-fit   =     10
                                        Number of resamples            =      1
                                        Wald chi2(4)                   =  97.20
                                        Prob > chi2                    = 0.0000
------------------------------------------------------------------------------
             |               Robust
     bweight | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        medu |  -39.27263   40.76139    -0.96   0.335    -119.1635    40.61822
             |
      msmoke |
   1-5 daily |  -172.9989    38.1835    -4.53   0.000    -247.8372   -98.16065
  6-10 daily |  -229.9561   36.82347    -6.24   0.000    -302.1288   -157.7834
   11+ daily |  -275.7334   37.11482    -7.43   0.000    -348.4771   -202.9897
------------------------------------------------------------------------------
The mother's education is still not significant. Notice that lasso selected 5 of the 42 potential instruments we specified.
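One note on reproducibility: cross-fitting splits the sample into folds at random, so results vary from run to run unless you set the random-number seed with the rseed() option. A sketch, reusing the command above with its arguments elided and an arbitrary seed:

. xpoivregress bweight i.msmoke (medu = ...), controls(...) rseed(12345)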
Don't you wish that the inference command could be shorter? The last command we fit was
. xpoivregress bweight i.msmoke
>     (medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)),
>     controls(i.foreign i.alcohol##i.prenatal1
>              i.mmarried#(c.mage##c.mage)
>              i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))
It can be. We could have fit the same model by typing
. xpoivregress bweight i.msmoke (medu = `instr'), controls(`controls')
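For this to work, the local macros `instr' and `controls' must be defined first. A minimal sketch, using the specifications from above:

. local instr c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)
. local controls i.foreign i.alcohol##i.prenatal1 i.mmarried#(c.mage##c.mage) i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu)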
Stata's vl command makes it easy to construct such lists of variables; see [D] vl, where we demonstrate its use.
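To give a flavor of vl (the list names below are our own, and this simplified sketch includes only main effects, not the interactions used above):

. vl set                                            // classify the dataset's variables
. vl create myfactors = (foreign alcohol prenatal1 mmarried order)
. vl create mycontinuous = (mage fedu monthslb)
. vl substitute controls = i.myfactors c.mycontinuous
. dsregress bweight i.msmoke medu, controls($controls)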
Read more about Stata's lasso for inference commands in the Stata Lasso Reference Manual; see [LASSO] Lasso inference intro and [LASSO] Inference examples.
See Lasso for Prediction for Stata's other lasso capabilities.
See Nonparametric series regression, which can handle situations in which you know the control variables but not the functional form in which they appear in the true model.
Also see Bayesian lasso.