In the spotlight: Estimating treatment effects with lasso
Estimating treatment effects in the potential outcome framework is a powerful tool for evaluating the effectiveness of a treatment based on observational data. However, in the presence of high-dimensional data, researchers often face a dilemma about how to build the model. On one hand, we want to gain deeper insight by making good use of large amounts of data. On the other hand, the more complex the model is, the more difficult it is to fit. We want to include more variables in the model, but traditional estimation techniques cannot fit such models.
To resolve this conflict, we need to use model selection techniques such as lasso to select the variables that matter. At the same time, we want our estimator to be robust to model selection mistakes. In other words, we want our estimation results to still be valid even if lasso omits some important variables or includes some extra variables.
The new telasso command is designed to estimate treatment effects with many control variables and be robust to model selection mistakes. Through an example that compares two types of lung transplants, we will illustrate the dilemma or conflict of including many variables in the treatment-effects estimation and show how to use telasso to reconcile this conflict.
Lung transplant data and control variables
First, we introduce the data and construct the control variables for both the outcome model and the treatment model.
Suppose we want to compare two types of lung transplants. Bilateral lung transplant (BLT) is usually associated with a higher death rate in the short term after the operation but with a more significant improvement in life quality than the single lung transplant (SLT). As a result, for patients who need to decide between these two treatment options, knowing the effect of BLT (versus SLT) on quality of life is essential. We can measure the quality of life based on an individual’s forced expiratory volume in one second (FEV1).
We have a fictional dataset (lung.dta) inspired by Koch, Vock, and Wolfson (2018). The outcome (fev1p) is FEV1% measured one year after the operation. FEV1% is the percentage of FEV1 that the patient has relative to a healthy person with similar characteristics. The treatment variable (transtype) indicates whether the treatment is BLT or SLT.
To start, we open the dataset and describe it.
    . use https://www.stata-press.com/data/r17/lung, clear
    (Fictional data on lung transplant)

    . describe *, short
      (output omitted)
In addition to our treatment and outcome variables, we have 29 variables that record characteristics of the patients and donors. To construct control variables, we want to use these 29 variables and the interactions among them. It would be tedious to type these variable names one by one to distinguish between continuous and categorical variables. vl is a suite of commands that simplifies this process. First, we use vl set to partition the variables into continuous and categorical variables automatically. The global macro $vlcategorical contains all the categorical variable names, and $vlcontinuous contains all the continuous variable names.
    . quietly vl set

    . display `"$vlcategorical"'
    diabetesp karn racep sexp lifesvent assisvent o2rest raced smoked cmv deathcause diabetesd
    > expandd sexd lungalloc genderm racem transtype
Second, we use vl create to create customized variable lists. Specifically, $cvars contains all the continuous variables except the outcome (fev1p), and $fvars contains all the categorical variables except the treatment (transtype). Finally, vl sub (short for vl substitute) defines the global macro $allvars as the continuous variables in $cvars, the categorical variables in $fvars, and the full set of interactions between the two. We will use $allvars as the control variables for both the outcome model and the treatment model.
    . vl create cvars = vlcontinuous - (fev1p)
    note: $cvars initialized with 12 variables.

    . vl create fvars = vlcategorical - (transtype)
    note: $fvars initialized with 17 variables.

    . vl sub allvars = c.cvars i.fvars c.cvars#i.fvars
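Before moving on, it can help to check what the variable lists actually contain. The lines below are a minimal sketch, assuming lung.dta is loaded and the vl commands above have already been run; nothing here comes from the original example beyond the list names.

```stata
* Sketch: inspect the variable lists built above.
vl dir                          // overview of the variable lists currently defined
vl list cvars                   // members of the user-defined list cvars
vl list fvars                   // members of the user-defined list fvars

* Count the names in each global macro; these should agree with the
* notes printed by vl create (12 and 17).
display `: word count $cvars'
display `: word count $fvars'

* Show the factor-variable specification stored in $allvars.
display "$allvars"
```

If a variable ends up in the wrong list, it is easier to fix that here than after the models have been fit.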
Dilemma: To include or not to include?
We have created the control variables, and we want to include all of them to estimate the treatment effect of a bilateral lung transplant versus a single lung transplant. The first question, however, is whether we can fit such a model at all. So let's try!
teffects is a Stata command that provides multiple estimators for treatment effects. We will try to use its augmented inverse-probability-weighting estimator, teffects aipw, to estimate the treatment effect with all the controls included.
    . capture noisily teffects aipw (fev1p $allvars) (transtype $allvars)
    Note: tmodel mlogit initial estimates did not converge; the model may not be identified
    treatment 0 has 297 propensity scores less than 1.00e-05
    treatment 1 has 205 propensity scores less than 1.00e-05
    treatment overlap assumption has been violated; use option osample() to identify the overlap violators
teffects exits with an error reporting that the overlap assumption has been violated. The overlap assumption means that each patient has a strictly positive probability of receiving either treatment. In other words, given any patient in the treatment group, the overlap assumption implies that we can find a similar patient in the control group. That is, there is an overlap between the treatment and control groups.
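Stated a little more formally, and using notation that is our own assumption rather than anything in the original example (T is the treatment indicator, with T = 1 for BLT, and X is the vector of controls), the overlap assumption requires the propensity score to stay strictly between 0 and 1:

```latex
0 < \Pr(T = 1 \mid X = x) < 1 \quad \text{for every value } x \text{ of the controls.}
```

The error above reports hundreds of estimated propensity scores below 1.00e-05, that is, fitted probabilities that are numerically at the lower boundary.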
In our example, including all of these controls violates the overlap assumption because some specific combination of values of the control variables appears in either the treatment group or the control group but not both. The more control variables there are, the more difficult it is to satisfy the overlap assumption.
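The error message itself suggests the osample() option for identifying the overlap violators. The lines below are a sketch only: viol is a hypothetical name for the new indicator variable, and because the treatment model did not converge here, we do not assume the variable is created.

```stata
* Sketch: flag observations that violate the overlap assumption.
* 'viol' is a hypothetical variable name, not part of the original example.
capture drop viol
capture noisily teffects aipw (fev1p $allvars) (transtype $allvars), osample(viol)

* Tabulate the violators by treatment group, but only if viol exists.
capture confirm variable viol
if _rc == 0 {
    tabulate transtype viol
}
```

A large share of flagged observations would be one more sign that the model with all the controls asks too much of the data.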
The dilemma is that including all the controls makes the model inestimable, but leaving some of them out may make our model too simple to approximate reality.
telasso: Select variables that matter
Now we try to fit the same model as above using telasso. As the name indicates, telasso is a combination of teffects and lasso. So we are using lasso to select variables in the treatment and outcome models and using the selected variables in the treatment-effects estimation.
We assume a linear outcome model and a logit treatment model. We type
    . telasso (fev1p $allvars) (transtype $allvars)

    Estimating lasso for outcome fev1p if tran~e = 0 using plugin method ...
    Estimating lasso for outcome fev1p if tran~e = 1 using plugin method ...
    Estimating lasso for treatment tran~e using plugin method ...
    Estimating ATE ...

    Treatment-effects lasso estimation    Number of observations      =    937
    Outcome model:   linear               Number of controls          =    454
    Treatment model: logit                Number of selected controls =      8
    ------------------------------------------------------------------------------
                 |               Robust
           fev1p | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    ATE          |
       transtype |
    (BLT vs SLT) |   37.51841   .1606703   233.51   0.000     37.20351    37.83332
    -------------+----------------------------------------------------------------
    POmean       |
       transtype |
             SLT |    46.4938   .2021582   229.99   0.000     46.09757    46.89002
    ------------------------------------------------------------------------------
In contrast to the teffects results in the above section, telasso can estimate the treatment effects when we include all the controls. The difference is that telasso selects only 8 variables among the 454 control variables. So telasso selects only variables that matter.
More importantly, the estimator implemented in telasso is robust to the model selection mistakes made by lasso. Thus, the estimation results are still valid even if some important variables are not included in the eight selected variables or if some extra variables are included in them.
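To see which controls lasso kept, the lasso postestimation commands can be run after telasso. The sketch below assumes the telasso results are still in memory and that the treatment levels are coded 0 and 1, as the estimation log suggests. The for() and tlevel() specifications are written from our reading of the lasso postestimation syntax and should be checked against [TE] telasso postestimation.

```stata
* Sketch: inspect the lassos that telasso fit (run right after telasso).
lassoinfo                         // one row per lasso: selection method, lambda, number selected

* List the selected controls side by side:
* outcome lasso at each treatment level, then the treatment lasso.
lassocoef (., for(fev1p) tlevel(0)) ///
          (., for(fev1p) tlevel(1)) ///
          (., for(transtype))
```

The columns show where the outcome and treatment lassos agree and where they differ.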
The estimation results can be interpreted as usual. If all patients were to choose an SLT, the average FEV1% would be expected to be about 46%. If all patients were instead to choose a BLT, the average FEV1% would be expected to be about 38 percentage points higher.
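To make the arithmetic behind this interpretation explicit, the sketch below adds the reported ATE to the reported potential-outcome mean for SLT, and then asks telasso for both potential-outcome means directly with the pomeans statistic. It assumes the data and $allvars from above. The two numbers typed into display are copied from the output table.

```stata
* Implied average FEV1% if every patient received a BLT:
* POmean for SLT plus the ATE, both copied from the table above.
display 46.4938 + 37.51841          // roughly 84

* The same effect relative to the SLT mean, in percent (roughly 81),
* which is not the same thing as the 38-point difference.
display 100 * 37.51841 / 46.4938

* Sketch: report the potential-outcome means for both treatments directly.
telasso (fev1p $allvars) (transtype $allvars), pomeans nolog
```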
Double machine learning
The estimates obtained above relied on a critical assumption of lasso, the sparsity assumption, which requires that only a small number of the potential covariates are in the “true” model. We can use a double machine learning technique to allow for more covariates in the true model. To do this, we add the xfolds(5) option to split the sample into five groups and perform cross-fitting, and we add the resample(3) option to repeat the cross-fitting procedure on three different random splits of the sample.
To guarantee that we can later reproduce the estimation results, we also set the random-number seed. We type
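    . set seed 12345671

    . telasso (fev1p $allvars) (transtype $allvars), xfolds(5) resample(3) nolog

    Treatment-effects lasso estimation    Number of observations       =    937
                                          Number of controls           =    454
                                          Number of selected controls  =     16
    Outcome model:   linear               Number of folds in cross-fit =      5
    Treatment model: logit                Number of resamples          =      3
    ------------------------------------------------------------------------------
                 |               Robust
           fev1p | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    ATE          |
       transtype |
    (BLT vs SLT) |   37.52837   .1683194   222.96   0.000     37.19847    37.85827
    -------------+----------------------------------------------------------------
    POmean       |
       transtype |
             SLT |    46.4941   .2040454   227.86   0.000     46.09418    46.89402
    ------------------------------------------------------------------------------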
The estimated treatment effect is similar to the one from the first telasso command, although the selected model now includes 16 controls instead of 8. The similarity of the estimates across the two specifications suggests that our first model did not violate the sparsity assumption.
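One way to put the two sets of estimates next to each other is to store each fit and tabulate them together. This is a sketch under the assumption that the standard estimates store and estimates table commands work with telasso results as they do with other estimation commands. The names plugin and xfit are hypothetical labels of our own.

```stata
* Sketch: compare the plugin fit with the cross-fit (double machine learning) fit.
quietly telasso (fev1p $allvars) (transtype $allvars)
estimates store plugin            // hypothetical name for the first fit

set seed 12345671                 // same seed as above, so the random splits can be reproduced
quietly telasso (fev1p $allvars) (transtype $allvars), xfolds(5) resample(3)
estimates store xfit              // hypothetical name for the cross-fit results

* Coefficients and standard errors side by side.
estimates table plugin xfit, b(%9.3f) se(%9.3f)
```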
Concluding remarks
I showed the conflicts that researchers face when estimating treatment effects with many control variables and how to use telasso to resolve them. To learn more about estimating treatment effects using lasso, see [TE] telasso.
Reference
Koch, B., D. M. Vock, and J. Wolfson. 2018. Covariate selection with group lasso and doubly robust estimation of causal effects. Biometrics 74: 8–17. https://doi.org/10.1111/biom.12736
— by Di Liu
Senior Econometrician and Software Developer