Regression combining common complications
    Endogenous covariates
    Sample selection
    Nonrandom treatment assignment
        Exogenous, based on observed variables
        Endogenous, involving unobservable variables

Outcome types
    Continuous
    Interval-measured (interval-censored)
    Binary
    Ordinal

Endogenous covariate types
    Continuous
    Binary
    Ordinal
    Interactions with other covariates
    Quadratic and other polynomial forms

Treatment effects/Causal inference
    Binary and ordinal treatment
    Average treatment effects (ATEs)
    ATEs on the treated (ATETs)
    ATEs on the untreated (ATEUs)
    Potential-outcome means (POMs)
    ATEs, ATETs, ATEUs, and POMs for
        Full population
        Subpopulations
        Expected values for specific covariate values
    Advanced inferences

Inference statistics
    Expected means and probabilities
    Marginal effects and contrasts
    Average structural functions (ASFs)
    More ...
    Conditional analysis—specify values of all covariates
    Population-averaged—specify values of some covariates, or no covariates, and average (margin) over the rest
    Tests against zero, tests of equality, CIs, and more
    Inferences and plots over groups
We call them ERMs—extended regression models. There are four new commands that fit
linear models
linear models with interval-censored outcomes, including tobit models
probit models
ordered probit models
with any combination of
endogenous covariates
sample selection
nonrandom treatment assignment, both exogenous and endogenous
within-panel correlation
Here are some of the features in discipline-specific terminology:
bias due to unmeasured confounding
trials with informative dropout
causal inference
average causal effects (ACEs)
average treatment effects (ATEs)
simultaneous causality, in linear models
outcomes that are missing not at random (MNAR)
nonignorable nonresponse
selection on unobservables
Heckman selection
random effects
two-level models
within-group correlation
All the above are addressed by one or more of endogenous covariates, sample selection (missingness), and nonrandom treatment assignment. ERMs are not black magic. ERMs let you model the problems that your data have.
The syntax of ERMs is a command, such as eregress, followed by the main equation and then followed by one or more of the options endogenous(), select(), and entreat() or extreat(). The options may be specified in any combination. For instance,
Linear regression of y on x1 and x2
. eregress y x1 x2

Make covariate x2 endogenous
. eregress y x1, endogenous(x2 = x3 x4)

Add sample selection
. eregress y x1, endogenous(x2 = x3 x4) select(selected = x2 x6)

Add exogenous treatment and drop sample selection
. eregress y x1, endogenous(x2 = x3 x4) extreat(treated)

Replace exogenous treatment with endogenous treatment
. eregress y x1, endogenous(x2 = x3 x4) entreat(treated = x2 x3 x5)

Add sample selection
. eregress y x1, endogenous(x2 = x3 x4) entreat(treated = x2 x3 x5) select(selected = x2 x6)
Look carefully, and you will notice that we specified endogenous covariates in both selection and treatment equations. That ERMs can fit such models is remarkable. ERMs have one syntax and four options. The endogenous() option can be repeated when necessary:
Make x2 and x3 endogenous
. eregress y x1, endogenous(x2 = x3 x4) endogenous(x3 = x1 x5)
Endogenous variable x3 in this example appears in the equations for both y and x2. If x3 were not to appear in the main equation, we would have typed
Remove x3 from the main equation
. eregress y x1, endogenous(x2 = x3 x4) endogenous(x3 = x1 x5, nomain)
Even when we specify nomain, we can include the variables in the main equation as long as we do so explicitly:
. eregress y x1 x2 x3, endogenous(x2 = x3 x4, nomain) endogenous(x3 = x1 x5, nomain)
The same syntax that works with eregress to fit linear regression models also works with eintreg to fit interval regression models, eprobit to fit probit models, and eoprobit to fit ordered probit models. For instance,
y is binary, model is probit
. eprobit y x1, endogenous(x2 = x3 x4) endogenous(x3 = x1 x5, nomain)
Endogenous equations can themselves be probit or ordered probit. In the following model, endogenous covariate x3 is binary, and it is modeled using probit:
x3 is now a binary endogenous covariate
. eprobit y x1, endogenous(x2 = x3 x4) endogenous(x3 = x1 x5, nomain probit)
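An ordinal endogenous covariate is handled the same way; we would use the oprobit suboption instead of probit. Below is a sketch in which we assume x3 is ordinal rather than binary:

x3 is now an ordinal endogenous covariate
. eprobit y x1, endogenous(x2 = x3 x4) endogenous(x3 = x1 x5, nomain oprobit)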
There is one more syntax extension. Add xt to the beginning of any command name to fit a random-effects model. We can use xteregress, xteintreg, xteprobit, and xteoprobit to fit models for panel data. For instance,
. xteregress y x1, endogenous(x2 = x3 x4) endogenous(x3 = x1 x5, nomain)
And for a binary outcome,
. xteprobit y x1, endogenous(x2 = x3 x4) endogenous(x3 = x1 x5, nomain)
We are going to fit the following model:
. eregress bmi sex steps, endog(steps = sex distance, nomain) select(selected = sex steps education)
We will build up to fitting the model by relating the fictional story behind it, but first, notice that variable steps is endogenous and appears in both the main equation and the selection equation. We will have to account for that endogeneity if we hope to draw a causal inference about the effect of walking on body mass index.
Can ERMs really fit such models? Yes. We ran the model on simulated data and verified that the coefficients we are about to show you match the true parameters. Can other commands of Stata fit the same models as ERMs? Sometimes. There is no other Stata command that will fit a linear model with selection and with an endogenous covariate, but if variable steps were not endogenous, we could fit the model using Stata's heckman command. Nonetheless, ERMs are easier to use. And ERMs provide a richer set of model-interpretation features. Regardless, the important feature of ERMs is that they will fit a wider range of models, like the one we are about to fit:
. eregress bmi sex steps, endog(steps = sex distance, nomain) select(selected = sex steps education)
The story behind this model concerns a (fictional) national study on the benefits of walking. This study is intended to measure those benefits in terms of the effects of steps walked per day (steps) on body-mass index (bmi).
A random sample was drawn and people were recruited to join the experiment. Some declined. We are going to ignore any bias in that. If they agreed, however, they were weighed, their height measured, their educational level recorded, and they were given a pedometer to be returned by prepaid post after six weeks. Some never returned it.
Our statistical concern is that those who did not return the pedometer might be systematically different from those who did. Perhaps they are less likely to exercise. Perhaps their bmi is higher than average. Remember that our goal is to measure the relationship between bmi and steps for the entire population.
Our other statistical concern has to do with unmeasured healthiness. People who walk more may be engaging in other activities that improve their health. We are worried that we have unobserved confounders. Said differently, we are worried that the error in bmi is correlated with the number of steps walked and thus bmi is endogenous.
We fit the model. Here are the results:
. eregress bmi sex steps, endog(steps = sex distance, nomain) select(selected = sex steps education)

Iteration 0:   log likelihood = -1422.7302
Iteration 1:   log likelihood = -1420.2741
Iteration 2:   log likelihood = -1419.9652
Iteration 3:   log likelihood = -1419.9611
Iteration 4:   log likelihood = -1419.9611

Extended linear regression                      Number of obs     =        500
                                                Selected          =        302
                                                Nonselected       =        198
                                                Wald chi2(2)      =     640.17
Log likelihood = -1419.9611                     Prob > chi2       =     0.0000
--------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+-----------------------------------------------------------------
bmi           |
          sex |   -1.080003   .3218772    -3.36   0.001    -1.710871   -.4491354
        steps |   -2.225672   .0891093   -24.98   0.000    -2.400323   -2.051021
        _cons |    35.68498   .5815979    61.36   0.000     34.54507    36.82489
--------------+-----------------------------------------------------------------
selected      |
          sex |    .8330193   .2647175     3.15   0.002     .3141825    1.351856
        steps |    .2694679   .0886263     3.04   0.002     .0957635    .4431723
    education |    1.053498   .1027103    10.26   0.000     .8521891    1.254806
        _cons |   -16.63009   1.963632    -8.47   0.000    -20.47873   -12.78144
--------------+-----------------------------------------------------------------
steps         |
          sex |    .3393479   .1044252     3.25   0.001     .1346783    .5440176
     distance |    -.985911   .0240427   -41.01   0.000    -1.033034   -.9387881
        _cons |    9.035609   .0711241   127.04   0.000     8.896208     9.17501
--------------+-----------------------------------------------------------------
   var(e.bmi) |    7.916253   .7247563                      6.615911    9.472174
 var(e.steps) |    .8907777   .0563377                      .7869273    1.008333
--------------+-----------------------------------------------------------------
corr(e.sel~d, |
       e.bmi) |    .6676526   .0960975     6.95   0.000     .4355011    .8165333
corr(e.steps, |
       e.bmi) |     .600721   .0400543    15.00   0.000     .5164193    .6734909
corr(e.steps, |
  e.selected) |    .2030564    .123501     1.64   0.100    -.0465152    .4287674
--------------------------------------------------------------------------------
The parameter estimates are presented in five parts.
The first part reports the bmi equation.
The second part reports the selected equation.
The third part reports the steps equation.
The fourth part reports the error variances.
The fifth part reports the correlations between errors.
Let's start with the last part.
We worried that the errors in steps and bmi would be positively correlated, both being affected by unobserved healthiness. The output reports that the errors are indeed correlated. The estimated corr(e.steps, e.bmi) is 0.6, and it is whoppingly significant.
We worried that the error in selected would be correlated with the error in bmi, and it is. The estimated corr(e.selected, e.bmi) is 0.67, and it is significant too.
Our concerns are justified by the data, and, because we specified the select() and endogenous() options, the results reported in the main equation are adjusted for them. The results reported for the bmi equation are just as if we had fit the model using ordinary regression on randomly selected data that had none of these problems.
The coefficient in the bmi equation that most interests us is the coefficient on steps. It is -2.23, meaning that bmi is reduced by 2.23 for every 1,000 steps walked per day. This is not a small effect. The average bmi in our data is 23.
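If you want to see what that means in terms of expected BMI, you could follow the fit with margins. This is only a sketch; the values of steps below are hypothetical, and depending on your question, one of the counterfactual predictions described in the ERM manual may be more appropriate:

Expected bmi at a few hypothetical values of steps
. margins, at(steps=(4 6 8 10))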
Was it important that we accounted for endogeneity and selection? To show you that it was, we ran three other models:
. eregress bmi sex steps
. eregress bmi sex steps, endog(steps = sex distance, nomain)
. eregress bmi sex steps, select(selected = sex steps education)
The coefficients on steps were different in each model, and not a single 95% confidence interval included -2.3, the true value with which the data were simulated.
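One way to see those differences side by side is to store each fit and tabulate the coefficients on steps. The following is merely a sketch of that bookkeeping; the stored names are arbitrary:

. eregress bmi sex steps
. estimates store noadjust
. eregress bmi sex steps, endog(steps = sex distance, nomain)
. estimates store endogonly
. eregress bmi sex steps, select(selected = sex steps education)
. estimates store selonly
. eregress bmi sex steps, endog(steps = sex distance, nomain) select(selected = sex steps education)
. estimates store full
. estimates table noadjust endogonly selonly full, keep(bmi:steps) b se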
Treatment-effect models are popular these days, and for good reason. Much of what researchers do involves evaluations of the effects of drugs, treatments, or programs.
In social sciences, evaluations are usually performed on observational data, another word for naturally occurring data. Even when data are custom to the purpose, they are seldom from well-controlled experiments. People opt in or out voluntarily. Even those who volunteer may not honor their obligations.
Consider the plight of a fictional university wanting to evaluate its freshman program intended to increase students' probabilities of ultimate graduation. This is a classic treatment-effects problem. Some students were treated (took the program) and others were not. The university now wants to measure the effect of the program.
The program was voluntary, meaning that students who are highly motivated might be more likely to participate. If highly motivated students are more likely to graduate in any case and if we ignored this problem, then the program would appear to affect college graduation rates more than it really does.
To measure the effect of the program, we need to do everything possible to control for each student's original chances of success. The model we will fit is
. eprobit graduate income i.roommate, entreat(program = i.campus income) endogenous(hsgpa = income i.hscomp)
The main probit equation models graduation, a 0/1 variable. We model student graduation on parents' income, whether the student had a roommate who was also a student (i.roommate), and high school GPA (hsgpa).
Option entreat() handles the endogenous treatment assignment. We model students' choice of treatment on (1) whether their first-year residence was on campus (i.campus) and (2) their parents' income (income). Both variables, we believe, affect the probability of participation.
Finally, we think high school GPA is endogenous because we believe it is correlated with unobserved ability and motivation.
We fit the model, and the output looks something like this:
. eprobit graduate income i.roommate, entreat(program = i.campus income) endogenous(hsgpa = income i.hscomp)

Extended probit regression                      Number of obs     =      7,127
                                                Wald chi2(8)      =    1122.83
Log likelihood = -7920.6341                     Prob > chi2       =     0.0000
--------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+-----------------------------------------------------------------
graduate      |
              |  (output omitted)
--------------+-----------------------------------------------------------------
program       |
              |  (output omitted)
--------------+-----------------------------------------------------------------
hsgpa         |
              |  (output omitted)
--------------+-----------------------------------------------------------------
var           |
              |  (output omitted)
--------------+-----------------------------------------------------------------
corr          |
              |  (output omitted)
--------------------------------------------------------------------------------
We will show you the omitted parts, but first realize that the output appears in the same groupings as it did in the previous example. Equations are reported first (we have three of them), then variances, and, finally, correlations.
The second equation—program—is the treatment choice model. We want to start there. Our treatment choice model was specified by the entreat() option:
. eprobit graduate income i.roommate, entreat(program = i.campus income) endogenous(hsgpa = income i.hscomp)
The output for the treatment equation is
program       |
       campus |    .6629004   .0467013    14.19   0.000     .5713675    .7544334
       income |   -.0772836   .0050832   -15.20   0.000    -.0872465   -.0673207
        _cons |   -.3417554   .0509131    -6.71   0.000    -.4415433   -.2419675
We find that living on campus and coming from a lower-income family increase the chances that a student participates in the program. The negative coefficient on income did not surprise us. Our interpretation is that motivated students from poorer families expected they would have more to gain from the program.
Is the error in the treatment equation positively correlated with the error in the graduation equation? That correlation—corr(e.program, e.graduate)—is in the last part of the output,
corr(e.pro~m, |
  e.graduate) |    .2610808   .1162916     2.25   0.025     .0226638    .4713994
corr(e.hsgpa, |
  e.graduate) |    .2905934   .0633915     4.58   0.000      .162068    .4094238
corr(e.hsgpa, |
   e.program) |   -.0024032    .015235    -0.16   0.875    -.0322522    .0274501
corr(e.program, e.graduate) is 0.26 and significant at the 5-percent level, providing evidence that treatment choice was indeed endogenous.
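For comparison, had we been willing to assume that program participation was exogenous, we would have specified extreat() instead of entreat(). Here is a sketch of that alternative model, which we did not fit:

. eprobit graduate income i.roommate, extreat(program) endogenous(hsgpa = income i.hscomp)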
The third equation—hsgpa—is the model for our endogenous covariate—high school GPA. Our endogenous covariate model was specified by the endogenous() option:
. eprobit graduate income i.roommate, entreat(program = i.campus income) endogenous(hsgpa = income i.hscomp)
The output for the endogenous covariate equation is
hsgpa         |
       income |    .0429837    .000954    45.06   0.000     .0411139    .0448535
              |
       hscomp |
     moderate |   -.1180259   .0066271   -17.81   0.000    -.1310148    -.105037
         high |   -.2064778   .0104663   -19.73   0.000    -.2269914   -.1859643
              |
        _cons |    2.711822   .0075609   358.66   0.000     2.697003    2.726641
We find that parents' income is positively related to high school GPA. We also find that school competitiveness (hscomp) matters. Students from moderately competitive high schools have lower high school GPAs, and those from highly competitive schools have still lower GPAs. The more difficult the school, the lower the expected GPA.
Taking all the above into account, we now ask, How did participation affect graduation rates? That will be in the graduate equation. Our model is
. eprobit graduate income i.roommate, entreat(program = i.campus income) endogenous(hsgpa = income i.hscomp)
and the output is
graduate      |
      program#|
      c.income|
             0|    .1777645   .0140365    12.66   0.000     .1502534    .2052756
             1|    .2184452   .0181589    12.03   0.000     .1828543    .2540361
              |
     roommate#|
       program|
         yes#0|    .4320001   .0477783     9.04   0.000     .3383564    .5256437
         yes#1|    .3548558   .0546206     6.50   0.000     .2478015    .4619102
              |
      program#|
       c.hsgpa|
             0|    1.860516   .3152604     5.90   0.000     1.242617    2.478415
             1|    1.542167   .3131915     4.92   0.000     .9283226    2.156011
              |
       program|
             0|   -6.567493   .8892133    -7.39   0.000    -8.310319   -4.824667
             1|    -5.18857   .8443761    -6.14   0.000    -6.843517   -3.533623
Look carefully, and you will discover that even though we specified a single probit equation,
. eprobit graduate income i.roommate hsgpa, ...
an interacted-with-choice model was fit, yielding one set of coefficients for program==0 and another for program==1:
Probit probability model of graduation

    Variable        program==0    program==1
    -----------------------------------------
    c.income            0.1778        0.2184
    i.roommate          0.4320        0.3549
    c.hsgpa             1.8605        1.5422
    intercept          -6.5675       -5.1886
    -----------------------------------------
The graduation model was interacted with program because of the entreat(program = ...) option that we specified. When you specify the option, ERMs fit one model for each value of the treatment variable. This way of measuring treatment effects is more robust than when we allow only the intercept to vary across treatments. It is also more difficult to interpret.
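If you prefer the simpler specification in which only the intercept shifts with treatment, entreat() can fit that too. The sketch below uses the nointeract suboption; see the treatment options in the ERM manual for details:

. eprobit graduate income i.roommate, entreat(program = i.campus income, nointeract) endogenous(hsgpa = income i.hscomp)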
Researchers who fit treatment-effect models are often interested in the ATE and the ATET. The ATE is the average treatment effect—the average difference in outcomes between everyone being treated and everyone being untreated. In this case, that difference is a difference in graduation probabilities.
Postestimation command estat teffects reports the ATE:
. estat teffects

Predictive margins                              Number of obs     =      7,127
Model VCE: OIM
--------------------------------------------------------------------------------
              |            Delta-method
              |      Margin   std. err.      z    P>|z|     [95% conf. interval]
--------------+-----------------------------------------------------------------
ATE           |
      program |
     (1 vs 0) |    .1687485   .0510067     3.31   0.001     .0687772    .2687198
--------------------------------------------------------------------------------
The ATE is a 0.1687 increase in the probability of graduation. That is a hefty increase.
The ATE would be relevant if the university could make the program required. With effects this large, the university will want to think about how it could encourage students to enroll.
The fictional university is probably also interested in ATET—the average treatment effect among the treated. This is the average effect for students who self-selected into the program.
estat teffects will also report the ATET if we specify option atet:
. estat teffects, atet

Predictive margins                              Number of obs     =      7,127
Model VCE: OIM                                  Subpop. no. obs   =      3,043
--------------------------------------------------------------------------------
              |            Delta-method
              |      Margin   std. err.      z    P>|z|     [95% conf. interval]
--------------+-----------------------------------------------------------------
ATET          |
      program |
     (1 vs 0) |    .1690133   .0523389     3.23   0.001     .0664311    .2715956
--------------------------------------------------------------------------------
The ATET is a 0.1690 increase in the probability of college graduation.
Are these effects constant? One of the reasons entreat() fits a fully interacted model by default is so that you can evaluate questions like that.
Let's explore the graduation probabilities as a function of parents' income and high school GPA. Our data contain incomegrp and hsgpagrp, which categorize those two variables.
Stata's margins command will handle problems like this. margins reports results in tables. When run after margins, marginsplot shows the same result as a graph. We could type
. margins, over(program incomegrp hsgpagrp)
  (output omitted)
. marginsplot, plot(program) xlabels(0 4 8 12)
The graphs show the expected graduation rates for those who took the program (in red) and those who did not take the program (in blue). The four panels are GPA groups. The x axis of each graph is parents' income.
The program helps those with lower GPAs more and also those with moderately high GPAs from low-income families.
We like these graphs. Many researchers will want to graph the ATE. Here it is:
. margins r.program, over(incomegrp hsgpagrp) predict(fixedasf)
  (output omitted)
. marginsplot, by(hsgpagrp) xlabels(0 4 8 12)
ERMs are documented in their own manual. It covers syntax and usage in detail, a much deeper development of the concepts, the statistical formulation of ERMs, and much more. See the Stata Extended Regression Models Reference Manual.
The Stata Extended Regression Models Reference Manual also demonstrates ERMs with ordered probit models and with interval-measured (interval-censored) outcomes, and it demonstrates other combinations of endogenous(), select(), extreat(), and entreat().
Here are links to examples from the manual that demonstrate specific models:
Linear regression with continuous endogenous covariate
Interval regression with continuous endogenous covariate
Interval regression with endogenous covariate and sample selection
Linear regression with binary endogenous covariate
Linear regression with exogenous treatment
Linear regression with endogenous treatment
Probit regression with continuous endogenous covariate
Probit regression with endogenous covariate and treatment
Probit regression with endogenous sample selection
Probit regression with endogenous treatment and sample selection
Probit regression with endogenous ordinal treatment
Ordered probit regression with endogenous treatment
Ordered probit regression with endogenous treatment and sample selection
Linear regression with endogenous covariate, sample selection, and endogenous treatment
Random-effects regression with continuous endogenous covariate
Random effects in one equation and endogenous covariate
Random effects, endogenous covariate, and endogenous sample selection
Ordered probit regression with endogenous treatment and random effects