Control-function linear and probit models

Highlights

  • Control-function regression

    • Linear models using cfregress

    • Probit models using cfprobit

  • Standard errors that properly account for estimated control functions

  • Linear, probit, fractional probit, and Poisson models allowed in first stage

  • Control functions can be interacted with other variables or with each other

  • Robust, cluster–robust, and heteroskedasticity- and autocorrelation-consistent VCEs allowed

With the new cfregress and cfprobit commands, you can fit control-function linear and probit models, which provide a flexible alternative to traditional instrumental-variables (IV) methods for models with endogenous variables. You can include continuous, binary, fractional, and count endogenous variables. And you can easily test for endogeneity. These commands are part of StataNow™.

Control-function models allow researchers to estimate causal relationships even when some explanatory variables are endogenous. A first-stage model is fit for each endogenous variable, and the residuals from those models are used to form control functions. The control functions are then included as regressors in the main outcome model to account for endogeneity.
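As a rough sketch of the idea, consider one endogenous regressor with a linear first stage. The notation here is ours rather than taken from the Stata manuals, and the linear form of the conditional mean in the third line is an assumption made for illustration:

```latex
\begin{aligned}
y_1 &= \mathbf{x}\boldsymbol{\beta} + \gamma\, y_2 + u
  && \text{(main equation; } y_2 \text{ is endogenous)} \\
y_2 &= \mathbf{z}\boldsymbol{\pi} + v
  && \text{(first stage; } \mathbf{z} \text{ contains } \mathbf{x} \text{ and the instruments)} \\
E(u \mid \mathbf{z}, v) &= \rho\, v
  && \text{(assumed)} \\
y_1 &= \mathbf{x}\boldsymbol{\beta} + \gamma\, y_2 + \rho\, \hat{v} + \text{error},
  \qquad \hat{v} = y_2 - \mathbf{z}\hat{\boldsymbol{\pi}}
  && \text{(estimating equation)}
\end{aligned}
```

Under this assumption, the estimated residual plays the role of the control function, and ρ = 0 corresponds to the regressor being exogenous.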

Researchers often use control-function methods when traditional IV methods cannot accommodate desired model features such as flexible handling of interacted endogenous variables or modeling endogenous binary, fractional, and count variables. The cfregress and cfprobit commands fit control-function models, allow for great flexibility in the interaction and modeling of endogenous variables, and provide standard errors that account for the inclusion of estimated control functions. After fitting the model, you can easily perform tests of endogeneity.

Let's see it work

Say we are interested in modeling the average rental rate (rent) in each US state as a function of average housing values (hsngval) and the proportion of the population living in urban areas (pcturban). Because average housing values are likely to be endogenous, we include a measure of median family income, faminc, and indicator variables for the region in which each state is located, i.region, as instruments for hsngval. For convenience, we rescale hsngval and faminc to be on a scale similar to rent.

With cfregress, we can reproduce the estimates of a two-stage least-squares (2SLS) IV regression. Whereas 2SLS replaces the endogenous variable in the main regression with fitted values from a first-stage regression, control-function regression keeps the endogenous variable and includes the first-stage residual as an additional regressor, called a control function.
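The equivalence of the two sets of point estimates can be checked numerically. The following is a minimal sketch in Python using only the standard library. It uses simulated data rather than the housing data, and the variable names and the small ols() helper are ours, not part of Stata. It illustrates the arithmetic only and makes no attempt to compute the corrected standard errors that cfregress reports.

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def fitted(X, b):
    """Fitted values X*b."""
    return [sum(xi * bi for xi, bi in zip(r, b)) for r in X]

# Simulated data (hypothetical, for illustration only)
random.seed(1)
n = 500
z = [random.gauss(0, 1) for _ in range(n)]        # instrument
x = [random.gauss(0, 1) for _ in range(n)]        # exogenous regressor
v = [random.gauss(0, 1) for _ in range(n)]        # first-stage error
u = [0.8 * vi + random.gauss(0, 1) for vi in v]   # outcome error, correlated with v
w = [1 + zi + 0.5 * xi + vi for zi, xi, vi in zip(z, x, v)]  # endogenous regressor
y = [2 + 1.5 * wi - xi + ui for wi, xi, ui in zip(w, x, u)]  # outcome

# First stage: regress w on a constant, x, and z
Z = [[1.0, xi, zi] for xi, zi in zip(x, z)]
w_hat = fitted(Z, ols(Z, w))
resid = [wi - hi for wi, hi in zip(w, w_hat)]

# 2SLS: replace w with its first-stage fitted values
b_2sls = ols([[1.0, hi, xi] for hi, xi in zip(w_hat, x)], y)

# Control function: keep w and add the first-stage residual
b_cf = ols([[1.0, wi, xi, ri] for wi, xi, ri in zip(w, x, resid)], y)

print("2SLS (const, w, x):    ", b_2sls)
print("CF   (const, w, x, cf):", b_cf)
```

The first three control-function coefficients match the 2SLS coefficients up to rounding error, and the fourth is the coefficient on the control function.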

. webuse hsng
(1980 Census housing data)

. replace hsngval = hsngval/1000
variable hsngval was long now double
(50 real changes made)

. replace faminc = faminc/1000
variable faminc was long now double
(50 real changes made)

. cfregress rent pcturban (hsngval = faminc i.region)

Control-function linear regression                     Number of obs =      50
                                                       Wald chi2(2)  =   90.76
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.5989
                                                       Root MSE      = 22.1656
Endogenous variable model:
    Linear: hsngval

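------------------------------------------------------------------------------
        rent | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
rent         |
     hsngval |   2.239833   .3284392     6.82   0.000     1.596104    2.883562
    pcturban |    .081516   .2987652     0.27   0.785     -.504053     .667085
       _cons |   120.7065   15.22839     7.93   0.000     90.85942    150.5536
-------------+----------------------------------------------------------------
e.rent       |
 cf(hsngval) |  -1.588908   .4333422    -3.67   0.000    -2.438243   -.7395726
------------------------------------------------------------------------------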

The control function is shown as cf(hsngval) because it is the control function generated from the first-stage model of the endogenous variable hsngval. Control functions enter the main equation, but they are listed under e.rent because we consider them part of the model for the error term.

However, we might suspect that the endogeneity in the model depends not just on the control function but on its interaction with faminc. We can include this interaction using the interact() option.

. cfregress rent pcturban (hsngval = faminc i.region, interact(faminc))


Control-function linear regression                     Number of obs =      50
                                                       Wald chi2(2)  =   95.16
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.5945
                                                       Root MSE      = 22.2851
Endogenous variable model:
    Linear: hsngval

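------------------------------------------------------------------------------
        rent | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
rent         |
     hsngval |   2.155381   .3437284     6.27   0.000     1.481686    2.829076
    pcturban |   .4794597   .2362242     2.03   0.042     .0164688    .9424506
       _cons |   98.15909   13.86958     7.08   0.000     70.97521     125.343
-------------+----------------------------------------------------------------
e.rent       |
 cf(hsngval) |   10.66765   3.619442     2.95   0.003     3.573673    17.76163
             |
 cf(hsngval)#|
      faminc |  -.5610651   .1743049    -3.22   0.001    -.9026965   -.2194338
------------------------------------------------------------------------------
Instruments for hsngval: faminc 2.region 3.region 4.region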

The control-function interaction is included here as cf(hsngval)#faminc.

Relative to the first model, there are several changes. We now have evidence that the coefficient on pcturban is different from 0, while the coefficient on hsngval is slightly smaller. We also have evidence that the interaction itself has a coefficient different from 0 and, thus, that the interaction should be included in the model.

A joint test that the coefficients on cf(hsngval) and cf(hsngval)#faminc are both zero amounts to a test of endogeneity, and we can perform this test with the postestimation command estat endogenous.

. estat endogenous

Tests of endogeneity
H0: Variables are exogenous

 ( 1)  [e.rent]cf(hsngval) = 0
 ( 2)  [e.rent]cf(hsngval)#c.faminc = 0

           chi2(  2) =   15.30
         Prob > chi2 =    0.0005

This gives us strong evidence for endogeneity. Here we have used conventional standard errors, but estat endogenous will conduct an appropriate test after estimation even with robust, cluster–robust, and heteroskedasticity- and autocorrelation-consistent standard errors.
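The reported p-value can be checked by hand, because the upper-tail probability of a chi-squared distribution with 2 degrees of freedom has the closed form exp(-x/2). Here is a minimal Python sketch; the function name is ours, and the shortcut applies only to the 2-degrees-of-freedom case.

```python
import math

def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

# Test statistic reported by estat endogenous above
p_value = chi2_upper_tail_2df(15.30)
print(f"{p_value:.4f}")
```

Rounded to four decimal places, the result agrees with the Prob > chi2 value shown in the output.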

Stata's control-function regression commands also allow users to specify nonlinear first-stage models for endogenous binary, fractional, or count variables.

For example, we can estimate the effect of having health insurance (ins) on the log of prescription drug expenditure (lndrug) using marital status (married) and employment status (work) as instruments.

. use https://www.stata-press.com/data/r18/drugexp, clear
(Prescription drug expenditures)

. cfregress lndrug age lninc (ins = married work, probit interact(ins)),
>      mainonly(chron) vce(robust)


Control-function linear regression                     Number of obs =   6,000
                                                       Wald chi2(4)  = 1973.78
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.2432
                                                       Root MSE      =  1.2172
Endogenous variable model:
    Probit: 1.ins

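------------------------------------------------------------------------------
             |               Robust
      lndrug | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
lndrug       |
       1.ins |  -.8598836   .3483648    -2.47   0.014    -1.542666   -.1771011
       chron |   .4671725   .0319731    14.61   0.000     .4045064    .5298387
         age |   .1021359     .00292    34.98   0.000     .0964128    .1078589
       lninc |   .0550672   .0225036     2.45   0.014     .0109609    .0991735
       _cons |   1.665539   .2527527     6.59   0.000     1.170153    2.160925
-------------+----------------------------------------------------------------
e.lndrug     |
     cf(ins) |   .5252243    .226367     2.32   0.020     .0815532    .9688954
 cf(ins)#ins |   .2702095   .2585099     1.05   0.296    -.2364605    .7768796
------------------------------------------------------------------------------
Instruments for 1.ins: married work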

Here we have used the probit option within the parentheses to specify a probit model for our first-stage regression. (If we had multiple sets of parentheses, each first-stage regression could have its own model.) We have again included a control-function interaction, this time with ins itself. We have also used the mainonly() option to include an indicator for a chronic condition, chron, in the main regression but not in the first stage. Finally, we have requested heteroskedasticity-robust standard errors using the vce(robust) option.
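The page does not spell out which residual is used after a probit first stage. One common choice in the control-function literature for a binary endogenous variable is the generalized residual, and the sketch below assumes that form purely for illustration; it should not be read as a description of Stata's internal computation. The function names are ours, and the linear index is taken as already estimated.

```python
import math

def norm_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def generalized_residual(d, index):
    """Generalized residual for one probit observation (assumed form).

    d     : observed 0/1 value of the endogenous variable
    index : fitted first-stage linear index (assumed already estimated)
    """
    if d == 1:
        return norm_pdf(index) / norm_cdf(index)
    return -norm_pdf(index) / (1.0 - norm_cdf(index))

def expected_residual(index):
    """Mean of the generalized residual given the index; zero by construction."""
    p = norm_cdf(index)
    return p * generalized_residual(1, index) + (1.0 - p) * generalized_residual(0, index)

print(generalized_residual(1, 0.0))
print(generalized_residual(0, 0.0))
print(expected_residual(0.3))
```

Like an ordinary regression residual, this quantity has mean zero given the first-stage covariates, which is what lets it stand in for the first-stage error in the main equation.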

As it happens, this regression is equivalent to fitting an endogenous treatment-effects model (see Example 2 in [CAUSAL] etregress).

What if your outcome is binary? The cfprobit command fits control-function models in just the same way except that the model for the main equation is a probit model. Both cfregress and cfprobit allow users the flexibility to specify a large class of models where one or more explanatory variables are endogenous.

Tell me more

Read more about control-function regression methods in the Stata Base Reference Manual; see [R] cfregress and [R] cfprobit.
