FAQ: Two-stage least-squares regression


Must I use all of my exogenous variables as instruments when estimating instrumental variables regression?

Title		Two-stage least-squares regression
Author		Vince Wiggins, StataCorp

Note: This model could also be fit with sem, using maximum likelihood instead of a two-step method.
You can find examples for recursive models fit with sem in the “Structural models: Dependencies between response variables” section of [SEM] intro 5 — Tour of models.

Someone posed the following question:

I am estimating an equation:

\(Y = a + bX + cZ + dW\)

I then want to instrument \(W\) with \(Q\). I know the first-stage regression is supposed to be

\(W = e + fX + gZ + hQ\)

(that is, use all the exogenous variables in the first stage). This is done automatically if I use the ivregress command. However, I want to use only \(Q\) to instrument \(W\), without using \(X\) and \(Z\) in the first stage. Is there a way I can do it in Stata? I can regress \(W\) on \(Q\), get the predicted \(W\), and then use it in the second-stage regression. The standard errors will, however, be incorrect.

ivregress will not let you do this and, moreover, if you believe \(W\) to be endogenous because it is part of a system, then you must include \(X\) and \(Z\) as instruments, or you will get biased estimates for b, c, and d.
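
As a minimal sketch of the standard syntax, using the variable names from the question (Y, X, Z, W, and Q are assumed here to be variables in memory), the exogenous regressors listed outside the parentheses are added to the instrument list automatically, so the first stage regresses W on Q, X, and Z:

. ivregress 2sls Y X Z (W = Q)

Adding the first option (ivregress 2sls Y X Z (W = Q), first) should also display the first-stage regression, which makes it easy to confirm which variables are being used as instruments.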

Consider the system

\(Y1 = a0 + a1*Y2 + a2*X1 + a3*X2 + e1 \qquad\qquad\qquad\qquad\qquad \text{(1)}\)

\(Y2 = b0 + b1*Y1 + b2*X3 + b3*X4 + e2 \qquad\qquad\qquad\qquad\qquad\:\: \text{(2)}\)

Warning: Assume we are estimating structural equation (1); if \(X1\) and \(X2\) are exogenous, then they must be kept as instruments or your estimates will be biased. In a general system, such exogenous variables must be used as instruments for any endogenous variables when the instrumented value for the endogenous variables appears in an equation in which the exogenous variable also appears.

Consider the reduced forms of your two equations:

\(Y1 = e0 + e1*X1 + e2*X2 + e3*X3 + e4*X4 + u1 \qquad\qquad\qquad\: \text{(1r)}\)

\(Y2 = f0 + f1*X1 + f2*X2 + f3*X3 + f4*X4 + u2 \qquad\qquad\quad\:\:\; \text{(2r)}\)

where \(e\#\) and \(f\#\) are combinations of the \(a\#\) and \(b\#\) coefficients from (1) and (2) and \(u1\) and \(u2\) are linear combinations of \(e1\) and \(e2\).
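
To see where these combinations come from, here is a sketch of the algebra for (2r). It assumes \(a1*b1 \neq 1\), a condition not stated above, and in these lines \(e1\) and \(e2\) denote the structural disturbances from (1) and (2). Substituting (1) into (2) gives

\(Y2 = b0 + b1*(a0 + a1*Y2 + a2*X1 + a3*X2 + e1) + b2*X3 + b3*X4 + e2\)

Collecting the \(Y2\) terms on the left,

\((1 - a1*b1)*Y2 = (b0 + a0*b1) + a2*b1*X1 + a3*b1*X2 + b2*X3 + b3*X4 + (b1*e1 + e2)\)

so that

\(f0 = \frac{b0 + a0*b1}{1 - a1*b1}, \quad f1 = \frac{a2*b1}{1 - a1*b1}, \quad f2 = \frac{a3*b1}{1 - a1*b1}, \quad f3 = \frac{b2}{1 - a1*b1}, \quad f4 = \frac{b3}{1 - a1*b1}\)

\(u2 = \frac{b1*e1 + e2}{1 - a1*b1}\)

In particular, \(f1\) and \(f2\) are nonzero whenever \(b1\), \(a2\), and \(a3\) are nonzero. Substituting (2) into (1) in the same way yields the coefficients of (1r).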

All exogenous variables appear in each equation for an endogenous variable. This is the nature of simultaneous systems, so efficiency argues that all exogenous variables be included as instruments for each endogenous variable.

Here is the real problem. Take (1): the reduced-form equation for \(Y2\), (2r), clearly shows that \(Y2\) is correlated with \(X2\) (by the coefficient \(f2\)). If we do not include \(X2\) among the instruments for \(Y2\), then we will have failed to account for the correlation of \(Y2\) with \(X2\) in its instrumented values. Since we did not account for this correlation, when we estimate (1) with the instrumented values for \(Y2\), the coefficient \(a3\) will be forced to account for this correlation. This approach will lead to biased estimates of both \(a1\) and \(a3\).

For a brief reference, see Baltagi (2011). See the whole discussion of 2SLS, particularly the paragraph after equation 11.40, on page 265. (I have no idea why this issue is not emphasized in more books.)

Failing to include \(X4\) affects only efficiency and not bias.

However, there is one case where it is not necessary to include \(X1\) and \(X2\) as instruments for \(Y2\): when the system is triangular, such that \(Y2\) does not depend on \(Y1\), but you believe \(Y2\) is weakly endogenous because the disturbances are correlated between the equations. Doing what ivregress does and retaining \(X1\) and \(X2\) as instruments still gives consistent estimates; they are, however, no longer required. In that case, you could do what you suggested and just regress on the predicted values from the first stage.
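
A short sketch of why the triangular case is different: if \(Y2\) does not depend on \(Y1\), then \(b1 = 0\) in (2), and (2) is already its own reduced form,

\(Y2 = b0 + b2*X3 + b3*X4 + e2\)

\(X1\) and \(X2\) do not appear in this equation, so leaving them out of the first stage omits nothing that determines \(Y2\). \(Y2\) is still endogenous in (1) whenever \(e1\) and \(e2\) are correlated, which is why an instrument is still needed.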

If you do use this method of indirect least squares, you will have to perform the adjustment to the covariance matrix yourself. Consider the structural equation

\(y1 = y2 + x1 + e\)

where you have an instrument \(z1\) and you do not think that \(y2\) is a function of \(y1\).

The following example uses only \(z1\) as an instrument for \(y2\). Let's begin by creating a dataset (containing made-up data) on \(y1\), \(y2\), \(x1\), and \(z1\):

. sysuse auto 
(1978 automobile data)

. rename price y1
. rename mpg y2
. rename displacement z1
. rename turn x1

Now we perform the first-stage regression and get predictions for the instrumented variable, which we must do for each endogenous right-hand-side variable.

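. regress y2 z1

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =     71.41
       Model |  1216.67534         1  1216.67534   Prob > F        =    0.0000
    Residual |  1226.78412        72  17.0386683   R-squared       =    0.4979
-------------+----------------------------------   Adj R-squared   =    0.4910
       Total |  2443.45946        73  33.4720474   Root MSE        =    4.1278

------------------------------------------------------------------------------
          y2 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          z1 |  -.0444536   .0052606    -8.45   0.000    -.0549405   -.0339668
       _cons |   30.06788   1.143462    26.30   0.000     27.78843    32.34733
------------------------------------------------------------------------------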



. predict double y2hat
(option xb assumed; fitted values)

  * perform IV regression 

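. regress y1 y2hat x1

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     12.41
       Model |   164538571         2  82269285.5   Prob > F        =    0.0000
    Residual |   470526825        71  6627138.38   R-squared       =    0.2591
-------------+----------------------------------   Adj R-squared   =    0.2382
       Total |   635065396        73  8699525.97   Root MSE        =    2574.3

------------------------------------------------------------------------------
          y1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       y2hat |  -463.4688    117.187    -3.95   0.000    -697.1329   -229.8046
          x1 |  -126.4979   108.7468    -1.16   0.249    -343.3328    90.33697
       _cons |   21051.36   6451.837     3.26   0.002     8186.762    33915.96
------------------------------------------------------------------------------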

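Now we correct the variance–covariance matrix by applying the correct mean squared error, which must be computed from residuals that use the actual \(y2\) rather than its first-stage prediction: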

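. rename y2hat y2hold                   /* set the fitted values aside */
. rename y2 y2hat                       /* real y2 takes the name used in the regression */
. predict double res, residual          /* residuals now use the real y2 */
. rename y2hat y2                       /* put back real y2 */
. rename y2hold y2hat                   /* put back the fitted values */
. replace res = res^2
(74 real changes made)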
 
. summarize res

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         res |         74     7553657    1.43e+07   117.4375   1.06e+08


. scalar realmse = r(mean)*r(N)/e(df_r) 
                                  /* much ado about small sample */
. matrix bmatrix = e(b)
. matrix Vmatrix = e(V)
. matrix Vmatrix = e(V) * realmse / e(rmse)^2
. ereturn post bmatrix Vmatrix, noclear
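. ereturn display

------------------------------------------------------------------------------
          y1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       y2hat |  -463.4688   127.7267    -3.63   0.001    -718.1485    -208.789
          x1 |  -126.4979   118.5274    -1.07   0.289    -362.8348    109.8389
       _cons |   21051.36   7032.111     2.99   0.004      7029.73    35072.99
------------------------------------------------------------------------------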
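
As a rough check on the adjustment, the commands above rescale the second-stage covariance matrix by the ratio of two mean squared errors. The symbols \(V_{2}\), \(s^2_{real}\), and \(s^2_{2}\) are labels introduced here for illustration only:

\(V_{corrected} = V_{2} * s^2_{real} / s^2_{2}\)

where \(V_{2}\) and \(s^2_{2}\) are the covariance matrix and residual mean square from the second-stage regression, and \(s^2_{real}\) is the sum of squared residuals computed with the actual \(y2\), divided by the residual degrees of freedom. Using the rounded values displayed in the output,

\(s^2_{real} \approx 7553657 * 74 / 71 \approx 7872826\)

\(\sqrt{s^2_{real} / s^2_{2}} \approx \sqrt{7872826 / 6627138.38} \approx 1.08994\)

\(117.187 * 1.08994 \approx 127.73\)

which agrees, up to rounding, with the corrected standard error of 127.7267 reported for y2hat. Every standard error is scaled by the same factor, so the coefficients themselves are unchanged.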

Reference

Baltagi, B. H. 2011. Econometrics. New York: Springer.
