The Stata listserver

RE: st: negative R2 in 2SLS


From   "David M. Drukker, Stata Corp" <[email protected]>
To   [email protected]
Subject   RE: st: negative R2 in 2SLS
Date   Tue, 14 Jan 2003 09:45:22 -0600

John Hendrickx <[email protected]> wrote that he had some doubts
about the FAQ http://www.stata.com/support/faqs/stat/2sls.html


Short Answer:

In essence, the FAQ in question explains why observing a negative R-squared
(R2) after estimating the parameters of a model via two-stage least squares
(2SLS) using -ivreg- does not necessarily indicate model misspecification.
John questions this result.  In the Long Answer below, I provide simulation
evidence that the FAQ is indeed correct.


Long Answer:

John begins with a nice summary of the issue and a summary of the FAQ.

> I'm interested in comments and advice on how to interpret the
> parameters of a two stage least squares (2SLS) model with a negative
> R2. The problem is discussed on the website of Stata at
> http://www.stata.com/support/faqs/stat/2sls.html

> To summarize for myself, 2SLS uses instrumental variables to model
> the effects of righthand side endogenous variables. These
> instruments are the values of the endogenous variables as predicted
> by the exogenous variables in the model. When these instruments are
> replaced by the endogenous variables themselves, the predicted
> values can in some cases be way off, so much so that the residual SS
> is greater than the total SS. This would mean that the model SS is
> negative and hence that the R2 of the model is negative. This can
> happen even though the model contains strong and significant
> effects.

> The faq referred to above states that a negative R2 need not be a
> problem and that parameters can be safely interpreted if they are
> significant with reasonably small standard errors: "What does it
> mean when RSS is greater than TSS? Does this mean our parameter
> estimates are no good? Not really.  You can easily develop
> simulations where the parameter estimates from two-stage are quite
> good while the MSS is negative. Remember why we estimate two-stage
> models. We are interested in the parameters of the structural
> equation: the elasticity of demand, the marginal propensity to
> consume, etc. If our two-stage model produces estimates of these
> parameters with acceptable standard errors, we should be happy
> regardless of MSS or R2.  If we were strictly interested in
> projections of the dependent variable, then we should probably
> consider the reduced form of the model."

John then raises his doubts.  In particular he writes, 

> My take would be that the model fits the data very poorly and that
> the estimates should be regarded with extreme suspicion. This is
> generally the advice for maximum likelihood models, only interpret
> parameters of a model that fits the data well. A negative R2 would
> mean that the model was misspecified and should not be
> interpreted.  Comments? And could anything be inferred about the
> nature of the misspecification?


There are a number of ways of illustrating that the FAQ is correct.  Perhaps
the most accessible is via simulation.  I interpret the claim of the FAQ to
be that there are models in which the distribution of the 2SLS estimates of
the parameters will be well approximated by its theoretical distribution but
in which the R2 computed from some samples will be negative.

I simulate data from the model

(1)	y = 1 - .1*x + e1 + e2
(2)	x = w + c1 + .5*e1
(3)	z = 1.5*c1 + e3

where e1, e2, e3, w, and c1 are independent standard normal random
variables.  The c1 term in equations (2) and (3) provides the correlation
between x and z.  The e1 term in equations (1) and (2) is the source of the
correlation between x and the error term (e1 + e2) for y.  The coefficient
of -.1 is the parameter that we are trying to estimate.  We are going to
estimate this parameter via 2SLS using -ivreg- with y as the dependent
variable, x as the endogenous variable, and z as the instrument for x.  In
other words, for each simulated sample we construct y, x, and z using
independent draws of the standard normal variables e1, e2, e3, w, and c1 and
equations (1)-(3).  Then we use

. ivreg y (x = z)

to estimate the coefficient -.1.  For each simulated sample we record the
following statistics.

	b1 		the estimate of the coefficient (-.1)
	p 		the p-value for the test of the null hypothesis that b1 = -.1
	reject  	is one if p<.05 and 0 otherwise
	r2		the computed R2 (missing if mss < 0)
	mss     	the value of the model sum of squares 
	rho_x1e 	the correlation between x1 and e=e1+e2
	rho_x1z1	the correlation between x1 and z1
	fsf		the first stage F-statistic
	p_fsf		the p-value from the first stage F-statistic
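
For a single simulated sample, the link between mss and the residual and
total sums of squares can be checked by hand.  The following is only a
sketch: it assumes the same data-generating steps as the do-file below my
signature, and the names res, res2, rss, tss, and mss are introduced here for
illustration.  The point is that the residuals after -ivreg- are computed
with the actual x1, not its first-stage prediction, so nothing forces RSS to
be smaller than TSS.

```stata
* one simulated sample (illustrative; same steps as in negr2.do)
clear
set seed 123456
set obs 1000
gen c1 = invnorm(uniform())
gen z1 = invnorm(uniform()) + 1.5*c1
gen e1 = invnorm(uniform())
gen e2 = invnorm(uniform())
gen x1 = invnorm(uniform()) + c1 + .5*e1
gen y  = 1 - .1*x1 + e1 + e2

quietly ivreg y (x1 = z1)

* RSS: sum of squared residuals, where the residuals use the actual x1
predict double res, residuals
gen double res2 = res^2
quietly summarize res2
scalar rss = r(sum)

* TSS: variation of y around its sample mean
quietly summarize y
scalar tss = r(Var)*(r(N) - 1)

* MSS is whatever is left over, and it may be negative
scalar mss = tss - rss
display "RSS = " rss "   TSS = " tss
display "TSS - RSS = " mss "   e(mss) = " e(mss)
```

If the hand-computed difference agrees with e(mss), a negative value simply
means that RSS exceeded TSS in that sample.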


Below my signature is the Stata code for drawing 2,000 simulations of this
model, estimating the coefficient -.1, computing the statistics of interest
and finally summarizing the results.  Each simulated sample contains 1,000
observations, so the results should not be attributed to a small sample
size.

Here is what I obtained when I used -summarize- to look at the results.


. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          b1 |      2000   -.1025507     .054484   -.361239   .0578525
           p |      2000    .4951122    .2863575   .0000638   .9994608
      reject |      2000         .05    .2179995          0          1
          r2 |        48    .0053981    .0050849   .0001775   .0205909
         mss |      2000   -82.03539    49.51527  -317.1851   37.13932
-------------+--------------------------------------------------------
     rho_x1e |      2000    .2344962    .0302909   .1359878    .325926
    rho_x1z1 |      2000    .5544141    .0222491    .483774   .6284751
         fsf |      2000    445.7681    51.80968   304.9355   651.5349
       p_fsf |      2000    1.63e-34    2.24e-33          0   7.55e-32


The results for rho_x1e, rho_x1z1, fsf, p_fsf indicate that the correlations
between the endogenous variable and the error term and between the
endogenous variable and its instrument are reasonable and that there is no
weak instrument problem.  The results for b1, p, and reject indicate that the
mean estimate of the coefficient on x is very close to its true value of -.1
and that there is no size distortion in the test that the coefficient on x
equals -.1.  In short, the distribution of the estimates, b1, is very well
approximated by its theoretical asymptotic distribution.  Together, these
results imply that the 2SLS estimator is performing according to the
theory in these simulations.

Now note that there are only 48 observations on r2.  This is because there
are 1,952 observations in which mss < 0.

. count if mss < 0
 1952
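
A rough large-sample calculation, offered as a sketch rather than as part of
the simulation evidence, suggests why mss is negative so often in this
design.  It assumes that x is generated as in the do-file below my signature
(x1 = w + c1 + .5*e1), writes e = e1 + e2 and beta for the true coefficient
on x, and ignores sampling error in the 2SLS estimate.

```latex
\begin{align*}
\operatorname{Var}(x) &= 1 + 1 + 0.25 = 2.25, \qquad
\operatorname{Cov}(x, e) = 0.5, \qquad
\operatorname{Var}(e) = 2, \\
\frac{\mathrm{TSS}}{N} &\approx \operatorname{Var}(y)
  = \beta^2 \operatorname{Var}(x) + 2\beta \operatorname{Cov}(x, e)
    + \operatorname{Var}(e), \\
\frac{\mathrm{RSS}}{N} &\approx \operatorname{Var}(e), \\
\frac{\mathrm{MSS}}{N} &= \frac{\mathrm{TSS} - \mathrm{RSS}}{N}
  \approx 2.25\,\beta^2 + \beta
  = 0.0225 - 0.1 = -0.0775 \quad \text{at } \beta = -0.1 .
\end{align*}
```

With N = 1,000 this gives an mss of roughly -77.5, the same order as the mean
of -82.0 reported by -summarize- above.  The covariance term
2*beta*Cov(x, e) is negative and outweighs beta^2*Var(x), so y varies less
than the structural error does.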

Thus, the results illustrate that there is at least one model for which the
distribution of the 2SLS estimates of the parameters is very well
approximated by its asymptotic distribution, but that the R2 will be
negative in most of the individual samples.  To obtain more models that
produce the same qualitative results, simply change the coefficient -.1 by a
small amount.  As one would expect, increasing the coefficient -.1 reduces
the fraction of the simulated samples that produce a negative R2.
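
One way to experiment with other coefficient values is sketched below.  The
program name negshare, its two arguments, and the coefficient values in the
loop are hypothetical choices made for illustration; they are not part of
negr2.do.  The sketch uses the same data-generating steps as the do-file but
only 200 replications per value, so the shares it reports are rough.

```stata
* Hypothetical helper: share of simulated samples with MSS < 0
* for a true coefficient b, over reps replications.
* Note: it clears the data in memory.
capture program drop negshare
program define negshare, rclass
	args b reps
	local neg = 0
	forvalues i = 1/`reps' {
		quietly {
			drop _all
			set obs 1000
			gen c1 = invnorm(uniform())
			gen z1 = invnorm(uniform()) + 1.5*c1
			gen e1 = invnorm(uniform())
			gen e2 = invnorm(uniform())
			gen x1 = invnorm(uniform()) + c1 + .5*e1
			gen y  = 1 + `b'*x1 + e1 + e2
			ivreg y (x1 = z1)
		}
		* count the replications in which RSS exceeded TSS
		if e(mss) < 0 {
			local neg = `neg' + 1
		}
	}
	return scalar share = `neg'/`reps'
end

set seed 123456
foreach b in -.1 -.3 -.6 {
	negshare `b' 200
	display "b = `b'   share of samples with MSS < 0 = " r(share)
}
```

Raising the number of replications gives steadier shares at the cost of a
longer run.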

I hope that this helps.

	--David
	[email protected]



------------------------------begin negr2.do--------------------------------
clear
set obs 1000
gen keep = 1

set seed 123456


postfile results b1 p reject r2 mss rho_x1e rho_x1z1 fsf p_fsf using ivr2sim , replace

forvalues i = 1/2000 {
	
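	* start each replication from fresh draws
	qui capture drop c1 z1 y e2 e x1 e1
	
	* instrument: correlated with x1 through the common component c1
	gen c1 = invnorm(uniform()) 
	gen z1 = invnorm(uniform()) + 1.5*c1

	gen e1 = invnorm(uniform()) 
	gen e2 = invnorm(uniform()) 

	* endogenous regressor: shares c1 with z1 and e1 with the error in y
	gen x1 = invnorm(uniform()) + c1  + .5*e1

	* structural error for y
	gen e = e1 +e2 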

	qui corr x1 e	
	scalar rho_x1e = r(rho)
	
	qui corr x1 z1
	scalar rho_x1z1 = r(rho)

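	* first-stage regression; take the F test's degrees of freedom
	* (1 and 998 with 1,000 observations) from the regression itself
	qui reg x1 z1
	scalar fsf = e(F)
	scalar p_fsf = Ftail(e(df_m),e(df_r),fsf)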

	qui gen y = 1 - .1*x1 + e 

	qui ivreg y (x1=z1)

	qui test x1 = -.1 
	local reject = (r(p) < .05) 

	post results (_b[x1]) ( r(p)) (`reject') (e(r2)) (e(mss)) /*
		*/ (rho_x1e) (rho_x1z1) (fsf) (p_fsf)
}	

* close the postfile so that ivr2sim.dta is fully written before it is used
postclose results

clear 
use ivr2sim

sum
count if mss < 0

--------------------------------end negr2.do------------------------

