NOTE: We tried sending this post yesterday, but it turns out to be larger
than the limits set for statalist's majordomo. We are sending it today
in two pieces.
This is the second piece. If you have not yet read or received the
first piece, wait and read it first.
+------+
|Part 2|
+------+
Continuation of post about Joachim Wagner <[email protected]> questions
regarding McCullough and Vinod's AER paper.
---------------------------
Preface to long answer to 1
---------------------------
First, let me note that we do not believe that Stata was among the packages
used in the article. The author's are understandably maintaining anonymity of
the packages and that includes not divulging who is not in the study. They
will, however, graciously supply replication datasets and the code used to
produce the published results. From those we have coded a user-written
maximum likelihood (-ml-) estimator for the custom estimator problem presented
in the paper. Stata's behaviour does not match any of the reported
behaviours, though it is closest to the reported "gold standard". From this
we infer that Stata was probably not among the tested packages.
------------------------------------------
Longer answer to 1 -- the probit estimator
------------------------------------------
Madalla's data exhibits a very common "problem" with binary outcome data --
the outcome (d1) is always "success" (d1==1) whenever one of the indicator
regressors (d2) is 0. This means that d2==0 perfectly predicts "success"
(d1==1).
To capture this, the probit estimator must drive the coefficient on d2 to
infinity, or something large relative to the other coefficients, so as to
override the effect of all the other regressors. Once the coefficient on d2
gets that large, the observation has no effect whatsoever on the likelihood
and thus the coefficient becomes irrelevant. This can cause problems for
maximization algorithms.
Operationally, it is not as problematic as it sounds, but it is better to be
safe and identify "perfect predictors" up front and remove them and their data
from the estimation. Yes, removing them is the right thing to do. If the
maximizer drives their coefficient to where it needs to be to maximize the
likelihood those observations will have a probability of 1 and contribute
nothing to the likelihood. Put another way, there is no statistical
information, at least w.r.t. the likelihood, in these observations. Their
contribution to the likelihood can be replaced by the simple rule, if d1==0
then d2=1. This is exactly what Stata does and has done for over 12 years.
(To be complete, it is possible to set up permutation tests for coefficients of
"perfect predictors", but this is different and outside the scope of maximum
likelihood estimation.)
Here are the results of running -probit- in Stata on the data from Madalla.
------------------------- Begin Stata probit results ------------------------
. probit d1 t y lf nw d2
note: d2 != 0 predicts success perfectly
d2 dropped and 15 obs not used
Iteration 0: log likelihood = -17.961912
Iteration 1: log likelihood = -12.039252
Iteration 2: log likelihood = -9.4820411
Iteration 3: log likelihood = -8.7547721
Iteration 4: log likelihood = -8.6442841
Iteration 5: log likelihood = -8.6403866
Iteration 6: log likelihood = -8.6403799
Probit estimates Number of obs = 29
LR chi2(4) = 18.64
Prob > chi2 = 0.0009
Log likelihood = -8.6403799 Pseudo R2 = 0.5190
------------------------------------------------------------------------------
d1 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t | .0113124 .0056558 2.00 0.045 .0002272 .0223975
y | 6.461118 3.148348 2.05 0.040 .2904686 12.63177
lf | -.4092877 .2572157 -1.59 0.112 -.9134213 .0948459
nw | 42.49827 20.74973 2.05 0.041 1.829544 83.16699
_cons | 6.915517 11.3225 0.61 0.541 -15.27617 29.10721
------------------------------------------------------------------------------
note: 0 failures and 1 success completely determined.
------------------------- End Stata probit results --------------------------
Note particularly the first message after the -probit- command. To be clear,
including those 15 observations and having a large coefficient on d2 would
produce EXACTLY the same likelihood and estimates of the other coefficients.
It is just more numerically stable to exclude the d2 regressor and 15
perfectly predicted observations, and that is what Stata does.
The short story is -- Stata takes care of almost all problems in the data for
all official estimators.
As a side note, we would like to at least partially defend the "inaccurate"
results of the other estimation packages that did not always exclude the d2
regressor and perfectly predicted observations. We have already noted that
the coefficient for d2 cannot be estimated by ML so what is important
statistically is that (1) the other coefficients and their SEs are estimated
well and that (2) we are not misled as to the effect of d2. McCullough and
Vinod report Stokes as saying that other than the regressor for d2 all of the
packages agreed completely on the estimated coefficients and their standard
errors, so item (1) is met -- our inferences about these coefficients are NOT
affected by the whether d2 is included. They also report that the coefficient
estimates for d2 are large (ranging from 4.4 to 8.1) and that the standard
errors are extremely large (ranging from 46 to 114.550). Looking specifically
at the results reported in Madalla's text, the coefficient estimate is 4.63
and the SE is about 115. Note that the SEs are so large that we would never
infer that coefficient estimate is significantly different from 0, yet the
coefficient estimates are VERY large. Why do we say 4.63 is large? Remember
that with probit we are normalized to a standard normal distribution for the
latent variable and 4.63 is far into the tail of the standard normal, so 4.63
is large. If we were to look carefully at Madalla's estimates we would see
that the point estimate of d2 is so large as to dominate the other regressors,
but that we have no confidence in that estimate. So, we are also not likely
to make invalid inference about the "perfect predictor" variable, d2, and thus
item (2) is also met. Even so, the estimator is more stable when handled the
way Stata's -probit- estimator works, particularly if there are many "perfect
predictor" variables.
Scott Merryman <[email protected]> posted much more information from Stokes
paper. While Stokes acknowledges Stata identifies the problem, Scott notes
that Stokes finds our correction for "perfect predictor"s and correct
estimation of the model 'strange and confusing'. If you followed the argument
above, you should not find it confusing at all. Quoting Stokes,
> [...]
> All 15 such cases were detected by Stata. What is strange is that if
> Stata really believed its own message, it would have either reported
> nothing (the best decision) or reported the subset model in equation
> (2). Just not reporting the D2 coefficient but leaving all other
> coefficients in the model is a strange and confusing choice.
> [...]
We think Stokes finds Stata's ability to estimate the model confusing because
of the common misconception that the probit model cannot be estimated by ML on
such data. It can. Well, let's qualify that a bit, if you believe that
infinity should be excluded from the solution set for any ML parameter
estimate then perhaps technically what -probit- produces is not exaclty an ML
estimate. Be assured, however, that the coefficients reported by Stata are
consistent and the coverage rates are as good as those from probit estimates
on data without perfect predictors. Stokes might have felt better if he had
seen the 3 pages of [R] probit dedicated to this topic.
----------------------------------------------------------------------
OK, we feel better about probit, what about other official estimators?
----------------------------------------------------------------------
The developers at Stata are always interested in improving the convergence and
stability of our official estimators. That is one reason why we were so
interested in McCullough and Vinod's paper and in particular what went wrong
with the user-programmed ML estimator. We asserted earlier that the official
estimators are stable and robust to virtually all datasets. Why are we so
confident?
First, the developers at Stata have a lot of experience implementing ML
estimators and other estimators requiring nonlinear optimization. We can
often identify potential problem areas just by looking carefully at the
likelihood or by trying examples that we might expect to be deceptive to the
optimizer. Second, Stata users estimate literally millions of models and in
the rare cases where they see unusual results, they report those to us.
Sometimes there are datasets that produce degenerate likelihood surfaces and
the structure that leads to the degenerate surface cannot be specifically
identified before estimation. In almost all such cases, it is clear that
there is a problem -- either standard errors are missing, or there are
messages near the end of convergence about the function being non-concave. We
use such examples to improve pre-evaluation of the data, as with "perfect
predictors" in -probit-, improve the manual entry describing the estimator, or
to create FAQs explaining what causes such cases.
Here are just a few of the things done by official estimators to ensure
stability and robustness:
1) Remove collinear variables. Done by all estimators.
2) Identify perfect predictors. Done by -logit- and -probit-
3) Get good starting values, because far enough from the solution that
maximizes the criterion (usually log-likelihood) almost all
response surfaces are dominated by "bumps and wiggles" resulting
from the limited precision of digital computers. Done by all
estimators.
4) Estimate the model in a transformed metric that is either more
numerically stable and/or that constrains parameters to required
ranges (e.g. positive variances, correlations between -1 and 1,
etc.). Done by -intreg-, -heckman-, and many others.
5) Estimate the model using a concentrated log-likelihood. Done by
-boxcox-, -svar-, and others.
6) Identify constant within panel covariates when they cannot be
estimated by panel-data estimators. Done by fixed-effects and
first-differenced estimators.
7) Perform only likelihood ratio test statistics when the Wald
statistics are unreliable for the estimator. Done by -boxcox-.
8) Trap singular estimates of the covariance matrix of disturbances.
Done by -reg3- , -sureg-, and others.
9) Use alternate optimization methods, some specific to the
estimator. Done by -sureg-, -reg3-, -xtgls-, and others.
10) Adopt new convergence criteria to prevent premature convergence.
Done by -arch-, -arima-, and others.
There are many others.
-- Vince -- David
[email protected] [email protected]
References
Stokes, Houston. 2003. On the advantage of using two or more econometric
software systems to solve the same problem. Journal of economic and social
(forthcoming).
<end>
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/