st: Binomial regression
Good day to all.
I would like to share some observations on binomial regression using
Stata (and SAS).
Essentially, the problem is that we have a binary outcome Y (0/1) and
want to model its probability p = Pr(Y = 1) as a function of covariates
(X1, X2, etc.) via the generalized linear model with identity link
p = b0 + b1*X1 + b2*X2 + ...
(Unlike the logit link, the identity link does not constrain p to the
[0,1] interval, which is at the root of the problems described below.)
So far, I've done such projects in Stata, and I've noted that binomial
regression sometimes fails to converge and/or gives ML estimates of the
coefficients that yield estimated probabilities outside the [0,1] range
(typically for a modest number of observations).
A colleague suggested that SAS might give better results, and I was
skeptical at first. But I had some time on my hands (it's summer after
all!), so I took a few datasets and ran them in Stata (-glm-) and SAS
(Proc Genmod).
The Stata (8.2) implementation I used is the default ML (Newton-Raphson)
algorithm
- glm y x1 x2 ..., fam(bin) link(i) search
or sometimes
- glm y x1 x2 ..., fam(bin) link(i) search fisher(#)
where # is the number of initial iterations that use the Fisher scoring
(expected information) Hessian before the algorithm switches to the
observed information matrix.
The SAS (9.1) implementation I used is
proc genmod;
model y = x1 x2 ... / dist=bin link=id type3 wald itprint;
run;
[I also played with specification of starting values, but I will leave
this piece out, and confine my posting to results obtained using the
default starting values in both packages.]
Here's what I've found:
(1) Convergence
Stata often gets bogged down ("backed up") after a few iterations and
does not converge.
Specifying Fisher scoring for some initial iterations helps: once
Newton-Raphson takes over from Fisher scoring, it occasionally
converges. Most often, though, I have to use Fisher scoring throughout
to get convergence. But see point #3 below.
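(In case it is useful, here is a minimal sketch of the two ways I have
in mind, assuming the -fisher()- and -irls- options of -glm- behave as
documented; the 50 is an arbitrary iteration count:
- glm y x1 x2 ..., fam(bin) link(i) search fisher(50)
- glm y x1 x2 ..., fam(bin) link(i) irls
The first uses the expected-information (Fisher scoring) Hessian for up
to 50 iterations before switching to the observed information; the
second requests IRLS, which for a GLM amounts to Fisher scoring
throughout.)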
SAS, for its part, often does converge (on the basis of
parameter-vector convergence), but it also warns that the "relative
Hessian convergence criterion" has not been achieved and that
"convergence is questionable" (indicating that the likelihood has not
really converged sufficiently).
(2) Likelihood of final model
The log-likelihood of the final Stata model is often somewhat better
than that of the final SAS model. This might suggest that the Stata
results are "better". However, see the drawback in point #4 below.
(3) Estimated coefficients and standard errors
Naturally, when Stata and SAS give different final models, their
estimated coefficients are different.
But beware of using Fisher scoring throughout to get convergence and a
final model. Sometimes, this final model will have absurdly small
standard errors (with p < 0.001 for all variables). If something like
this happens, it might be useful to compute standard errors from the
outer product of the gradients instead, via the -opg- option:
- glm y x1 x2 ..., fam(bin) link(i) search fisher(#) opg
[There are special complications when there are covariate levels that
have observed probability of 0 or 1 (ie, all observations are "0s" or
"1s"), but I'll leave this issue aside.]
(4) Estimated probabilities
When Stata has convergence trouble (and sometimes when it does not), it
warns that some "parameter estimates produce inadmissible mean estimates
in one or more observations."
SAS gives no such warnings.
When I compute the predicted probabilities from both the SAS and the
Stata final models (which often are not the same, as I explained above),
SAS almost always has them in the [0,1] interval but Stata has quite a
few of them outside the interval.
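(For anyone who wants to check this in their own data, here is a
minimal sketch of what I do in Stata; -phat- is just a variable name I
made up. After fitting the model with -glm-:
- predict phat, mu
- count if phat < 0 | phat > 1
- summarize phat
The -mu- option of -predict- returns the fitted mean of y, which for
this model is the fitted probability.)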
I most recently ran 7 different models in a dataset (7 different
outcomes, each with the same 4 predictor variables). My total N was 278.
For 2 outcomes, SAS and Stata gave identical results (with all predicted
probabilities in the [0,1] range).
For the remaining 5 outcomes, SAS always gave a final model (although
with the warning regarding the Hessian convergence).
Stata w/ NR did not converge after 50 iterations.
Stata w/ FS did converge in 4 of the 5 cases (after 6-20 iterations).
In 4 of the 5 cases, the final model of Stata had a somewhat better
log-likelihood than the final model of SAS.
But:
(i) Although SAS's estimated standard errors appeared sensible, Stata's
(w/ FS) were clearly too small in 4 of the 5 cases.
(ii) The final SAS models never yielded a predicted probability less
than 0 or greater than 1. In contrast, Stata's models yielded
probabilities outside the [0,1] interval for 5-15% of the observations.
I've done this sort of comparison with a couple of other datasets, with
similar results. I know this is not proof positive (as opposed to a
carefully constructed simulation), but I conclude that Stata's algorithm
is not as robust as SAS's, and I would advise using SAS for binomial
regression problems.
Perhaps I have missed something or perhaps other people have different
experiences. So, please take this with the usual caution.
Best,
Constantine
--
Constantine Daskalakis, ScD
Assistant Professor,
Thomas Jefferson University, Division of Biostatistics
1015 Chestnut St., Suite M100, Philadelphia, PA 19107
Tel: 215-955-5695
Fax: 215-503-3804
Email: [email protected]
Webpage: http://www.jefferson.edu/clinpharm/biostatistics/