st: overdispersion and underdispersion in nbreg / glm models
I take the Digest, and try to scan through the contents when possible. I'm pleased that I happened to catch your query.
Overdispersion in count models can arise for a wide variety of reasons. Identifying the source of overdispersion can help in finding a remedy for it. In some cases the remedy is such that, when applied, the model is no longer overdispersed. I call this apparent overdispersion. In other situations the remedy does not eliminate the fact that the data are overdispersed, but it adjusts the model -- usually the standard errors -- so that the effect of bias resulting from the overdispersion is minimized. Models of this second type have real overdispersion.
In the book, I create a simulated Poisson model with three or four defined parameter values. That is, for example, I define xb = b0 + b1*x1 + b2*x2 + b3*x3 with specific values for b*; e.g. xb = 1 + .5*x1 + .75*x2 - 1.2*x3. The x* values are all separately created random normal deviates; e.g. --gen x1 = invnorm(uniform())-- [[in current Stata, --gen x1 = invnormal(runiform())--, or simply --gen x1 = rnormal()--]]. I then use the values of xb in the command --rndpoisx-- or --genpoisson--. The result is a Poisson random variate, xp, structured by the values of xb. Running --glm xp x1 x2 x3, fam(poi)-- results in a Poisson model with parameter estimates and intercept having values very close, if not identical, to the values specified. The Pearson dispersion statistic is also very close to 1.0.
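Here is a minimal sketch of that simulation in current Stata syntax, using the built-in rpoisson() function in place of the user-written --rndpoisx--; the seed and sample size are arbitrary:

    clear
    set obs 10000
    set seed 2008
    gen x1 = invnormal(runiform())          // standard normal deviates
    gen x2 = invnormal(runiform())
    gen x3 = invnormal(runiform())
    gen xb = 1 + .5*x1 + .75*x2 - 1.2*x3    // true linear predictor
    gen xp = rpoisson(exp(xb))              // Poisson response; log link, so mean = exp(xb)
    glm xp x1 x2 x3, fam(poi)               // (1/df) Pearson should be close to 1.0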
I then remodel the data, taking out one of the predictors, let's say x1: --glm xp x2 x3, fam(poi)--. The parameter estimates are generally not the specified ones, and, more importantly, the dispersion statistic becomes greater than 1. Sometimes it is substantially greater than 1.
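With the simulated data above, the misspecified model is simply:

    glm xp x2 x3, fam(poi)                  // x1 omitted: (1/df) Pearson now exceeds 1.0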
What does this tell us? Well, when we are modeling data, we generally don't know in advance what the parameter estimates are going to be. If we do find, though, that the dispersion statistic substantially differs from 1.0, then we know that the model is not well fitted. We may not know why, though. In this case it was because a necessary predictor was missing from the model. In real situations, we hope that a variable is available in the data to remedy the fit; i.e. when it is put into the model, the dispersion closely approximates 1.0. The requisite predictor, however, may not have been collected. Again, in real situations, the missing predictor is one that is required to account for the extra correlation in the data, reflected in the dispersion statistic.
All of this discussion is within the context of a Poisson model.
You appear to have modeled the data as negative binomial (NB-2) rather than Poisson. The way you obtained the value of alpha for inclusion in the GLM NB model was correct. What many folks forget, though, is that the NB model can itself be extradispersed. It may, for example, have more variance in the data than is allowed given the value of the mean. Rather than compare the variance with mu, as in Poisson, here we compare it with mu + alpha*mu^2. The NB variance function may overcorrect the otherwise-Poisson overdispersion, giving a dispersion statistic of less than 1; or it may not adjust enough, with the variance in the data exceeding mu + alpha*mu^2 and the dispersion statistic exceeding 1.
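For reference, the workflow you describe can be sketched as follows, with y and x1-x3 as placeholder names for your response and predictors:

    nbreg y x1 x2 x3                        // ML negative binomial; reports alpha
    local a = e(alpha)                      // saved ML estimate of alpha
    glm y x1 x2 x3, fam(nb `a')             // NB-2 GLM with alpha fixed at the ML value

In the -glm- output, a (1/df) Pearson statistic below 1 indicates NB underdispersion; above 1, NB overdispersion.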
When I discussed the missing predictor and how it affects dispersion, I was focusing on differentiating apparent from real overdispersion. I did not address the NB model; that discussion had to do with eliminating overdispersion from within the Poisson model.
Here you are doing something quite different. The question appears to be why adding a particular predictor can change the model from being underdispersed (dispersion statistic < 1) to overdispersed (dispersion statistic > 1). But here we are referring to NB overdispersion, not Poisson overdispersion.
The addition of the new predictor evidently added considerably more correlation to the data. I suspect that if you display the correlations between all of the variables in the model, the new predictor will be rather highly correlated with one or more of the other variables; see the quick check sketched below. It is possible for interactions among the variables to keep the extra correlation from showing up in this way, but that too is rare.
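In Stata, a quick check, with newvar as a placeholder for the added predictor:

    pwcorr y x1 x2 x3 newvar, sig           // pairwise correlations with significance levels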
In any case, treat the inclusion of the variable as you would any other predictor: test it using the likelihood ratio test, as sketched below. It likely does not contribute to the model. If so, exclude it, and search for other reasons why there may be underdispersion. It may be that the data are simply NB-underdispersed (in distinction to Poisson-overdispersed), and adjustments can be made to the SEs, e.g. robust SEs. I suggest not scaling in this type of case, for reasons discussed in the book.
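A sketch of that test, and of the robust-SE adjustment, again with placeholder names (`a' being the alpha obtained from -nbreg- as above):

    nbreg y x1 x2 x3
    estimates store reduced
    nbreg y x1 x2 x3 newvar                 // both fits assumed to use the same sample
    estimates store full
    lrtest reduced full                     // LR test of the added predictor
    glm y x1 x2 x3, fam(nb `a') vce(robust) // robust SEs if underdispersion remains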
Perhaps I have overdone the explanation, but I thought it important to clarify the relationships involved, and to show why the discussion of the missing predictor is not relevant to the solution of your query. If you have additional questions, you can contact me directly at [email protected]
Joseph Hilbe
============================================
Date: Wed, 17 Dec 2008 10:36:00 +0000
From: "Ada Ma" <[email protected]>
Subject: st: overdispersion and underdispersion in nbreg / glm models
Dear Statalisters,
I'd been following Joseph Hilbe's book "Negative Binomial Regression"
(2007) and using some of my own data to try out methods laid out in
the book.
The book suggested that one can look at the Pearson's dispersion
output from the -glm- command to check if one's negative binomial
model is affected by underdispersion or overdispersion.
In the book it says that if one's model is affected by overdispersion, it could be caused by a missing explanatory variable. But my model seems to suggest quite the opposite, and I am not sure what to do.
When I added an explanatory variable to the model, the Pearson statistic went from indicating underdispersion to indicating overdispersion. Both models are estimated using the -glm- command with the "family(nb XXX)" option specified, XXX being the alpha value taken from the -nbreg- command output. Although the AIC and BIC of the model with the additional variable look better (lower), I really don't know which is worse. What should I do in order to resolve the dispersion problem? And, frankly speaking, are there other things that would tell me which model is better? Shall I bootstrap and jackknife???
All suggestions welcomed.
Regards,
Ada
--
Ada Ma
Research Fellow
Health Economics Research Unit
University of Aberdeen, UK.
http://www.abdn.ac.uk/heru/
Tel: +44 (0) 1224 553863
Fax: +44 (0) 1224 550926