st: interpreting the estat gof commands and the Hosmer-Lemeshow version of it
From: Doug Hess <[email protected]>
To: [email protected]
Subject: st: interpreting the estat gof commands and the Hosmer-Lemeshow version of it
Date: Sun, 18 Sep 2011 14:09:40 -0400

Given all the cautions in Hosmer & Lemeshow's book, I'm a bit confused
as to what role and what interpretation should be given to the tests
that -estat gof- produces with and without the -group- option. The
results are below.

Without the grouping option, the Pearson chi2 test gives P>chi2=0.9999.
However, with groups (and the number of groups doesn't seem to matter
unless you have a very large number), the Hosmer-Lemeshow method gives
P>chi2=0.0000. From the R manual (p.958-9) and Hosmer & Lemeshow's
book (p.150 of the 2000 edition) I gather that the null hypothesis is
the same for both.

So, why the large difference? Is one more appropriate, or do both have
problems when the outcome is somewhat rare (11 percent of observations
have y=1 in my case)?

I see that Stata's R manual says "However, the number of covariate
patterns is close to the number of observations, making the
applicability of the Pearson chi2 test questionable but not
necessarily inappropriate" (p. 958). I have roughly 140,000
observations (households) and roughly 109,000 covariate patterns. If
this difference is important in deciding which of these tests to use,
what is the threshold for deciding that the number of patterns is
"close" to the number of observations? (It may help to know that the
number of covariate patterns drops to roughly 70,000, about half the
sample size, if I remove a half dozen continuous variables, which I am
thinking of doing by collapsing them into one or two scales or
factors.)
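
As a rough illustration of how the pattern count can be checked before and after coarsening a continuous covariate, here is a minimal sketch. It uses Stata's shipped auto dataset, not the household data described above; the model, the variables mpg and weight, and the 1,000-lb banding are placeholders chosen only for the example.

```stata
* Toy illustration on a shipped dataset (NOT the household data in this post)
sysuse auto, clear
quietly logit foreign mpg weight

* Pearson test: reports the number of covariate patterns it finds
estat gof

* Count the distinct covariate patterns directly
egen long pattern = group(mpg weight)
quietly summarize pattern
display "covariate patterns = " r(max) " of " r(N) " observations"

* Collapsing a continuous covariate into bands cannot increase the count
generate byte weight_band = ceil(weight/1000)
egen long pattern2 = group(mpg weight_band)
quietly summarize pattern2
display "patterns after banding = " r(max)
```

The same two -egen- lines, run on the real covariate list, should reproduce the "number of covariate patterns" that -estat gof- reports, provided the estimation sample has no missing values on those covariates.
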

If it helps, here are some additional details: My logistic model (11
percent of observations are y=1) has an optimal cutoff point for
maximizing the sensitivity and specificity at 0.10, which gives
approximately 75 percent for both sensitivity and specificity. The area
under the ROC curve is 0.83. I'm using Stata 12.

. estat gof

       number of observations =    143585
 number of covariate patterns =    108638
         Pearson chi2(108575) = 106784.16
                  Prob > chi2 =    0.9999

. estat gof, g(10) table

       number of observations =    143585
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =    322.31
                  Prob > chi2 =    0.0000

Decile  Pred Prob  Obs y=1  Exp y=1    Total  Diff  % diff
     1      0.019      115      190   14,359    75     65%
     2      0.025      194      315   14,358   121     62%
     3      0.034      305      419   14,359   114     37%
     4      0.044      443      560   14,359   117     26%
     5      0.055      671      704   14,361    33      5%
     6      0.072      864      904   14,355    40      5%
     7      0.100    1,379    1,213   14,359   166     12%
     8      0.163    2,122    1,827   14,358   295     14%
     9      0.302    3,615    3,207   14,359   408     11%
    10      0.856    6,175    6,543   14,358   368      6%
   Sum               15,883   15,883  143,585

I removed the observed and expected columns for y=0 for
formatting/simplicity. The column diff is the absolute value of Obs
minus Exp. The last column is that previous value as a percentage of
Obs y=1.
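
As a check on how the table relates to the reported statistic, the Hosmer-Lemeshow chi2 can be rebuilt approximately from the y=1 columns alone, since the y=0 counts are just Total minus the y=1 counts. The sketch below assumes the usual form of the statistic (squared observed-minus-expected over expected, summed over both outcomes in each group); because the expected counts in the table are rounded, the result should land close to, but not exactly on, 322.31. The variable names are made up for the example.

```stata
* Rebuild the Hosmer-Lemeshow statistic from the table above (rounded inputs)
clear
input decile obs1 exp1 total
 1  115  190 14359
 2  194  315 14358
 3  305  419 14359
 4  443  560 14359
 5  671  704 14361
 6  864  904 14355
 7 1379 1213 14359
 8 2122 1827 14358
 9 3615 3207 14359
10 6175 6543 14358
end

* (O1-E1)^2/E1 + (O0-E0)^2/E0, using O0-E0 = -(O1-E1) and E0 = total - E1
generate double hl_term = (obs1 - exp1)^2 * (1/exp1 + 1/(total - exp1))
list decile hl_term, noobs

quietly summarize hl_term
scalar hl = r(sum)
display "HL chi2 from table = " %8.2f hl
display "Prob > chi2 (8 df) = " chi2tail(8, hl)

* Pearson test as reported: chi2(108575) = 106784.16
display "Pearson Prob > chi2 = " %6.4f chi2tail(108575, 106784.16)
```

Listing hl_term shows which deciles drive the statistic. Separately, the Pearson degrees of freedom (108575) are the 108638 covariate patterns minus the 63 estimated parameters.
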
Thank you.
-Doug