Maarten Buis directed me to a short paper of his: "Unobserved heterogeneity
in logistic regression":
http://home.fsw.vu.nl/m.buis/
The concept makes sense--the question is what to do about it.
I am using in-hospital mortality as an outcome in a multivariable logistic
model, focusing on a particular laboratory test (troponin I) as a predictor
(either with simple log transformation, or using -mfp-). I test
independence by doing nested logistic models with every other mortality
predictor that I can find (some continuous, some dichotomous), and the odds
ratio for the test of interest remains stable (and Hosmer-Lemeshow
goodness-of-fit stats do not reject the models). My sample sizes are on the order of
10,000-30,000 observations per data set.
The overall mortality is 3.3% and the predictor of interest is strongly
skewed to the left (see below).
My questions are these:
There are of course many unobserved causes for in-hospital mortality, but
insofar as this particular model seems to work, do I need to deal with this?
If one does try to deal with it in a situation such as mine, is it a matter
of using a method other than simple logistic regression to fit the model, or
is it more a matter of assessment of goodness of fit?
In either case, can anybody point me in the right direction (reference-wise)
toward (1) assessing the degree of unobserved heterogeneity (2) fitting a
model which deals with it if it exists and (3) testing the model?
(I am one of those dangerous physician researchers who has more computing
power than formal statistical training, although I am trying).
Some output follows:
Note that zlog is the log-10 transformation of the predictor of interest
(troponin), with (troponin==0) represented by the dummy variable (zero==1)
and zlog replaced with zero where (troponin==0). Don't get too caught up in
this part--it works.
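For readers who want to see that coding concretely, here is a minimal sketch of how zlog and zero could be built. It assumes the raw laboratory value is stored in a variable called troponin (a hypothetical name; the actual variable is not shown in this post) and that missing values should stay missing.

```stata
* Sketch only: "troponin" is an assumed name for the raw lab value
* zero flags observations where troponin is exactly 0
generate byte zero = (troponin == 0) if !missing(troponin)

* zlog is log10(troponin) where that is defined (troponin > 0)
generate double zlog = log10(troponin) if troponin > 0 & !missing(troponin)

* where troponin is 0, set zlog to 0 so the zero dummy carries the effect
replace zlog = 0 if troponin == 0
```

With this coding, the coefficient on zero compares troponin==0 with troponin==1 (where zlog is also 0), and zlog describes the slope among the nonzero values.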
************************
*Univariate (calling zlog and zero one predictor)
. logistic is_dead zlog zero if instudy==1 & bimc==1
Logistic regression Number of obs = 13207
LR chi2(2) = 146.66
Prob > chi2 = 0.0000
Log likelihood = -1767.6459 Pseudo R2 = 0.0398
--------------------------------------------------------------------
is_dead | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------+----------------------------------------------------------
zlog | 1.943987 .1429218 9.04 0.000 1.683113 2.245296
zero | .2050451 .0267263 -12.16 0.000 .1588184 .2647268
--------------------------------------------------------------------
. sum zlog, detail
zlog
------------------------------------------------------------
Percentiles Smallest
1% -2 -2.69897
5% -2 -2.69897
10% -1.69897 -2.522879 Obs 47062
25% -1.30103 -2.39794 Sum of Wgt. 47062
50% -.30103 Mean -.6325129
Largest Std. Dev. .7529837
75% 0 1.942504
90% 0 1.960471 Variance .5669845
95% 0 1.96708 Skewness -.4340483
99% .6532125 1.975891 Kurtosis 2.02339
. estat gof if e(sample),group(10) table
Logistic model for is_dead, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
(There are only 7 distinct quantiles because of ties)
+---------------------------------------------------------+
| Group | Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
|-------+--------+-------+-------+-------+--------+-------|
| 4 | 0.0176 | 105 | 105.0 | 5862 | 5862.0 | 5967 |
| 5 | 0.0226 | 33 | 29.8 | 1287 | 1290.2 | 1320 |
| 6 | 0.0275 | 23 | 24.4 | 866 | 864.6 | 889 |
| 7 | 0.0355 | 43 | 45.6 | 1348 | 1345.4 | 1391 |
| 8 | 0.0430 | 60 | 50.3 | 1181 | 1190.7 | 1241 |
|-------+--------+-------+-------+-------+--------+-------|
| 9 | 0.0553 | 42 | 53.4 | 1042 | 1030.6 | 1084 |
| 10 | 0.2452 | 108 | 105.5 | 1207 | 1209.5 | 1315 |
+---------------------------------------------------------+
number of observations = 13207
number of groups = 7
Hosmer-Lemeshow chi2(5) = 5.20
Prob > chi2 = 0.3924
*Now, adding age (untransformed) to the previous model:
. logistic is_dead zlog zero age if instudy==1 & bimc==1
Logistic regression Number of obs = 13207
LR chi2(3) = 254.38
Prob > chi2 = 0.0000
Log likelihood = -1713.7849 Pseudo R2 = 0.0691
--------------------------------------------------------------------
is_dead | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------+----------------------------------------------------------
zlog | 1.905635 .1413964 8.69 0.000 1.647712 2.203932
zero | .2252449 .0295986 -11.34 0.000 .1741012 .2914125
age | 1.035355 .003652 9.85 0.000 1.028222 1.042538
--------------------------------------------------------------------
. estat gof if e(sample),group(10) table
Logistic model for is_dead, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
+---------------------------------------------------------+
| Group | Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
|-------+--------+-------+-------+-------+--------+-------|
| 1 | 0.0090 | 8 | 9.3 | 1314 | 1312.7 | 1322 |
| 2 | 0.0121 | 16 | 13.9 | 1304 | 1306.1 | 1320 |
| 3 | 0.0155 | 22 | 18.2 | 1299 | 1302.8 | 1321 |
| 4 | 0.0192 | 26 | 22.8 | 1294 | 1297.2 | 1320 |
| 5 | 0.0232 | 24 | 28.0 | 1297 | 1293.0 | 1321 |
|-------+--------+-------+-------+-------+--------+-------|
| 6 | 0.0281 | 40 | 33.9 | 1281 | 1287.1 | 1321 |
| 7 | 0.0340 | 35 | 40.8 | 1285 | 1279.2 | 1320 |
| 8 | 0.0434 | 45 | 50.4 | 1276 | 1270.6 | 1321 |
| 9 | 0.0641 | 70 | 69.1 | 1251 | 1251.9 | 1321 |
| 10 | 0.4050 | 128 | 127.5 | 1192 | 1192.5 | 1320 |
+---------------------------------------------------------+
number of observations = 13207
number of groups = 10
Hosmer-Lemeshow chi2(8) = 4.91
Prob > chi2 = 0.7672
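The two fits above are nested, so they can also be compared formally with a likelihood-ratio test. The following is a sketch under the assumption that both models are refit on the same estimation sample; the stored-estimate names m_trop and m_trop_age are hypothetical labels chosen for illustration.

```stata
* Sketch only: m_trop and m_trop_age are arbitrary names for stored results
quietly logistic is_dead zlog zero if instudy==1 & bimc==1
estimates store m_trop

quietly logistic is_dead zlog zero age if instudy==1 & bimc==1
estimates store m_trop_age

* likelihood-ratio test of the smaller model against the larger one
lrtest m_trop m_trop_age
```

From the log likelihoods reported above, the statistic should be about 2*(1767.6459 - 1713.7849) = 107.72 on 1 degree of freedom, which matches the difference between the two LR chi2 values (254.38 - 146.66).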