From: "Steve Rothenberg" <drlead@prodigy.net.mx>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: significant gof with poisson in large N regression
Date: Mon, 14 May 2012 14:07:00 -0500

I'm running -poisson- to model data with over 4300 observation periods, with an "exposure" variable. The goal of the model is to get good predictions of mean counts given the explanatory variables. The models are stratified by age range, and counts run from 0 to more than 14.

In the preliminary analysis using all data, unstratified by age, -estat gof- returns a non-significant chi2. In several of the age-stratified models, -estat gof- returns results similar to those below:

    Goodness-of-fit chi2  =  4902.563
    Prob > chi2(4334)     =    0.0000

The model contains many highly significant variables. The predicted Pr(y=k) from Poisson differs from the observed proportion by less than 0.01 in every count bin, and by less than 0.001 in most bins.

I've read the manual entries on count data regressions as well as the entry for -estat gof-. I've searched Statalist and have found other large N Poisson regressions with similarly significant gof statistics, but no discussion of the power of -estat gof- for large N regressions.

I've tried applying constraints to groups of related categorical variables, fixing the coefficients at a common value across categories that have nearly identical coefficients, as suggested in the manual, without success in lowering the gof statistic. I've also (reluctantly) tried trimming the models of non-significant variables, again without changing the gof statistic much. I've tried various transformations of the continuous explanatory variables, again without success. I've tried running -nbreg-, but the estimated alpha is somewhere in the neighborhood of 5e-07, close enough to zero to rule out negative binomial.

There is considerable heterogeneity of residuals around some of the continuous explanatory variables, an expected pattern given the nature of these variables. I've used robust estimation of standard errors, though I don't think that -estat gof- takes this into account.

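To put the reported statistic on a scale, here is a minimal sketch of the arithmetic, written in Python and not taken from any Stata output. It computes the ratio chi2/df for the figures quoted above and an approximate upper-tail probability using the Wilson-Hilferty normal approximation (an approximation chosen here to avoid extra libraries; in Stata the exact tail is available from chi2tail()). The loop is a purely hypothetical comparison: it holds the same chi2/df ratio fixed and varies the residual degrees of freedom, to show how the tail probability would move with sample size alone. The function name and the df values 100, 500, and 1000 are illustrative assumptions.

```python
import math

def chi2_upper_tail_approx(x, df):
    """Approximate P(chi2_df > x) with the Wilson-Hilferty normal approximation."""
    v = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - v)) / math.sqrt(v)
    # Upper tail of the standard normal at z
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Figures from the -estat gof- output quoted above
chi2_reported = 4902.563
df_reported = 4334

ratio = chi2_reported / df_reported
p_reported = chi2_upper_tail_approx(chi2_reported, df_reported)
print(f"chi2/df = {ratio:.3f}, approximate p = {p_reported:.1e}")

# Hypothetical comparison: same chi2/df ratio, different residual df
for df in (100, 500, 1000, df_reported):
    p = chi2_upper_tail_approx(ratio * df, df)
    print(f"df = {df:5d}  approximate p = {p:.2g}")
```

The ratio comes out at about 1.13. A ratio near 1 is what a well-specified Poisson model would be expected to give, so the sketch only separates how far the statistic is from its degrees of freedom from how strongly a large df amplifies that distance in the p-value. It does not settle whether the fit is adequate for prediction.
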
I've thought of using the non-stratified data set with variables identifying the counts in each age stratum, plus the interactions of the age strata with key explanatory variables, but rearranging the data set to include these variables would be a very long process. I am also wary of interaction terms in non-linear models where different arms of the interaction have different variances.

I've seen enough "diagnostic" statistics, such as normality tests, return highly significant results when the N was very large, even though graphically there appeared to be little deviation from the theoretical distribution. Is it possible that the power of -estat gof- is such that, with large N analyses, even small deviations from expected values will return significant gof statistics?

I'd appreciate advice on whether to go with my eyeball comparison of predicted versus observed probabilities of counts and ignore the -estat gof- statistic, or whether that statistic in these circumstances really indicates a poorly fitted model. Of course, additional advice on other steps I could take to reduce the gof statistic would also be welcome.

Best,

Steve Rothenberg
Instituto Nacional de Salud Publica
Cuernavaca, Morelos, Mexico

Stata 12.1 MP, Windows 7