David Airey wrote:
But when stuck with a small data set, why not run a model designed for
that data structure, as opposed to running a model not designed for it?
When does ignoring the clustering become more favorable than
acknowledging the presence of fewer than an optimal number of clusters?
Why is it not the case that a good model on a small data set is always
better than a bad model on the same small data set? I hope I'm clear.
-------------------------------------------------------------------------------
Both the population-averaged GEE and subject-specific maximum likelihood
approaches are considered "large-sample" methods, and that was the basis of my
suggesting caution with Ricardo's sample size of 50. I didn't intend to
suggest as alternatives either fitting a model that isn't suited to the task
or ignoring the clustering.
It might be helpful to conceptualize the objectives of modeling in terms of an
exploratory-confirmatory dichotomy or continuum. On the one hand, modeling
could be used to explore data in the hope that the exercise will provide
insight. Some would argue that model-fitting is tendentious for this purpose.
Exploratory use would also include using models to describe concisely insights
gleaned from other exploratory methods. An example of this came up on the list
last month in the use of -ologit- with an interaction term to describe
numerically (and perhaps lend corroboration to) what is observed in -ordplot-
or -distplot- (a sketch follows this paragraph). On the other hand, modeling
can be used to estimate parameters and to make formal statements about them,
including confidence interval construction and hypothesis testing. In this
latter usage, attention to sample-size requirements (and other assumptions)
would be especially important, although I wouldn't throw caution to the wind
in exploratory usage, either.
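To make that example concrete, here is a minimal sketch of the exploratory
pairing (the variables response, group, and dose are hypothetical; -distplot-
and -ordplot- are user-written commands by Nick Cox, obtainable via -findit-):

  * plot cumulative distributions of an ordinal response across groups
  distplot response, by(group)

  * describe the same pattern numerically with an ordered logit
  * that includes a group-by-dose interaction
  ologit response i.group##c.dose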
Ricardo didn't mention the objective of his usage. If it involves the latter
type, I could imagine a reviewer--either a journal referee or a regulatory
agency reviewer--answering David's question by stating that a good model on a
small dataset is not better than a bad model on the same dataset when the
sample size is not sufficient for the good model's intended use. And I
wouldn't count on a pre-emptive mea culpa acknowledging the presence of fewer
than an optimal (adequate) number of clusters to get me off the hook in this
situation.
--------------------------------------------------------------------------------
Ricardo Ovaldia wrote:
. . . I have a couple of follow-up questions:
> If there
> is a substantial correlation between the fixed
> effects (physician covariates) and the random
> effect, then the parameters are liable not to be
> consistently estimated.
How can I test this?
. . . I guess the question that remains is whether or not I
can justifiably use this approach?
-------------------------------------------------------------------------------
Hausman's test, which Stata implements in -hausman- and -xthausman-, is the
one I am aware of for testing this assumption in linear models. Others on the
list might be able to help you, but I'm not familiar enough with how this
assumption is evaluated in nonlinear models, like logistic regression, where
devising the proper fixed-effects comparison model is tricky. Is it possible
to test the assumption for patient predictors using Hausman's test (via
-hausman- or -suest-) with -clogit- (consistent) against -xtlogit, re-
(efficient / consistent-under-the-null)?
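If it is, a minimal sketch might look like this (outcome, x1, and x2 are
hypothetical patient-level variables, and physician identifies the clusters;
physician-level predictors drop out of -clogit-, hence the restriction to
patient predictors):

  * fixed-effects (conditional) logit: consistent whether or not the
  * random effect is correlated with the covariates
  clogit outcome x1 x2, group(physician)
  estimates store fe

  * random-effects logit: efficient, but consistent only under the null
  xtset physician
  xtlogit outcome x1 x2, re
  estimates store re

  * Hausman test: consistent estimator first, efficient estimator second
  hausman fe re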
You might consider fitting the model using -gllamm-, perhaps without the
predictor in question, generating the random-effects predictions using
-gllapred-, and examining scatterplots of those predictions against the
various predictors. I'm unaware of anyone in-the-know suggesting this
approach, though, and so I suspect that it would have difficulty withstanding
scrutiny.
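For what it's worth, a sketch of that diagnostic (same hypothetical variable
names as above; -gllamm- and -gllapred- are user-written, obtainable via
-findit gllamm-; phys_x stands in for a physician-level predictor):

  * random-intercept logistic regression via -gllamm-
  gllamm outcome x1 x2, i(physician) link(logit) family(binomial)

  * posterior means and SDs of the random effects (creates um1 and us1)
  gllapred u, u

  * keep one record per physician and plot the predicted random
  * effects against a predictor of interest
  egen pick = tag(physician)
  scatter um1 phys_x if pick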
The preferred approach might be to base the assumption's tenability on
external knowledge of the subject area, relegating testing of the assumption
to the same status as use of Levene's test or Bartlett's test prior to ANOVA.
Ricardo's data seem to come from an observational study. Others on the list
can speak much more authoritatively, but my impression is that random-effects
regression is not especially favored in such circumstances: the assumption
that the random effects and the predictors are uncorrelated is likely to be
violated, and a failure to reject the null hypothesis in the Hausman test
would be only cold comfort.
Not knowing Ricardo's objectives or the subject matter, about the only
suggestion that comes to mind on whether the approach is justifiable is to
consider the assumptions made in the process. In addition to the two already
mentioned, another would be the degree to which physicians (or their
predictors) predict patient predictors.
Joseph Coveney