Ricardo Ovaldia wrote:
> I am a bit baffled by the assertion that 50 clusters
> and 410 observations is a small sample size. I know it is
> not big, but I would not consider it small either.
Whether 50 clusters and 410 total observations is small or not depends upon
the task. Advocating caution to ensure that the sample size is adequate for
the intended purpose is not the same as asserting that a particular sample
size is small. For population-average GEE, which is sensitive to the number
of clusters, rules of thumb for sample size over ranges of numbers of
predictors are given in M. E. Stokes, C. S. Davis & G. G. Koch,
_Categorical Data Analysis Using the SAS System_, Second Edition (Cary,
North Carolina: SAS Institute, 2000), p. 479. If you have many candidate
predictors among those for patients and physicians, my guess is that the
authors would say that 50 clusters is pretty dicey.
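For concreteness, a population-average fit in Stata would look something
like the line below; the variable names (refer for the referral indicator,
physid for the physician identifier, patage and docage as stand-in patient
and physician covariates) are hypothetical.
* population-average logistic GEE with exchangeable working correlation
xtgee refer patage docage, i(physid) family(binomial) link(logit) corr(exchangeable) robust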
I don't recall having recently run across any corresponding guidance for
random-effects logistic regression, which depends more upon within-cluster
correlation and total observations. Can -simulate- tell you about the
adequacy of the sample size for your purposes (e.g., for confidence interval
coverage) in your particular dataset with the parameters set at their
estimates? Generating a correlated binary variate to match the observed rho
is tough, but you might be able to get reasonably close. If you're
satisfied with the results of the simulation for the model's intended use,
then the sample size is not too small.
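By way of a minimal sketch of the coverage check (reusing the exchangeable
correlation matrix A built in the first code block below; the true
coefficient on trt is zero here, so a 95% Wald interval ought to cover zero
about 95% of the time; the program name cicover is just for illustration):
program define cicover, rclass
    version 8.2
    drawnorm dep1 dep2 dep3 dep4 dep5 dep6, corr(A) n(50) clear
    generate byte pid = _n
    generate byte trt = _n > _N / 2
    reshape long dep, i(pid) j(tim)
    replace dep = dep > 0
    xtlogit dep trt, i(pid) re
    * does the 95% Wald interval for trt contain the true value, zero?
    return scalar cover = abs(_b[trt] / _se[trt]) < invnorm(0.975)
end
simulate "cicover" cover = r(cover), reps(1000)
summarize cover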
In the simple-minded illustration below, with a sample size of 50 clusters,
a uniform cluster size of six observations and a moderate-to-high
within-cluster correlation (rho of about 80% or so), the test size was 11.5%
at the nominal 5% Type 1 error rate. That's more than double the nominal
level, and if the purpose is hypothesis testing, then the sample size would
be considered too small given the nature of the data and the objective.
This improves, of course, when there is no within-cluster correlation: in
the simple example below it falls to 6.7%, which is still substantially
larger than nominal. But if this isn't critical for the objective, then the
sample would not necessarily be considered small.
> The question posed in this phase of analysis is rather
> simple: Which physician and patient characteristics
> are important in predicting patient referral?
Have you considered coupling modeling with graphical analysis at this phase?
Strength and nature of the relationships observed graphically could be
combined with knowledge of the subject matter to judge importance of
predictors. Plots could be made of observations or of predictions from
models after holding one or more covariates at reference values. If your
audience doesn't feel comfortable judging the strength or importance of the
relationship based upon what they can see by graphical presentation, then
numerical description of the predictions can be done either with summary
statistics (including tabulations) or by a model, perhaps with standardized
coefficients if that makes it easier for your audience. For the next phase,
the model can be made parsimonious based upon what's observed in the plots
or what's judged unimportant in earlier stages of exploration. It might be
beneficial to use two models to describe your observations: one, a
conditional logistic regression with physicians as groups, to describe
patient characteristics that predict referral; the other, a count model, to
describe physician characteristics that predict referral rates.
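Something along these lines, say, with hypothetical variable names (refer,
physid, patage, patsex, docage, docvol) and -poisson- with a caseload
exposure as one concrete choice of count model:
* patient-level model: within-physician comparisons of patient traits
clogit refer patage patsex, group(physid) or
* physician-level model: referral counts per physician
generate byte one = 1
collapse (sum) nrefer=refer npat=one (mean) docage docvol, by(physid)
poisson nrefer docage docvol, exposure(npat) irr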
Joseph Coveney
----------------------------------------------------------------------------
clear
set more off
set seed 20040809
* build a 6 x 6 exchangeable correlation matrix A with rho = 0.8
set obs 6
forvalues i = 1/6 {
    generate float rho`i' = 0.8
    replace rho`i' = 1 in `i'
}
mkmat rho*, matrix(A)
*
* correlated case: dichotomize a latent multivariate normal drawn with corr(A)
program define xtlogitsimc, rclass
    version 8.2
    drawnorm dep1 dep2 dep3 dep4 dep5 dep6, corr(A) n(50) clear
    generate byte pid = _n
    generate byte trt = _n > _N / 2
    reshape long dep, i(pid) j(tim)
    replace dep = dep > 0
    compress
    xi: xtlogit dep trt i.tim, i(pid) re
    estimates store A
    xtlogit dep, i(pid) re
    estimates store B
    * joint LR test of trt and the time indicators (all true effects are zero)
    lrtest A B
    return scalar p = r(p)
end
*
simulate "xtlogitsimc" p = r(p), reps(1000)
generate byte pos = p < 0.05
replace pos = . if p >= .
summarize pos
*
*
*
* independent case: refill dep with independent Bernoulli(0.5) draws
program define xtlogitsimi, rclass
    version 8.2
    replace dep = uniform() > 0.5
    xi: xtlogit dep trt i.tim, i(pid) re
    estimates store A
    xtlogit dep, i(pid) re
    estimates store B
    lrtest A B
    return scalar p = r(p)
    estimates drop _all
end
*
* set up the panel skeleton once; xtlogitsimi refills dep each replication
clear
set obs 50
generate byte pid = _n
generate byte trt = _n > _N / 2
forvalues i = 1/6 {
    generate byte dep`i' = .
}
reshape long dep, i(pid) j(tim)
simulate "xtlogitsimi" p = r(p), reps(1000)
generate byte pos = p < 0.05
replace pos = . if p >= .
summarize pos
exit