Re: st: cluster() or svy? (analysis of cluster-randomized trials)
From: "Michael I. Lichter" <[email protected]>
To: [email protected]
Subject: Re: st: cluster() or svy? (analysis of cluster-randomized trials)
Date: Tue, 09 Sep 2008 16:32:54 -0400
Thanks to Austin and Jeph for responding. In reply to Jeph ...
> I think there are good reasons to avoid both. You don't say what kinds
> of analyses you have, but see ssc describe cltest for some tools and
> a reference for analyzing cluster randomized outcomes using
> adjustments to the standard chi-2 and t-tests.
Can you explain why to avoid both? Aren't they adjusting for the same
phenomenon--clustering of observations? I'll describe the analyses, but
that will take some background ...
This is a small trial of an intervention designed to promote
guideline-based diagnosis and treatment of patients with chronic kidney
disease (CKD). Four medical practices were selected and two each were
randomly assigned to control and intervention. (Yes, I know that it is
not recommended to do CRT with fewer than 5 clusters per arm.) Primary
indicators include glomerular filtration rate (GFR) and whether or not
patients with substandard GFR were diagnosed during the trial period as
having CKD. We predict stable or rising GFR in intervention practices
compared to falling GFR in control practices, and higher rates of
physician-diagnosed CKD in intervention practices compared to control
practices. The universe of patients is those with substandard GFR levels
prior to the intervention.
For GFR, I was planning to regress pre/post absolute change in GFR on a
dummy for control vs. not. (I'd like to include covariates like age and
sex, but don't have the degrees of freedom). In partial answer to
Austin's question about differences in results between cluster() and
svy, and also to ask about a problem with clttest, I've included output
below for this regression (1) unclustered, (2) with the cluster()
option, (3) with the svy command, and (4) with clttest -- which isn't a
regression but does essentially the same thing in this instance.
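A note on setup: for the -svy- runs the medical practice is the PSU. I
haven't reproduced the -svyset- line in the log below, so take this as a
minimal sketch of the assumed declaration (no weights, no explicit strata):

svyset rsiteid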
. reg gfr_achg rcontrol   /* unclustered */

      Source |       SS       df       MS              Number of obs =     159
-------------+------------------------------           F(  1,   157) =    0.30
       Model |  30.4456806     1  30.4456806           Prob > F      =  0.5834
    Residual |  15825.7933   157  100.801231           R-squared     =  0.0019
-------------+------------------------------           Adj R-squared = -0.0044
       Total |   15856.239   158  100.355943           Root MSE      =   10.04

------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.405498    2.487565
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.147948     2.63731
------------------------------------------------------------------------------
. reg gfr_achg rcontrol, cluster(rsiteid)   /* clustered */

Linear regression                                      Number of obs =     159
                                                       F(  1,     3) =   17.62
                                                       Prob > F      =  0.0247
                                                       R-squared     =  0.0019
Number of clusters (rsiteid) = 4                       Root MSE      =   10.04

------------------------------------------------------------------------------
             |               Robust
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2284233    -4.20   0.025    -1.685911   -.2320217
       _cons |  -.2553191    .223962    -1.14   0.337    -.9680661    .4574278
------------------------------------------------------------------------------
. svy: reg gfr_achg rcontrol   /* survey */
(running regress on estimation sample)

Survey: Linear regression

Number of strata =       1                     Number of obs    =         159
Number of PSUs   =       4                     Population size  =         159
                                               Design df        =           3
                                               F(   1,      3)  =       17.74
                                               Prob > F         =      0.0245
                                               R-squared        =      0.0019

------------------------------------------------------------------------------
             |             Linearized
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2276993    -4.21   0.024    -1.683607   -.2343258
       _cons |  -.2553191   .2232521    -1.14   0.336     -.965807    .4551687
------------------------------------------------------------------------------

. estat effects

----------------------------------------------------------
             |             Linearized
    gfr_achg |      Coef.   Std. Err.       Deff      Deft
-------------+--------------------------------------------
    rcontrol |  -.9589666   .2276993     .012158   .110264
       _cons |  -.2553191   .2232521     .013712   .117098
----------------------------------------------------------
. clttest gfr_achg, by(rcontrol) cluster(rsiteid)   /* clustered t-test */

t-test adjusted for clustering
gfr_achg by rcontrol, clustered by rsiteid
------------------------------------------------------------------------
Intra-cluster correlation = -0.0267
------------------------------------------------------------------------
                   N   Clusts      Mean        SE              95 % CI
rcontrol=0        47        2   -0.2553    0.7924   [-10.3243,  9.8137]
rcontrol=1       112        2   -1.2143         .   [        .,      .]
------------------------------------------------------------------------
Combined         159        2   -0.9308         .   [        .,      .]
------------------------------------------------------------------------
Diff(0-1)        159        4    0.9590         .   [        .,      .]
------------------------------------------------------------------------
Degrees freedom: 2

Ho: mean(0) - mean(1) = mean(diff) = 0

 Ha: mean(diff) < 0        Ha: mean(diff) ~= 0        Ha: mean(diff) > 0
      t =  2.1346               t =  2.1346                t =  2.1346
  P < t =  0.9168          P > |t| =  0.1664           P > t =  0.0832
Suggestions on why the t-test didn't fully work (it didn't calculate SEs
for the rcontrol=1 group, the combined sample, or the difference) would be
welcome--it worked fine for a t-test of differences in the post-intervention
GFR itself.
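As a cross-check, the cluster-level analysis that's usually recommended
with so few clusters can also be run by hand: collapse to practice means
and do an ordinary t-test on the four means. A minimal sketch, using the
variable names above (not run here):

preserve
collapse (mean) gfr_achg, by(rsiteid rcontrol)   // one obs per practice
ttest gfr_achg, by(rcontrol)                     // t-test on the 4 cluster means
restore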
BTW, you might have noticed that the SEs are *smaller* in the cluster/svy
models than in the unclustered model. That's because the variation within
clusters is much larger than the differences between them--you can see
this also in the deff and deft being less than 1.0. Does this give me an
excuse to treat the data as unclustered?
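A direct way to see that within/between split is a one-way ANOVA of the
change score on the practice identifier, which also reports the estimated
intraclass correlation (it should come out near zero or negative here).
A quick sketch with the variable names above:

loneway gfr_achg rsiteid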
On the other hand, when I look at ckd2 (diagnosed with CKD) for those
not diagnosed before the start of the study (ckd1 == 0), I get a
substantial design effect:
. svy: logit ckd2 rcontrol if ckd1==0
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =       1                     Number of obs    =         259
Number of PSUs   =       4                     Population size  =         259
                                               Design df        =           3
                                               F(   1,      3)  =       10.64
                                               Prob > F         =      0.0471

------------------------------------------------------------------------------
             |             Linearized
        ckd2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -2.058052   .6309325    -3.26   0.047     -4.06596   -.0501428
       _cons |   1.015231   .4127449     2.46   0.091    -.2983078    2.328769
------------------------------------------------------------------------------

. estat effects

----------------------------------------------------------
             |             Linearized
        ckd2 |      Coef.   Std. Err.       Deff      Deft
-------------+--------------------------------------------
    rcontrol |  -2.058052   .6309325     4.61385   2.14799
       _cons |   1.015231   .4127449     3.11419   1.76471
----------------------------------------------------------
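For a side-by-side comparison with the GFR models, the same contrast could
also be run with the cluster() option instead of -svy-; I haven't pasted
that output, but the command would simply be:

* cluster-robust counterpart of the svy logit above (output not shown)
logit ckd2 rcontrol if ckd1==0, cluster(rsiteid)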
Does all that make sense?
> Another preferred option is to use panel methods such as -xtmixed-
> with the clusters specified as panels. Even if you don't have
> covariates (and in an RCT you will need to make a case for including
> them), these are often preferred.
This is preferred because ... ?
. xtmixed gfr_achg rcontrol

Mixed-effects REML regression                   Number of obs      =       159

                                                Wald chi2(1)       =      0.30
Log restricted-likelihood = -589.18999          Prob > chi2        =    0.5826

------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.378931    2.460998
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.125651    2.615013
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                sd(Residual) |   10.03998   .5630142      8.994974     11.2064
------------------------------------------------------------------------------
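One caveat about that command as I typed it: without a random-effects
equation it fits only the fixed part (note that the random-effects table
shows nothing but sd(Residual)), so it isn't really treating the practices
as panels. If the idea is a random intercept for each practice, the call
would presumably look like this sketch (not re-run here):

* random intercept for practice; this is what "clusters as panels" suggests
xtmixed gfr_achg rcontrol || rsiteid: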
> More detail on your design might produce more detailed answers.
See above.
> Hope this helps,
> Jeph
It does help. Thanks!
Michael I. Lichter wrote:
> Hello, friends. I have a question about the analysis of data from
> cluster-randomized trials (CRTs). CRTs are experiments where subjects
> are randomly assigned to conditions (control, treatment) based on
> their group membership rather than being assigned individually, as is
> usually the case in randomized controlled trials. In my study, the
> clusters are medical practices, so when a medical practice is
> assigned to a condition, all of the eligible patients therein are
> also assigned to the condition. CRTs should be analyzed using methods
> that take account of the clustering in the study design, of course.
> My question is this: For CRTs, is there any statistical reason for
> preferring the cluster() option on estimation commands (e.g.,
> regress, logit) over the survey commands, or vice-versa? I've used
> both and the results are similar, but the survey commands estimate
> larger standard errors. If the answer is that they're both equally
> appropriate but produce different results because they use somewhat
> different methods of estimation, that's fine.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/