Statalist



Re: st: cluster() or svy? (analysis of cluster-randomized trials)


From   "Michael I. Lichter" <[email protected]>
To   [email protected]
Subject   Re: st: cluster() or svy? (analysis of cluster-randomized trials)
Date   Tue, 09 Sep 2008 16:32:54 -0400

Thanks to Austin and Jeph for responding. In reply to Jeph ...
> I think there are good reasons to avoid both. You don't say what kinds of analyses you have, but see ssc describe cltest for some tools and a reference for analyzing cluster randomized outcomes using adjustments to the standard chi-2 and t-tests.
Can you explain why I should avoid both? Aren't they adjusting for the same phenomenon, the clustering of observations? I'll describe the analyses, but that will take some background ...

This is a small trial of an intervention designed to promote guideline-based diagnosis and treatment of patients with chronic kidney disease (CKD). Four medical practices were selected and two each were randomly assigned to control and intervention. (Yes, I know that it is not recommended to run a CRT with fewer than 5 clusters per arm.) Primary indicators include glomerular filtration rate (GFR) and whether or not patients with substandard GFR were diagnosed during the trial period as having CKD. We predict stable or rising GFR in intervention practices compared to falling GFR in control practices, and higher rates of physician-diagnosed CKD in intervention practices compared to control practices. The universe of patients is those with substandard GFR levels prior to the intervention.

For GFR, I was planning to regress pre/post absolute change in GFR on a dummy for control vs. not. (I'd like to include covariates like age and sex, but don't have the degrees of freedom.) In partial answer to Austin's question about differences in results between cluster() and svy, and also to ask about a problem with clttest, I've included output below for this regression (1) unclustered, (2) with the cluster() option, (3) with the svy prefix, and (4) with clttest, which isn't a regression but does essentially the same thing in this instance.

. reg gfr_achg rcontrol /* unclustered */

      Source |       SS       df       MS              Number of obs =     159
-------------+------------------------------           F(  1,   157) =    0.30
       Model |  30.4456806     1  30.4456806           Prob > F      =  0.5834
    Residual |  15825.7933   157  100.801231           R-squared     =  0.0019
-------------+------------------------------           Adj R-squared = -0.0044
       Total |   15856.239   158  100.355943           Root MSE      =   10.04

------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.405498    2.487565
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.147948     2.63731
------------------------------------------------------------------------------

. reg gfr_achg rcontrol, cluster(rsiteid) /* clustered */

Linear regression                                      Number of obs =     159
                                                       F(  1,     3) =   17.62
                                                       Prob > F      =  0.0247
                                                       R-squared     =  0.0019
Number of clusters (rsiteid) = 4                       Root MSE      =   10.04

------------------------------------------------------------------------------
             |               Robust
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2284233    -4.20   0.025    -1.685911   -.2320217
       _cons |  -.2553191    .223962    -1.14   0.337    -.9680661    .4574278
------------------------------------------------------------------------------

. svy: reg gfr_achg rcontrol /* survey */
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs      =       159
Number of PSUs     =         4                  Population size    =       159
                                                Design df          =         3
                                                F(   1,      3)    =     17.74
                                                Prob > F           =    0.0245
                                                R-squared          =    0.0019

------------------------------------------------------------------------------
             |             Linearized
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2276993    -4.21   0.024    -1.683607   -.2343258
       _cons |  -.2553191   .2232521    -1.14   0.336     -.965807    .4551687
------------------------------------------------------------------------------

. estat effects

----------------------------------------------------------
             |             Linearized
    gfr_achg |      Coef.   Std. Err.       Deff      Deft
-------------+--------------------------------------------
    rcontrol |  -.9589666   .2276993     .012158   .110264
       _cons |  -.2553191   .2232521     .013712   .117098
----------------------------------------------------------

. clttest gfr_achg, by(rcontrol) cluster(rsiteid) /* clustered t-test */

t-test adjusted for clustering
gfr_achg by rcontrol, clustered by rsiteid
------------------------------------------------------------------------
Intra-cluster correlation   =  -0.0267
------------------------------------------------------------------------
                 N    Clusts      Mean        SE          95 % CI
rcontrol=0      47       2      -0.2553    0.7924   [-10.3243,   9.8137]
rcontrol=1     112       2      -1.2143         .   [       .,        .]
------------------------------------------------------------------------
Combined       159       2      -0.9308         .   [       .,        .]
------------------------------------------------------------------------
Diff(0-1)      159       4       0.9590         .   [       .,        .]

Degrees freedom: 2

                 Ho: mean(-) = mean(diff) = 0

 Ha: mean(diff) < 0        Ha: mean(diff) ~= 0        Ha: mean(diff) > 0
     t =   2.1346               t =   2.1346               t =   2.1346
 P < t =   0.9168         P > |t| =   0.1664           P > t =   0.0832


Suggestions on why the t-test didn't work (it didn't calculate an SE for rcontrol=1, the combined sample, or the difference) would be welcome. It worked fine for a t-test of differences in the post-GFR itself.
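
One possible explanation for the missing SEs, offered as a guess rather than a reading of the clttest source: the estimated intra-cluster correlation is negative. If the adjustment multiplies the variance by the usual inflation factor 1 + (m - 1)*rho, that factor goes negative once the cluster size m is large enough, and its square root is then missing. The check below uses the group sizes from the output above and, as a simplifying assumption, the average cluster size in each arm for m. The scalar names are made up for the illustration.

```stata
* Hypothetical check, NOT taken from the clttest code.
* Assumes variance inflation factor = 1 + (m - 1)*icc,
* with m approximated by the average cluster size in each arm.
scalar icc = -0.0267
* rcontrol==0: 47 patients in 2 clusters
scalar m0 = 47/2
* rcontrol==1: 112 patients in 2 clusters
scalar m1 = 112/2
scalar vif0 = 1 + (m0 - 1)*icc
scalar vif1 = 1 + (m1 - 1)*icc
display "inflation factor, rcontrol==0: " vif0
display "inflation factor, rcontrol==1: " vif1
* In Stata, sqrt() of a negative number returns missing (.)
display "sqrt of factor, rcontrol==1:   " sqrt(vif1)
```

Under those assumptions the factor is about 0.40 for rcontrol=0 and about -0.47 for rcontrol=1, which would fit the pattern of an SE being reported only for rcontrol=0.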

BTW, you might have noticed that the SEs are *smaller* in the cluster/svy model compared to the unclustered model. That's because the internal variation within clusters is much larger than the differences between them--you can see this also in the deff and deft being less than 1.0. Does this give me an excuse to treat the data as unclustered?
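
On the cluster() versus svy comparison itself: the two sets of SEs above differ only slightly, and the gap looks like a finite-sample multiplier rather than a different estimator. The usual description is that regress with cluster() scales the robust variance by (N-1)/(N-k) on top of the M/(M-1) cluster factor, while the svy linearized variance with a single stratum applies only the cluster factor. The arithmetic below checks that description against the numbers reported above; it is a sketch, and the scalar names are made up.

```stata
* Sketch: rescale the svy SEs by sqrt((N-1)/(N-k)) and compare with cluster().
* N = 159 observations; k = 2 coefficients (rcontrol and _cons).
scalar n_obs  = 159
scalar n_coef = 2
scalar adj = sqrt((n_obs - 1)/(n_obs - n_coef))
display "multiplier:             " %9.7f adj
* svy SE for rcontrol was .2276993; cluster() reported .2284233
display "rescaled SE, rcontrol:  " %9.7f 0.2276993*adj
* svy SE for _cons was .2232521; cluster() reported .223962
display "rescaled SE, _cons:     " %9.7f 0.2232521*adj
```

The rescaled values come out at about .2284233 and .223962, matching the cluster() output to the displayed precision.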

On the other hand, when I look at ckd2 (diagnosed with CKD) for those not diagnosed before the start of the study (ckd1 == 0), I get a substantial design effect:

. svy: logit ckd2 rcontrol if ckd1==0
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =         1                  Number of obs      =       259
Number of PSUs     =         4                  Population size    =       259
                                                Design df          =         3
                                                F(   1,      3)    =     10.64
                                                Prob > F           =    0.0471

------------------------------------------------------------------------------
             |             Linearized
        ckd2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -2.058052   .6309325    -3.26   0.047     -4.06596   -.0501428
       _cons |   1.015231   .4127449     2.46   0.091    -.2983078    2.328769
------------------------------------------------------------------------------

. estat effects

----------------------------------------------------------
             |             Linearized
        ckd2 |      Coef.   Std. Err.       Deff      Deft
-------------+--------------------------------------------
    rcontrol |  -2.058052   .6309325     4.61385   2.14799
       _cons |   1.015231   .4127449     3.11419   1.76471
----------------------------------------------------------

Does all that make sense?

> Another preferred option is to use panel methods such as -xtmixed- with the clusters specified as panels. Even
> if you don't have covariates (and in an RCT you will need to make a case for including them), these are often
> preferred.
This is preferred because ... ?

. xtmixed gfr_achg rcontrol

Mixed-effects REML regression                   Number of obs      =       159

                                                Wald chi2(1)       =      0.30
Log restricted-likelihood = -589.18999          Prob > chi2        =    0.5826

------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.378931    2.460998
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.125651    2.615013
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                sd(Residual) |   10.03998   .5630142      8.994974     11.2064
------------------------------------------------------------------------------
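
Note that the xtmixed command as run above has no random-effects equation, so it fits an ordinary linear regression: the only variance parameter reported is sd(Residual), and the coefficient SEs equal the unclustered ones. Specifying the clusters as panels would look something like the sketch below, which adds a random intercept for each practice. Given the negative intra-cluster correlation reported by clttest, the practice-level variance may well be estimated at or near zero, and with only four practices it will be poorly determined in any case.

```stata
* Sketch only: random intercept for each practice (rsiteid), fit by REML.
xtmixed gfr_achg rcontrol || rsiteid:, reml
```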


> More detail on your design might produce more detailed answers.
See above.
> Hope this helps,
> Jeph
It does help. Thanks!
Michael I. Lichter wrote:
> Hello, friends. I have a question about the analysis of data from cluster-randomized trials (CRTs). CRTs are experiments where subjects are randomly assigned to conditions (control, treatment) based on their group membership rather than being assigned individually as is usually the case in randomized controlled trials. In my study, the clusters are medical practices, so when a medical practice is assigned to a condition, all of the eligible patients therein are also assigned to the condition. CRTs should be analyzed using methods that take account of the clustering in the study design, of course.
>
> My question is this: For CRTs, is there any statistical reason for preferring the cluster() option on estimation commands (e.g., regress, logit) over the survey commands, or vice-versa? I've used both and the results are similar, but the survey commands estimate larger standard errors. If the answer is that they're both equally appropriate but produce different results because they use somewhat different methods of estimation, that's fine.