Re: st: cluster() or svy? (analysis of cluster-randomized trials)
From: Steven Samuels <[email protected]>
To: [email protected]
Subject: Re: st: cluster() or svy? (analysis of cluster-randomized trials)
Date: Tue, 9 Sep 2008 17:23:57 -0400
Michael,
In your -svyset- statement you made a mistake unrelated to the downward bias of the cluster-robust SE: you must designate the treatment group as the stratum variable. That will make the degrees of freedom = 2 and lead to a nominal p = 0.052.
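For example, with -rsiteid- as the practice (PSU) identifier and -rcontrol- as the arm indicator, as in your output below, the declaration would look something like this:

. svyset rsiteid, strata(rcontrol)
. svy: regress gfr_achg rcontrol

With 4 PSUs split across 2 strata, the design df is 4 - 2 = 2.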
-Steve
On Sep 9, 2008, at 4:32 PM, Michael I. Lichter wrote:
Thanks to Austin and Jeph for responding. In reply to Jeph ...
I think there are good reasons to avoid both. You don't say what kinds of analyses you have, but see -ssc describe cltest- for some tools and a reference for analyzing cluster-randomized outcomes using adjustments to the standard chi-squared and t-tests.
Can you explain why to avoid both? Aren't they adjusting for the
same phenomenon--clustering of observations? I'll describe the
analyses, but that will take some background ...
This is a small trial of an intervention designed to promote
guideline-based diagnosis and treatment of patients with chronic
kidney disease (CKD). Four medical practices were selected and two
each were randomly assigned to control and intervention. (Yes, I
know that it is not recommended to do CRT with fewer than 5
clusters per arm.) Primary indicators include glomerular filtration
rate (GFR) and whether or not patients with substandard GFR were diagnosed during the trial period as having CKD. We predict stable
or rising GFR in intervention practices compared to falling GFR in
control practices, and higher rates of physician-diagnosed CKD in
intervention practices compared to control practices. The universe
of patients is those with substandard GFR levels prior to the
intervention.
For GFR, I was planning to regress pre/post absolute change in GFR
on a dummy for control vs. not. (I'd like to include covariates
like age and sex, but don't have the degrees of freedom). In
partial answer to Austin's question about differences in results
between cluster() and svy, and also to ask about a problem with
clttest, I've included output below for this regression (1)
unclustered, (2) with the cluster() option, (3) with the svy
command, and (4) with clttest -- which isn't a regression but does
essentially the same thing in this instance.
. reg gfr_achg rcontrol /* unclustered */
      Source |       SS       df       MS              Number of obs =     159
-------------+------------------------------           F(  1,   157) =    0.30
       Model |  30.4456806     1  30.4456806           Prob > F      =  0.5834
    Residual |  15825.7933   157  100.801231           R-squared     =  0.0019
-------------+------------------------------           Adj R-squared = -0.0044
       Total |   15856.239   158  100.355943           Root MSE      =   10.04

------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.405498    2.487565
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.147948     2.63731
------------------------------------------------------------------------------
. reg gfr_achg rcontrol, cluster(rsiteid) /* clustered */
Linear regression                                      Number of obs =     159
                                                       F(  1,     3) =   17.62
                                                       Prob > F      =  0.0247
                                                       R-squared     =  0.0019
Number of clusters (rsiteid) = 4                       Root MSE      =   10.04

------------------------------------------------------------------------------
             |               Robust
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2284233    -4.20   0.025    -1.685911   -.2320217
       _cons |  -.2553191    .223962    -1.14   0.337    -.9680661    .4574278
------------------------------------------------------------------------------
. svy: reg gfr_achg rcontrol /* survey */
(running regress on estimation sample)
Survey: Linear regression

Number of strata =       1                  Number of obs    =     159
Number of PSUs   =       4                  Population size  =     159
                                            Design df        =       3
                                            F(   1,      3)  =   17.74
                                            Prob > F         =  0.0245
                                            R-squared        =  0.0019

------------------------------------------------------------------------------
             |             Linearized
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2276993    -4.21   0.024    -1.683607   -.2343258
       _cons |  -.2553191   .2232521    -1.14   0.336     -.965807    .4551687
------------------------------------------------------------------------------
. estat effects
----------------------------------------------------------
             |             Linearized
    gfr_achg |      Coef.   Std. Err.       Deff      Deft
-------------+--------------------------------------------
    rcontrol |  -.9589666   .2276993     .012158   .110264
       _cons |  -.2553191   .2232521     .013712   .117098
----------------------------------------------------------
. clttest gfr_achg, by(rcontrol) cluster(rsiteid)   /* clustered t-test */

t-test adjusted for clustering
gfr_achg by rcontrol, clustered by rsiteid
------------------------------------------------------------------------
Intra-cluster correlation = -0.0267
------------------------------------------------------------------------
                N   Clusts      Mean        SE              95% CI
rcontrol=0     47        2   -0.2553    0.7924   [-10.3243,   9.8137]
rcontrol=1    112        2   -1.2143         .   [       .,        .]
------------------------------------------------------------------------
Combined      159        2   -0.9308         .   [       .,        .]
------------------------------------------------------------------------
Diff(0-1)     159        4    0.9590         .   [       .,        .]

Degrees freedom: 2

Ho: mean(0) - mean(1) = mean(diff) = 0

 Ha: mean(diff) < 0       Ha: mean(diff) ~= 0       Ha: mean(diff) > 0
     t =   2.1346             t =   2.1346              t =   2.1346
 P < t =   0.9168         P > |t| =   0.1664          P > t =   0.0832
Suggestions on why the t-test didn't work (it didn't calculate SE)
would be welcome--it worked fine for a t-test of differences in the
post-GFR itself.
BTW, you might have noticed that the SEs are *smaller* in the cluster/svy models than in the unclustered model. That's because the variation within clusters is much larger than the differences between them--you can also see this in the deff and deft being less than 1.0. Does this give me an excuse to treat the data as unclustered?
On the other hand, when I look at ckd2 (diagnosed with CKD) for
those not diagnosed before the start of the study (ckd1 == 0), I
get a substantial design effect:
. svy: logit ckd2 rcontrol if ckd1==0
(running logit on estimation sample)
Survey: Logistic regression

Number of strata =       1                  Number of obs    =     259
Number of PSUs   =       4                  Population size  =     259
                                            Design df        =       3
                                            F(   1,      3)  =   10.64
                                            Prob > F         =  0.0471

------------------------------------------------------------------------------
             |             Linearized
        ckd2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -2.058052   .6309325    -3.26   0.047     -4.06596   -.0501428
       _cons |   1.015231   .4127449     2.46   0.091    -.2983078    2.328769
------------------------------------------------------------------------------
. estat effects
----------------------------------------------------------
             |             Linearized
        ckd2 |      Coef.   Std. Err.       Deff      Deft
-------------+--------------------------------------------
    rcontrol |  -2.058052   .6309325     4.61385   2.14799
       _cons |   1.015231   .4127449     3.11419   1.76471
----------------------------------------------------------
Does all that make sense?
Another preferred option is to use panel methods such as -xtmixed- with the clusters specified as panels. Even if you don't have covariates (and in an RCT you will need to make a case for including them), these are often preferred.
This is preferred because ... ?
. xtmixed gfr_achg rcontrol
Mixed-effects REML regression                   Number of obs      =       159

                                                Wald chi2(1)       =      0.30
Log restricted-likelihood = -589.18999          Prob > chi2        =    0.5826

------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.378931    2.460998
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.125651    2.615013
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                sd(Residual) |   10.03998   .5630142      8.994974     11.2064
------------------------------------------------------------------------------
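(Note that the -xtmixed- run above specifies no random-effects equation, so it fits only a residual variance. "Clusters specified as panels" would presumably mean adding a random intercept for the practices, something like

. xtmixed gfr_achg rcontrol || rsiteid:

where -rsiteid- again identifies the four practices.)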
More detail on your design might produce more detailed answers.
See above.
Hope this helps,
Jeph
It does help. Thanks!
Michael I. Lichter wrote:
Hello, friends. I have a question about the analysis of data from
cluster-randomized trials (CRTs). CRTs are experiments where
subjects are randomly assigned to conditions (control, treatment)
based on their group membership rather than being assigned
individually as is usually the case in randomized controlled
trials. In my study, the clusters are medical practices, so when
a medical practice is assigned to a condition, all of the
eligible patients therein are also assigned to the condition.
CRTs should be analyzed using methods that take account of the
clustering in the study design, of course.
My question is this: For CRTs, is there any statistical reason
for preferring the cluster() option on estimation commands (e.g.,
regress, logit) over the survey commands, or vice-versa? I've
used both and the results are similar, but the survey commands
estimate larger standard errors. If the answer is that they're
both equally appropriate but produce different results because
they use somewhat different methods of estimation, that's fine.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*