Re: st: cluster and F test
--
Ángel:
On Jul 11, 2008, at 4:19 AM, Ángel Rodríguez Laso wrote:
Dear Steven,
From my readings I've understood that the design effect comprises all
loss of precision due to clustering and weighting. Once the sample
size is corrected by the design effect, what matters is the number of
observations. These are the results for the proportion of a variable
in a complex design survey:
In your example, the DEFT is <1, indicating that the cluster
sample is more precise than a SRS. This would happen, for example,
if in every cluster, the proportion of "si" is about 10%, the
population proportion. Essentially, the "between-cluster" SD would
be zero in the formula I previously presented. In such a case, the total
sample size matters, not the number of clusters.
In the general case, however, the between-cluster SD is not zero.
This would happen if the trait you were studying were unevenly
distributed among clusters. The most extreme case: if all subjects
in 127 clusters had "si", and all subjects in the remaining 1,139
clusters had "no", then the effective sample size would be the
number of clusters. In the absence of stratification contributions
to the design effect, the approximate value of DEFF would be "n",
where "n" is the average cluster size.
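A minimal Python sketch of these two extremes, assuming the standard
approximation DEFF = 1 + (n - 1)*rho for equal-sized clusters with no
stratification or weighting contributions, where rho is the intra-class
correlation. The function name is hypothetical, and the counts (12174
observations, 1266 PSUs) are taken from the survey output quoted in
this thread.

```python
def kish_deff(n_per_cluster, rho):
    """Approximate design effect for equal-sized clusters: 1 + (n - 1)*rho."""
    return 1.0 + (n_per_cluster - 1.0) * rho

n_obs, n_psu = 12174, 1266
n_bar = n_obs / n_psu  # average cluster size

# rho = 0: clusters are miniatures of the population; rho = 1: everyone in a
# cluster gives the same answer. The middle value is purely illustrative.
for rho in (0.0, 0.05, 1.0):
    deff = kish_deff(n_bar, rho)
    n_eff = n_obs / deff  # effective sample size
    print(f"rho = {rho:4.2f}  DEFF = {deff:6.3f}  effective n = {n_eff:8.1f}")
```

With rho = 1 the design effect equals the average cluster size and the
effective sample size collapses to the number of clusters; with rho = 0
the total sample size is what counts.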
What is surprising to me is that in regression in this context, only
the number of clusters counts and not the number of individuals per
cluster (or the total number of individuals), as has been said by
Austin. That amounts to saying that having 1000 observations per
cluster would yield the same precision as having 1.
You are misinterpreting Austin's statement (I could not find the one
you mean). Of course, the number of observations per cluster
matters, but only up to a point. The approximate formula for the
variance of a mean that I gave previously was:
var = [(s_b)^2]/m + [(s_w)^2]/(nm),
where m = number of clusters and n = number of observations per cluster.
You can see that increasing n does decrease the variance, but this
decrease affects only the 2nd term; the 1st term, [(s_b)^2]/m, falls
only when clusters are added. On the list we occasionally see
examples where investigators took a small number of clusters and a
huge sample size in some of them, and then were surprised at the big
standard errors. For more details, find the formulas for the design
effect and for choosing the sample size for clusters in one of the
texts I referred to.
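To make the formula concrete, here is a small Python sketch that
evaluates it for a fixed number of clusters and growing cluster size.
The SDs (s_b = 1, s_w = 2) and the sample layouts are hypothetical
values chosen only for illustration.

```python
def var_of_mean(s_b, s_w, m, n):
    """Approximate variance of the sample mean: s_b^2/m + s_w^2/(n*m)."""
    return s_b**2 / m + s_w**2 / (n * m)

s_b, s_w = 1.0, 2.0  # hypothetical between- and within-cluster SDs

# Fixed number of clusters, growing cluster size: the variance levels off
# at s_b^2/m because only the second term shrinks.
m = 20
for n in (1, 10, 100, 1000):
    print(f"m = {m:4d}  n = {n:5d}  var = {var_of_mean(s_b, s_w, m, n):.5f}")

# The same 2000 observations spread over ten times as many clusters.
print(f"m = {200:4d}  n = {10:5d}  var = {var_of_mean(s_b, s_w, 200, 10):.5f}")
```

Under these assumed SDs, 2000 observations in 200 clusters of 10 give
a much smaller variance than 2000 observations in 20 clusters of 100.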
(Aside: In your example DEFF = -5863. This is a number that should
be positive! According to the Stata manual, the value for DEFF is
valid only if original population weights are used. In your example
the weights are scaled to total the sample size, not the population
size, and this may have caused the wild value.)
-Steve
svyset psu [pweight=pesodef2007], strata(areasalud) fpc(secperarea)
pweight: pesodef2007
VCE: linearized
Strata 1: areasalud
SU 1: psu
FPC 1: secperarea
. svy:prop p45
(running proportion on estimation sample)
Survey: Proportion estimation
Number of strata = 11 Number of obs = 12174
Number of PSUs = 1266 Population size = 12172,5
Design df = 1255
--------------------------------------------------------------
| Linearized Binomial Wald
| Proportion Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
p45 |
sí | ,0994565 ,0023199 ,0949052 ,1040077
no | ,9005435 ,0023199 ,8959923 ,9050948
--------------------------------------------------------------
. estat effects
----------------------------------------------------------
| Linearized
| Proportion Std. Err. Deff Deft
-------------+--------------------------------------------
p45 |
sí | ,0994565 ,0023199 -5863 ,855246
no | ,9005435 ,0023199 -5863 ,855246
----------------------------------------------------------
Note: Weights must represent population totals for deff to be correct
when using an FPC; however, deft is invariant to the scale of weights.
end of do-file
So the standard error is calculated on the effective sample size
(16648; p(1-p)/(se*se)) which, if corrected by deft*deft, becomes
(16648*0.855246*0.855246) 12177, much closer to the number of
observations than to the number of clusters. That's the reason why I
commented that, for precision, the sample size is a very important
determinant. In fact, there is no disagreement between the two points
of view, because the total sample size is determined by the number of
clusters and the number of observations per cluster.
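The arithmetic in the paragraph above can be retraced with a short
Python sketch. The inputs are the rounded values printed in the Stata
output, so the results may differ slightly from the figures quoted in
the text; the variable names are only for illustration.

```python
p = 0.0994565    # estimated proportion of "sí" (decimal comma converted)
se = 0.0023199   # linearized standard error
deft = 0.855246  # deft reported by -estat effects-
n_obs, n_psu = 12174, 1266

# Size of a simple random sample that would give the same standard error.
n_eff = p * (1 - p) / se**2

# Rescaling by deft squared, as done in the text.
n_back = n_eff * deft**2

print(f"effective sample size p(1-p)/se^2: {n_eff:.0f}")
print(f"effective sample size * deft^2:    {n_back:.0f}")
print(f"observations: {n_obs}   PSUs: {n_psu}")
```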
What is surprising to me is that in regression in this context, only
the number of clusters counts and not the number of individuals per
cluster (or the total number of individuals), as has been said by
Austin. That amounts to saying that having 1000 observations per
cluster would yield the same precision as having 1.
Cheers,
Ángel
2008/7/8, Steven Samuels:
Angel, the primary determinant of precision is the number of clusters,
and degrees of freedom are based on these.
To compute the sample size needed in a cluster sample, you need to
estimate the number of clusters needed *and* the number of
observations per cluster.
Consider an extreme case: everybody in a cluster has the same value
of an outcome "Y", but the means differ between clusters. Here one
observation will completely represent the cluster and only the number
of clusters matters. At the other extreme, if each cluster is a
miniature of the original population and clusters are very similar,
then relatively few clusters are needed and more observations can be
taken per cluster.
In practice, the actual choice of clusters/observations per cluster
is made on the basis of the budget, on the relative costs of adding a
cluster and of adding an additional observation within a cluster, and
on the ratios of the SDs for the main outcomes between and within
clusters. As there are usually several outcomes, a compromise sample
size is chosen. See: Sharon Lohr, Sampling: Design and Analysis,
Duxbury, 1999, Chapter 5; WG Cochran, Sampling Techniques, Wiley,
1977; L Kish, Survey Sampling, Wiley, 1965. There are many internet
references.
Key concepts: the intra-class correlation, which measures how similar
observations in the same cluster are compared to observations in
different clusters; the "design effect", which shows how the standard
error of a complex cluster sample is inflated compared to a simple
random sample of the same number of observations. Joanne Garrett's
program -sampclus- (findit sampclus) requires the investigator to
input the correlation. It is most easily calculated by a variance
components analysis of similar data.
A *theoretical* nested model can make some concepts clearer (Lohr).
Suppose there are observations Y_ij = c + a_i + e_ij. There are m
random effects a_i from a distribution with between-cluster SD s_b
and, for each a_i, there are n e_ij's drawn from a distribution with
"within-cluster" SD s_w. The a's and e's are independent. The total
sample size is nm, and the variance of the sample mean is:
var = [(s_b)^2]/m + [(s_w)^2]/(nm).
You can see that, holding m fixed, increasing the number of
observations per cluster decreases only the 2nd term.
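The nested model can also be checked by simulation. The Python sketch
below is only an illustration under stated assumptions: it takes the
a_i and e_ij to be normally distributed (the model above does not
require this), and the values of c, s_b, s_w, m and n are hypothetical.
It compares the empirical variance of the sample mean over repeated
samples with the formula.

```python
import random
import statistics

def simulate_mean(c, s_b, s_w, m, n, rng):
    """Draw one sample from Y_ij = c + a_i + e_ij and return its mean."""
    total = 0.0
    for _ in range(m):
        a_i = rng.gauss(0.0, s_b)  # random effect shared by one cluster
        for _ in range(n):
            total += c + a_i + rng.gauss(0.0, s_w)  # within-cluster noise
    return total / (n * m)

rng = random.Random(12345)  # fixed seed so the run is reproducible
c, s_b, s_w, m, n = 10.0, 1.0, 2.0, 10, 5  # hypothetical values
reps = 5000

means = [simulate_mean(c, s_b, s_w, m, n, rng) for _ in range(reps)]
empirical = statistics.variance(means)
theoretical = s_b**2 / m + s_w**2 / (n * m)

print(f"empirical variance of the mean:   {empirical:.4f}")
print(f"theoretical s_b^2/m + s_w^2/(nm): {theoretical:.4f}")
```

The two numbers should agree up to simulation error; rerunning with a
larger n and the same m shows the empirical variance approaching
s_b^2/m rather than zero.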
The actual formulas for sampling from finite populations are more
complicated, but the same principles apply.
-Steve
On Jul 8, 2008, at 5:07 AM, Ángel Rodríguez Laso wrote:
Following the discussion, I don't understand very well how degrees of
freedom (number of clusters - number of strata) and the actual number
of observations are used in svy commands (which are related to
cluster regression). I say so because when I calculate the sample
size needed in a survey to get a proportion with a given confidence
level, the number I get is the number of observations and not the
number of degrees of freedom. So I assume that the number of
observations is what determines the standard error, and then I don't
know what degrees of freedom are used for.
Cheers,
Ángel Rodríguez
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/