Christopher--
Yes, doomed to failure. Ha ha! Just kidding. Sort of. Without a
geographic identifier, you cannot apply the right correction for
clustering and allow for arbitrary intra-cluster correlation of
errors, but you can guess at the intra-cluster correlation of errors
and apply a design effect "inflator" from Kish (1965), given by
deff=[1+roh(b-1)]
where b is the number of individuals per cluster (the assumption of
equal cluster size seems not so important as long as there are no
outliers) and roh is the coefficient of intraclass correlation (ICC).
Then you multiply your SE by the square root of the deff. This will
get you close to the right answer, and is better than nothing (when I
say nothing I mean clustering at the household level, which is not
nothing, of course).
See page 162 and the surrounding pages of
Kish, Leslie (1965). Survey Sampling. New York: Wiley.
In practice, you might ask ESDS about the number of individuals per
postal code, and then guess the roh (ICC). Maybe those numbers are
100 and .02, in which case the deff is about 3 (1 + .02*99 = 2.98), and
you have to multiply your standard errors by about 1.73. Or maybe those
numbers are 10 and .01, in which
case you can probably ignore clustering at the postal code level
(multiply SE by 1.044).
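
In case it helps, here is a rough sketch of that arithmetic in Stata. The
values of b_guess and roh_guess are guesses to be replaced with your own,
the scalar names are made up for illustration, and the regression line
assumes you have already created the colon indicator described in your
message below.

```stata
* Guessed inputs (assumptions, not HSE figures): individuals per
* postal code and the intraclass correlation
scalar b_guess   = 10
scalar roh_guess = .01

* Kish (1965) design effect and the implied SE multiplier
scalar deff_guess = 1 + roh_guess*(b_guess - 1)
display "deff = " deff_guess "   SE multiplier = " sqrt(deff_guess)

* Inflate the household-clustered SE on colon after a regression
regress sdq_hyp colon, cluster(hserial)
display "adjusted SE on colon = " _se[colon]*sqrt(deff_guess)
```

With 10 and .01 this gives a deff of 1.09 and a multiplier of about 1.044.
Rerunning it over a range of plausible guesses shows how sensitive your
conclusions are to the unknown roh.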
On 9/26/06, Christopher W. Ryan <[email protected]> wrote:
So I received a reply from the keepers of the data, ESDS Government:
"Thanks for your query. Yes you have understand correctly that postal
code is used as the PSU. Unfortunately you won't find this or strata in
the HSE datasets because of concerns over confidentiality. This is
something that we are going to raise with ONS and other data providers
as it is definitely one of the shortfalls with the datasets so thank you
for raising the issue. I'm sorry I can't bring you any better news."
So knowing that the data are from a complex multistage sampling design,
but having no access to the psu information, what would be the best way
to proceed with analysis?
I am using Stata 8 and trying to investigate the association between
colon symptoms (28 in the HSE variables illsm1 through illsm6) and
various indicators of child behavior (HSE variables sdq*, for example).
For instance, I would generate a new variable colon=1 if any illsm*
==28, and zero otherwise. Then regress sdq_hyp (a hyperactivity scale)
on colon and several other control variables like age, household income,
educational attainment of parent, and so forth.
Is this doomed to failure without the sampling design variables?
Each selected household could contribute up to two children as
subjects. For households with 2 kids or fewer, all of them were
subjects. For households with 3 or more kids, 2 were selected. Would
simply using -regress- and clustering on hserial (household serial
number) be beneficial?
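
For concreteness, the analysis described above might be sketched in Stata
roughly as follows. The control variable names age, hhincome, and pareduc
are placeholders, not actual HSE variable names, and children with all
illsm* values missing end up coded 0 here, which may or may not be what is
wanted.

```stata
* Flag children reporting symptom code 28 in any of illsm1-illsm6
generate byte colon = 0
foreach v of varlist illsm1 illsm2 illsm3 illsm4 illsm5 illsm6 {
    replace colon = 1 if `v' == 28
}

* age, hhincome, and pareduc are placeholder names for the controls;
* cluster() allows errors to be correlated within a household
regress sdq_hyp colon age hhincome pareduc, cluster(hserial)
```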