Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: How to svyset when strata are used in some groups and not others
From: Stas Kolenikov <[email protected]>
To: [email protected]
Subject: Re: st: How to svyset when strata are used in some groups and not others
Date: Mon, 5 Jul 2010 10:54:11 -0500
The biggest problem you have is that of the units and populations of
sampling and analysis.
What you told Stata was: "I have a population of patients admitted to
these hospitals in the given period of time".
You really did sample close to 100% of the population, which is the
patients in the hospitals in the first quarter of 2010, say. When you
talk about this population specifically, you can say something like,
"The average age in the [most common type of] hospitals is 45.3 with
the standard error due to sampling equal to 1.7, while the average age
in the [other three types of hospitals] is 48.7, 51.3 and 50.6, with
the standard error due to sampling equal to zero." When you talk about
finite populations, it is kinda naive to expect that their means will
be exactly the same, so H_0: means are the same is not a very sensible
one.
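As a rough illustration of why the standard error drops to zero when every unit is taken, here is a minimal Python sketch of the textbook simple-random-sampling formula with fpc = 1 - n/N. The function name and the numbers are made up for illustration, and the real design (stratified, PPS within strata) is more complicated than plain SRS.

```python
import math
import statistics


def srs_standard_error(sample, population_size):
    """Standard error of a sample mean under simple random sampling
    without replacement, using fpc = 1 - n/N."""
    n = len(sample)
    s2 = statistics.variance(sample)      # sample variance, n - 1 denominator
    fpc = 1.0 - n / population_size       # finite population correction
    return math.sqrt(fpc * s2 / n)


# Hypothetical unit-level mean ages (not from the thread).
unit_means = [48.1, 52.3, 50.9, 47.6]

se_census = srs_standard_error(unit_means, population_size=4)           # n = N
se_partial = srs_standard_error(unit_means, population_size=40)         # n/N = 0.1
se_infinite = srs_standard_error(unit_means, population_size=math.inf)  # fpc = 1

print("all units taken (n = N): ", se_census)
print("n/N = 0.1:               ", se_partial)
print("infinite population:     ", se_infinite)
```

With n = N the correction is exactly zero, so the reported sampling error vanishes regardless of how variable the units are.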
But what you, most likely, want to do is to generalize your findings
somehow to all potential patients in these hospitals, or to a longer
period of time. You can do this in two ways.
First, you can try to stay within the formal sampling paradigm, and
say, "I sampled d days out of one year" (which is what you indicate in
the end of your email). This will be the second stage of your sampling
procedure, with the first stage being sampling the hospitals (and
obtaining fpc = 0 in that stage). If you list all the patients in the
hospital on these days, without sampling them within the hospital
(your writing is unclear about this: you said that you sampled them,
but did not describe if that was random sampling, or what), then your
weight should be 365.2422/d. FPCs are only applicable to simple random
sampling, where they reflect how joint probabilities of selection
necessary to compute unbiased variance estimates simplify. Your
design, however, is rather of systematic sampling (which is a special
case of cluster sampling): you have d consecutive days, which can
hardly be thought of as a simple random sample of days in the year.
This leads to another estimation disaster: your variances are not
estimable, since, from a survey statistics perspective, you have a
sample of size 1 cluster (i.e., only 1 d.f.).
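To see why a single cluster leaves nothing to estimate the variance from, consider the usual between-cluster (with-replacement) variance estimator for an estimated total within one stratum. This is a sketch under the assumption that this standard estimator is the one in play; the symbols are introduced here only for illustration: $n$ is the number of sampled clusters, $\hat y_i$ is the weighted total for cluster $i$, and $\hat Y = \sum_i \hat y_i$.

```latex
v(\hat Y) \;=\; \frac{n}{n-1} \sum_{i=1}^{n} \left( \hat y_i - \frac{\hat Y}{n} \right)^{2}
```

With $n = 1$ the sum is zero (the only cluster total equals its own mean) and the factor $n/(n-1)$ divides by zero, so the estimator is undefined rather than merely large.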
This leaves you, I believe, with one option remaining, and that is to
talk about a hypothetical super-population of all patients who could
have been admitted to these hospitals in an unspecified period of
time. You have not observed that sort of population, so you have to
make a leap of faith and say, "I believe that these d days are
representative of what's going on in these hospitals, in general".
That's a more natural mode to be in when you talk about, say, logistic
regression, and that's the reason Steve S asked about what you want to
do with the data. So... you are now saying that there is a data
generating process that (i) created patients, (ii) allocated them to
hospitals. I would still tend to think that you want to treat
hospitals as fixed, though. Then when you conduct your inference with
respect to that process (rather than to a specific finite population),
you have to say that your population is potentially infinite, and you
took a sample from this population. This makes fpc's equal to 1: fpc =
1 - n/N = 1 - n/infinity = 1. (Why did you have the square root in
your fpc formula above?) The weights still need to be modified, as it
is still more likely to find a patient in the hospital if you've
observed this hospital for a longer period of time. So 365/d is still
a good weight to use... or 1000/d can be used if your projected period
to which you want to generalize is 1000 days. It does not matter in
the end when you analyze the means; the scale of weights is only
important when you estimate the total (i.e., the total number of
patients admitted to each type of the hospitals, or the total costs of
care), in which case your weights should be linked to the specific
period of time over which you calculate your admissions or costs.
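The point about the scale of the weights can be checked with a few lines of arithmetic. The sketch below uses made-up ages and observation lengths (nothing here comes from the actual data). It builds weights of the form period/d for a 365-day and a 1000-day projection period, then compares the weighted mean with the weighted total.

```python
# Hypothetical patients: age, and the number of days d for which the
# patient's hospital was observed (all values invented for illustration).
ages = [34.0, 51.0, 47.0, 62.0, 29.0]
days_observed = [10, 10, 10, 30, 30]


def make_weights(period_days, d_values):
    """Weight = length of the projection period / days observed."""
    return [period_days / d for d in d_values]


def weighted_total(values, w):
    return sum(v * wi for v, wi in zip(values, w))


def weighted_mean(values, w):
    return weighted_total(values, w) / sum(w)


w365 = make_weights(365, days_observed)
w1000 = make_weights(1000, days_observed)

# The mean is a ratio, so a common scale factor in the weights cancels.
mean_365 = weighted_mean(ages, w365)
mean_1000 = weighted_mean(ages, w1000)

# Estimated admissions = weighted total of a constant 1 per patient;
# this scales directly with the projection period.
ones = [1.0] * len(ages)
admissions_365 = weighted_total(ones, w365)
admissions_1000 = weighted_total(ones, w1000)

print("mean age (365-day weights): ", mean_365)
print("mean age (1000-day weights):", mean_1000)
print("admissions over 365 days:   ", admissions_365)
print("admissions over 1000 days:  ", admissions_1000)
```

The two means agree, while the estimated admissions differ by the factor 1000/365, which is why the projection period only has to be pinned down when totals are reported.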
On Mon, Jul 5, 2010 at 10:17 AM, Louise Linsell
<[email protected]> wrote:
> This is the complete design for the partially stratified dataset:
>
> There are 4 types of hospital and (for example) we are testing the hypothesis that mean age is equal across hospital type.
>
> For the first type of hospital, we divided the 180 national units into 6 strata (North/South x large/medium/small size) and selected 37 units (with the probability of selection proportional to size within strata).
>
> For the other 3 types of hospital we selected all national units.
>
> We then sampled patients for d consecutive days, where d varied by unit.
>
>
> The commands we have used so far are:
>
> svyset hospid [pweight=weight], strata(strata) fpc(strata_frac)
> svy: mean age, over(hosptype)
>
> Where:
>
> hospid = hospital identifier 1...435
> weight = probability sampling weight (number of days recruited in unit/number of days recruited in units of same hospital type)
> strata = strata number 1...9 (1-6 for strata within 1st hospital type,7 for 2nd hospital type, 8 for 3rd hospital type and 9 for 4th hospital type)
> strata_frac = n/N - number of units selected in stratum/total number of units in stratum (=1 for last 3 types of hospital)
> age = patient age in years
> hosptype = type of hospital 1...4
>
> When this model is fitted we get zero estimates for the standard errors in the last 3 types of hospital.
> I think this is because strata_frac=1 for these hospitals, so the model thinks we have sampled the whole population,
> when in fact we have just sampled a number of consecutive days. I was thinking about specifying a second level of
> sampling - number of days sampled out of one whole year and setting fpc's for the secondary sampling units (days).
>
> LL
>
>
>
>
>
>>>> Steve Samuels <[email protected]> 05/07/2010 12:42 >>>
> Before we try to answer your question, please now tell us:
>
> 1. the complete design, including subsequent stages of sampling
> 2. the purposes of the analyses--descriptive? estimating regression
> coefficients? testing hypotheses?
>
> What -svyset- commands have you tried to issue so far?
>
> Steve
>
>
>
> On Mon, Jul 5, 2010 at 5:06 AM, Louise Linsell
> <[email protected]> wrote:
>> Thank you for suggestions. We have already tried defining 9 strata; 6 for the common type of hospital, for which we used stratified random sampling with 6 strata, and 1 stratum each for the other 3 types of hospital, for which we took all units.
>>
>> However, in the model we had to specify a finite population correction (FPC=sqrt(1-n/N)) as we sampled 28 out of 87 units for the most common type of hospital.
>>
>> Because we sampled ALL the units from the other 3 types of hospital we had to set the FPC to zero since n=N (which is specified as 1 in Stata as it requires you to specify n/N). This means that there are no variance estimates when we summarise any outcomes in the 3 less common types of hospital, because it thinks we have sampled the whole population within these hospitals (when in fact we took a consecutive number of patients over a period of 3 months).
>>
>> LL
>>
>>>>> Stas Kolenikov <[email protected]> 02/07/2010 20:36 >>>
>> If Louise sampled other 3 types lumping them together, then Steve's
>> recommendation is appropriate. If sampling was performed within each
>> of those remaining types, then the strata variable will have 6 (strata
>> in the most common type of hospitals) + 3 (other types of hospitals) =
>> 9 levels.
>>
>> On Fri, Jul 2, 2010 at 11:18 AM, Steve Samuels <[email protected]> wrote:
>>> Louise-- create a stratum variable with 7 values: 1-6 for the
>>> hospitals of the first type, and 7 for the other three types, and use
>>> that in the strata() option of -svyset-
>>>
>>> Steve
>>>
>>> On Fri, Jul 2, 2010 at 12:00 PM, Louise Linsell
>>> <[email protected]> wrote:
>>>> I have a dataset with 4 different types of hospital, and would like to compare binary outcomes between them using logistic regression. However for the first type (the most common), hospitals were divided into 6 strata (based on size and SES) and a random sample was taken from each strata. For the other 3 types of hospital we sampled all hospitals. My question is, how to use the svyset command when a different sampling strategy was used in one group?
>>>>
>>>> LL
>>>>
>>>>
>>>>
>>>>
>>>> *
>>>> * For searches and help try:
>>>> * http://www.stata.com/help.cgi?search
>>>> * http://www.stata.com/support/statalist/faq
>>>> * http://www.ats.ucla.edu/stat/stata/
>>>>
>>>
>>>
>>>
>>> --
>>> Steven Samuels
>>> [email protected]
>>> 18 Cantine's Island
>>> Saugerties NY 12477
>>> USA
>>> Voice: 845-246-0774
>>> Fax: 206-202-4783
>>>
>>
>>
>>
>> --
>> Stas Kolenikov, also found at http://stas.kolenikov.name
>> Small print: I use this email account for mailing lists only.
--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.