Jeff;
First of all, thank you for such an exhaustive analysis of my query. I
think the problem is solved – I’ll explain further below! Thank you
also for stepping through the process with the detailed exposition and
the example do-file code – I could easily replicate it, and working
through it helped me understand the various issues. This was extremely
helpful and informative.
I had at least two problems:
1. the population size of N = 1; and
2. the sampling frame denominator.
I resolved the first issue by setting my st_wt variable to counts
rather than proportions. It is worth noting, though, that nothing
changed other than the reported Population size.
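For the record, something like the following builds count-based
standard weights – a sketch only, mirroring Jeff's egen approach and
using the variable names from my original posting:

```stata
* Count-based standard weights: total sampling weight within each
* age-sex stratum of the 2007 sample (nsmhwb == 1)
egen double st_wt = total(mhsfinwt) if nsmhwb == 1, by(st_agesex)

* Copy each stratum's 2007 total onto the 1997 observations
sort st_agesex nsmhwb
by st_agesex: replace st_wt = st_wt[_N]
```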
The second issue was resolved less directly.
First, I had (correctly) developed age-sex stratification weights
based on the entire sample aged 20-69 years inclusive. I believe this
is the right thing to do.
Second, the example I provided was based on an analysis of a subset of
the overall data – namely, those who had a disorder and consulted a
psychologist. I now believe that filtering the data in this way is not
a true test of age-sex standardization. So, to verify your
explanation, I re-ran the analysis without filtering the data. I got
the results you had hypothesized: (i) the proportions for 2007 did not
change; (ii) the age-sex standardized proportions for 1997 did change;
and (iii) the SEs changed for both 1997 & 2007 when I ‘directly
standardized’.
So, in summary, all is good and makes complete sense.
I’m left with one issue – whether I should or could declare the
age-sex strata and weights as poststratification specifications in
svyset. Admittedly, I’m not across the conceptual issues here and may
be naively going down the wrong path. I guess I can ‘test’ these
effects for prevalence estimates; and I guess too I have two options
for controlling for this in subsequent analyses:
1. declare the survey settings with poststratification factors – not
sure if this is the right thing to do; and/or
2. include age-sex strata in say any logistic regression models.
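For concreteness, the two options might look something like this – a
sketch only, where agegrp and sex are hypothetical names for my
age-band and sex variables:

```stata
* Option 1: declare poststratification in the survey settings
svyset [pweight=mhsfinwt], poststrata(st_agesex) postweight(st_wt) ///
    jkrweight(wpm*, multiplier(1)) vce(jackknife) mse
svy jackknife, nodots : proportion mhpsyco12, over(nsmhwb)

* Option 2: keep the original settings and model age and sex directly
svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1)) vce(jackknife) mse
svy jackknife, nodots : logistic mhpsyco12 i.nsmhwb i.agegrp i.sex
```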
Not sure – but regardless, I have made significant progress with your help.
Thank you again for the time and the effort to prepare such a
considered analysis and exposition of the query I raised.
Many thanks;
Philip
On Thu, Nov 26, 2009 at 4:11 AM, Jeff Pitblado, StataCorp LP
<[email protected]> wrote:
> Philip Burgess <[email protected]> is working with data from two
> surveys, and wants to use direct standardization to make comparisons
> between the two.
>
> Philip's original posting (corrected version) is included following my
> signature.
>
> I'm posting my conclusion first so that it doesn't get lost in my exposition
> of Philip's data analysis experiment.
>
> Conclusion
> ----------
>
> Poststratification does not line up with Philip's expectations, but direct
> standardization does.
>
> It appears that Philip needs to verify that he generated the correct
> standardizing weights.
>
> Exposition
> ----------
>
> Starting with the auto dataset, I built a simulated dataset to illustrate
> Philip's data analysis experiment and verify his expectations.
>
> Here are the relevant variables:
>
> pw - survey sampling weight variable
> foreign - indicator variable of interest
> group - survey group id: 0 old survey, 1 new survey
>
> . set seed 12345
> . sysuse auto
> . keep if !missing(rep78) & rep > 2
> . gen group = uniform() < .5
> . gen pw = 2/uniform()
>
> Let's generate a standardizing weight variable for group == 1 (new survey),
> we'll use 'rep78' to identify the standard strata.
>
> . egen repwt = total(pw) if group == 1, by(rep78)
>
> Notice that I summed the sampling weights within the standard strata, and only
> within the observations for the new survey.
>
> Now I apply these values to the observations for the old survey.
>
> . sort rep78 group
> . by rep78: replace repwt = repwt[_N]
>
> Now we can replicate Philip's direct standardization experiment:
>
> . svyset [pw=pw]
> . svy: proportion for, over(group)
> . svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
>
> The results from the above two estimation commands follow. Notice that in the
> results for group == 1, the estimated proportions are the same between these
> two commands; this is as Philip expected but didn't experience with his own
> experiment. The standard errors differ due to the fact that direct
> standardization produces different score values from the non-standardized
> analysis; for more details see 'The standardized mean estimator' subsection of
> the 'Methods and formulas' section of '[R] mean' (proportions are means of
> indicator variables).
>
> ***** BEGIN:
> . svy: proportion for, over(group)
> (running proportion on estimation sample)
>
> Survey: Proportion estimation
>
> Number of strata = 1 Number of obs = 59
> Number of PSUs = 59 Population size = 619.493
> Design df = 58
>
> Domestic: foreign = Domestic
> Foreign: foreign = Foreign
>
> 0: group = 0
> 1: group = 1
>
> --------------------------------------------------------------
> | Linearized
> Over | Proportion Std. Err. [95% Conf. Interval]
> -------------+------------------------------------------------
> Domestic |
> 0 | .7560045 .1155163 .5247735 .9872354
> 1 | .8139268 .0952956 .6231719 1.004682
> -------------+------------------------------------------------
> Foreign |
> 0 | .2439955 .1155163 .0127646 .4752265
> 1 | .1860732 .0952956 -.0046817 .3768281
> --------------------------------------------------------------
>
> . svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
> (running proportion on estimation sample)
>
> Survey: Proportion estimation
>
> Number of strata = 1 Number of obs = 59
> Number of PSUs = 59 Population size = 619.493
> N. of std strata = 3 Design df = 58
>
> Domestic: foreign = Domestic
> Foreign: foreign = Foreign
>
> 0: group = 0
> 1: group = 1
>
> --------------------------------------------------------------
> | Linearized
> Over | Proportion Std. Err. [95% Conf. Interval]
> -------------+------------------------------------------------
> Domestic |
> 0 | .8970116 .0396547 .8176342 .9763891
> 1 | .8139268 .0459222 .7220036 .90585
> -------------+------------------------------------------------
> Foreign |
> 0 | .1029884 .0396547 .0236109 .1823658
> 1 | .1860732 .0459222 .09415 .2779964
> --------------------------------------------------------------
> ***** END:
>
> Philip originally used poststratification in his experiment. For my
> experiment, this amounts to the following analysis:
>
> . svyset [pw=pw], poststrata(rep78) postweight(repwt)
> . svy: proportion for, over(group)
>
> Philip noticed that the estimated population size was different from the
> original non-poststratified results. He reported that the population size was
> estimated to be 1. I must admit that I initially assumed Philip meant 1
> million since his original analysis estimated a pop size of 1.39 million.
>
> For my experiment, the estimated population size is going to be the sum of the
> raw sampling weights (pw) for group == 1, since this is how we generated the
> standard weights.
>
> Poststratification is a weight adjustment procedure. As the following results
> illustrate, this is very different from direct standardization.
>
> ***** BEGIN:
> . sum pw if group, mean
>
> . di r(sum)
> 361.081
>
> . svy: proportion for, over(group)
> (running proportion on estimation sample)
>
> Survey: Proportion estimation
>
> Number of strata = 1 Number of obs = 59
> Number of PSUs = 59 Population size = 361.081
> N. of poststrata = 3 Design df = 58
>
> Domestic: foreign = Domestic
> Foreign: foreign = Foreign
>
> 0: group = 0
> 1: group = 1
>
> --------------------------------------------------------------
> | Linearized
> Over | Proportion Std. Err. [95% Conf. Interval]
> -------------+------------------------------------------------
> Domestic |
> 0 | .8497911 .0623371 .7250098 .9745724
> 1 | .8608167 .0603897 .7399335 .9817
> -------------+------------------------------------------------
> Foreign |
> 0 | .1502089 .0623371 .0254276 .2749902
> 1 | .1391833 .0603897 .0183 .2600665
> --------------------------------------------------------------
> ***** END:
>
> Based on this, and thinking about Philip's estimated population size of 1, I
> believe that Philip may have generated standardizing relative frequencies
> instead of standardizing counts. While Stata works properly with either
> counts or relative frequencies in the standard weight variable, Stata's
> poststratification features assume counts.
>
> Here are the contents of the do-file I used to produce the above results:
>
> ***** BEGIN:
> * Simulated data:
> * pw - survey sampling weight variable
> * foreign - indicator variable of interest
> * group - survey group id: 0 old survey, 1 new survey
>
> set seed 12345
> sysuse auto
> keep if !missing(rep78) & rep > 2
> gen group = uniform() < .5
> gen pw = 2/uniform()
>
> * Let's generate a standardizing weight variable for group == 1 (new survey):
> * rep78 - standard strata
> egen repwt = total(pw) if group, by(rep78)
>
> * Now apply the standard weight values to the old survey observations:
> sort rep78 group
> by rep78: replace repwt = repwt[_N]
>
> * Now we can replicate Philip's experiment:
> svyset [pw=pw]
> svy: proportion for, over(group)
> svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
>
> * The estimated population size is going to be the sum of the raw sampling
> * weights (pw) for group == 1:
> svyset [pw=pw], poststrata(rep78) postweight(repwt)
> sum pw if group, mean
> di r(sum)
> svy: proportion for, over(group)
>
> ***** END:
>
> --Jeff
> [email protected]
>
>> I have two surveys, 1997 & 2007, with complex survey designs, each
>> available for analysis with jackknife replicate weights. The surveys
>> were more or less equivalent in their design: nationally
>> representative random (independent) samples (without replacement).
>>
>> A possible issue is that the 1997 survey included persons aged 18
>> years or older (no upper limit); the 2007 survey included persons aged
>> 16 – 85 years inclusive. As such, there could be issues about
>> combining the two surveys given that the replicate weights are
>> calibrated to different population structures – not sure there is any
>> way around this if this is an explanation.
>>
>> I want to age-sex standardize the 1997 data to the 2007 population
>> structure. To do so, I first limited the two samples to respondents
>> aged 20-69 years inclusive – to get like-with-like comparisons. I then
>> created a new variable indicating the age-sex strata (10 5-year age
>> bands x 2 sexes = 20 strata – variable name, st_agesex). I then
>> estimated the 2007 population size for each of these 20 age-sex strata
>> (variable name, st_wt).
>>
>> I do several runs through the data.
>>
>> The first specifies the complex survey design:
>>
>> . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1))
>> vce(jackknife) mse
>>
>> Stata output reports: Number of strata = 1; Population size = 1.39
>> million. All this makes sense.
>>
>> I then estimate proportions who consulted a psychologist for mental
>> health problems in the last 12-months (mhpsyco12: code 0/1) over the
>> two surveys (nsmhwb, 0 = 1997; 1 = 2007)
>>
>> . svy jackknife, nodots : proportion mhpsyco12, over(nsmhwb)
>>
>> These give estimates for the unadjusted populations:
>>
>> 1997 - 17.0% (SE 1.3%);
>> 2007 – 37.3% (SE 2.5%).
>>
>> All good so far.
>>
>> The second pass through the data declares the complex survey design
>> with poststratification specification strata and weights:
>>
>> . quietly svyset [pweight=mhsfinwt], poststrata(st_agesex) postweight(st_wt) ///
>> jkrweight(wpm*, multiplier(1)) vce(jackknife) mse
>>
>> Stata output reports Number of strata = 1 and N. of poststrata = 20,
>> and also a Population size of 1. I don’t understand the Population
>> size figure – why isn’t it 1.39 million as above?
>>
>> I then estimate proportions who consulted a psychologist for mental
>> health problems in the last 12-months adjusted for the age-sex stratum
>> factors
>>
>> . svy jackknife, nodots : proportion mhpsyco12, over(nsmhwb)
>>
>> These give estimates for the ‘adjusted’ age-sex standardized populations:
>>
>> 1997 - 15.7% (SE 1.4%);
>> 2007 – 37.1% (SE 2.6%).
>>
>> I expected the 1997 estimate to be reduced given age-sex adjustment –
>> this is the case. But I do not understand why the 2007 ‘adjusted’
>> estimates vary at all from the 2007 unadjusted estimates.
>>
>> Finally, to try and unravel this matter, I resorted to the original
>> complex survey design declaration:
>>
>> . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1))
>> vce(jackknife) mse
>>
>> Stata output reports Number of strata = 1; N. of std strata = 20; and
>> a Population size of 1.39 million – all of these make sense.
>>
>> I then tried to ‘directly standardize’ the proportions:
>>
>> . svy jackknife, nodots : proportion mhpsyco12, stdize(st_agesex)
>> stdweight(st_wt) over(nsmhwb)
>>
>> These give estimates for the ‘adjusted’ age-sex standardized populations:
>>
>> 1997 - 15.6% (SE 1.4%);
>> 2007 – 37.8% (SE 3.0%).
>>
>> So, I’m confused. I understand why the 1997 estimates vary given
>> age-sex adjustment (although I am a bit confused about why the results
>> differ between poststratification and direct standardization); I have
>> more trouble understanding the varying estimates for 2007.
>>
>> I’m struggling to understand all of this and welcome any ideas! It’s
>> likely I do not properly understand the poststratification process.
>>
>> I’m using Stata 11.0, born 21 October 2009.
>>
>> Any thoughts or ideas would be most gratefully received!
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>