Jeff;
First of all, thank you for such an exhaustive analysis of my query. I
think the problem is solved – I’ll explain further below! Thank you
also for stepping through the process with the detailed exposition and
the example do-file code – I could easily replicate it, and working
through it helped me understand the various issues. This was extremely
helpful and informative.
I had at least two problems:
1. the population size of N = 1; and
2. the sampling frame denominator.
I resolved the first issue by setting my st_wt variable to counts
rather than proportions. It is worth noting, though, that nothing
changed other than the reported Population size.
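For the record, something like the following builds count-based
standard weights – a sketch only, mirroring Jeff's egen approach and
using the variable names from my original posting:

```stata
* Count-based standard weights: total sampling weight within each
* age-sex stratum of the 2007 sample (nsmhwb == 1)
egen double st_wt = total(mhsfinwt) if nsmhwb == 1, by(st_agesex)

* Copy each stratum's 2007 total onto the 1997 observations
sort st_agesex nsmhwb
by st_agesex: replace st_wt = st_wt[_N]
```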
The second issue was resolved less directly.
First, I had (correctly) developed age-sex stratification weights
based on the entire sample aged 20-69 years inclusive. I believe this
is the right thing to do.
Second, the example I provided was based on an analysis of a subset of
the overall data – namely, those who had a disorder and consulted a
psychologist. I now believe that filtering the data in this way is not
a true test of age-sex standardization. So, to verify your
explanation, I re-ran the analysis without filtering the data. I got
the results you had hypothesized: (i) the proportions for 2007 did not
change; (ii) the age-sex standardized proportions for 1997 did change;
and (iii) the SEs changed for both 1997 & 2007 when I ‘directly
standardized’.
So, in summary, all is good and makes complete sense.
I’m left with one issue – whether I should or could declare the
age-sex strata and weights as poststratification specifications in
svyset. Admittedly, I’m not across the conceptual issues here and may
be naively going down the wrong path. I guess I can ‘test’ these
effects for prevalence estimates; and I guess too I have two options
for controlling for this in subsequent analyses:
1. declare the survey settings with poststratification factors – not
sure if this is the right thing to do; and/or
2. include age-sex strata in say any logistic regression models.
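For concreteness, the two options might look something like this – a
sketch only, where agegrp and sex are hypothetical names for my
age-band and sex variables:

```stata
* Option 1: declare poststratification in the survey settings
svyset [pweight=mhsfinwt], poststrata(st_agesex) postweight(st_wt) ///
    jkrweight(wpm*, multiplier(1)) vce(jackknife) mse
svy jackknife, nodots : proportion mhpsyco12, over(nsmhwb)

* Option 2: keep the original settings and model age and sex directly
svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1)) vce(jackknife) mse
svy jackknife, nodots : logistic mhpsyco12 i.nsmhwb i.agegrp i.sex
```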
Not sure – but regardless, I have made significant progress with your help.
Thank you again for the time and the effort to prepare such a
considered analysis and exposition of the query I raised.
Many thanks;
Philip
On Thu, Nov 26, 2009 at 4:11 AM, Jeff Pitblado, StataCorp LP
<[email protected]> wrote:
> Philip Burgess <[email protected]> is working with data from two
> surveys, and wants to use direct standardization to make comparisons
> between the two.
>
> Philip's original posting (corrected version) is included following my
> signature.
>
> I'm posting my conclusion first so that it doesn't get lost in my exposition
> of Philip's data analysis experiment.
>
> Conclusion
> ----------
>
> Poststratification does not line up with Philip's expectations, but direct
> standardization does.
>
> It appears that Philip needs to verify that he generated the correct
> standardizing weights.
>
> Exposition
> ----------
>
> Starting with the auto dataset, I built a simulated dataset to illustrate
> Philip's data analysis experiment and verify his expectations.
>
> Here are the relevant variables:
>
> pw - survey sampling weight variable
> foreign - indicator variable of interest
> group - survey group id: 0 old survey, 1 new survey
>
> . set seed 12345
> . sysuse auto
> . keep if !missing(rep78) & rep > 2
> . gen group = uniform() < .5
> . gen pw = 2/uniform()
>
> Let's generate a standardizing weight variable for group == 1 (new survey),
> we'll use 'rep78' to identify the standard strata.
>
> . egen repwt = total(pw) if group == 1, by(rep78)
>
> Notice that I summed the sampling weights within the standard strata, and only
> within the observations for the new survey.
>
> Now I apply these values to the observations for the old survey.
>
> . sort rep78 group
> . by rep78: replace repwt = repwt[_N]
>
> Now we can replicate Philip's direct standardization experiment:
>
> . svyset [pw=pw]
> . svy: proportion for, over(group)
> . svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
>
> The results from the above two estimation commands follow. Notice that in the
> results for group == 1, the estimated proportions are the same between these
> two commands; this is as Philip expected but didn't experience with his own
> experiment. The standard errors differ due to the fact that direct
> standardization produces different score values from the non-standardized
> analysis; for more details see 'The standardized mean estimator' subsection of
> the 'Methods and formulas' section of '[R] mean' (proportions are means of
> indicator variables).
>
> ***** BEGIN:
> . svy: proportion for, over(group)
> (running proportion on estimation sample)
>
> Survey: Proportion estimation
>
> Number of strata = 1 Number of obs = 59
> Number of PSUs = 59 Population size = 619.493
> Design df = 58
>
> Domestic: foreign = Domestic
> Foreign: foreign = Foreign
>
> 0: group = 0
> 1: group = 1
>
> --------------------------------------------------------------
> | Linearized
> Over | Proportion Std. Err. [95% Conf. Interval]
> -------------+------------------------------------------------
> Domestic |
> 0 | .7560045 .1155163 .5247735 .9872354
> 1 | .8139268 .0952956 .6231719 1.004682
> -------------+------------------------------------------------
> Foreign |
> 0 | .2439955 .1155163 .0127646 .4752265
> 1 | .1860732 .0952956 -.0046817 .3768281
> --------------------------------------------------------------
>
> . svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
> (running proportion on estimation sample)
>
> Survey: Proportion estimation
>
> Number of strata = 1 Number of obs = 59
> Number of PSUs = 59 Population size = 619.493
> N. of std strata = 3 Design df = 58
>
> Domestic: foreign = Domestic
> Foreign: foreign = Foreign
>
> 0: group = 0
> 1: group = 1
>
> --------------------------------------------------------------
> | Linearized
> Over | Proportion Std. Err. [95% Conf. Interval]
> -------------+------------------------------------------------
> Domestic |
> 0 | .8970116 .0396547 .8176342 .9763891
> 1 | .8139268 .0459222 .7220036 .90585
> -------------+------------------------------------------------
> Foreign |
> 0 | .1029884 .0396547 .0236109 .1823658
> 1 | .1860732 .0459222 .09415 .2779964
> --------------------------------------------------------------
> ***** END:
>
> Philip originally used poststratification in his experiment. For my
> experiment, this amounts to the following analysis:
>
> . svyset [pw=pw], poststrata(rep78) postweight(repwt)
> . svy: proportion for, over(group)
>
> Philip noticed that the estimated population size was different from the
> original non-poststratified results. He reported that the population size was
> estimated to be 1. I must admit that I initially assumed Philip meant 1
> million since his original analysis estimated a pop size of 1.39 million.
>
> For my experiment, the estimated population size is going to be the sum of the
> raw sampling weights (pw) for group == 1, since this is how we generated the
> standard weights.
>
> Poststratification is a weight adjustment procedure. As the following results
> illustrate, this is very different from direct standardization.
>
> ***** BEGIN:
> . sum pw if group, mean
>
> . di r(sum)
> 361.081
>
> . svy: proportion for, over(group)
> (running proportion on estimation sample)
>
> Survey: Proportion estimation
>
> Number of strata = 1 Number of obs = 59
> Number of PSUs = 59 Population size = 361.081
> N. of poststrata = 3 Design df = 58
>
> Domestic: foreign = Domestic
> Foreign: foreign = Foreign
>
> 0: group = 0
> 1: group = 1
>
> --------------------------------------------------------------
> | Linearized
> Over | Proportion Std. Err. [95% Conf. Interval]
> -------------+------------------------------------------------
> Domestic |
> 0 | .8497911 .0623371 .7250098 .9745724
> 1 | .8608167 .0603897 .7399335 .9817
> -------------+------------------------------------------------
> Foreign |
> 0 | .1502089 .0623371 .0254276 .2749902
> 1 | .1391833 .0603897 .0183 .2600665
> --------------------------------------------------------------
> ***** END:
>
> Based on this, and thinking about Philip's estimated population size of 1, I
> believe that Philip may have generated standardizing relative frequencies
> instead of standardizing counts. While Stata works properly with either
> counts or relative frequencies in the standard weight variable, Stata's
> poststratification features assume counts.
>
> Here are the contents of the do-file I used to produce the above results:
>
> ***** BEGIN:
> * Simulated data:
> * pw - survey sampling weight variable
> * foreign - indicator variable of interest
> * group - survey group id: 0 old survey, 1 new survey
>
> set seed 12345
> sysuse auto
> keep if !missing(rep78) & rep > 2
> gen group = uniform() < .5
> gen pw = 2/uniform()
>
> * Let's generate a standardizing weight variable for group == 1 (new survey):
> * rep78 - standard strata
> egen repwt = total(pw) if group, by(rep78)
>
> * Now apply the standard weight values to the old survey observations:
> sort rep78 group
> by rep78: replace repwt = repwt[_N]
>
> * Now we can replicate Philip's experiment:
> svyset [pw=pw]
> svy: proportion for, over(group)
> svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
>
> * The estimated population size is going to be the sum of the raw sampling
> * weights (pw) for group == 1:
> svyset [pw=pw], poststrata(rep78) postweight(repwt)
> sum pw if group, mean
> di r(sum)
> svy: proportion for, over(group)
>
> ***** END:
>
> --Jeff
> [email protected]
>
>> I have two surveys, 1997 & 2007, with complex survey designs, each
>> available for analysis with jackknife replicate weights. The surveys
>> were more or less equivalent in their design: nationally
>> representative random (independent) samples (without replacement).
>>
>> A possible issue is that the 1997 survey included persons aged 18
>> years or older (no upper limit); the 2007 survey included persons aged
>> 16 – 85 years inclusive. As such, there could be issues about
>> combining the two surveys given that the replicate weights are
>> calibrated to different population structures – not sure there is any
>> way around this if this is an explanation.
>>
>> I want to age-sex standardize the 1997 data to the 2007 population
>> structure. To do so, I first limited the two samples to respondents
>> aged 20-69 years inclusive – to get like-with-like comparisons. I then
>> created a new variable indicating the age-sex strata (10 5-year age
>> bands x 2 sexes = 20 strata – variable name, st_agesex). I then
>> estimated the 2007 population size for each of these 20 age-sex strata
>> (variable name, st_wt).
>>
>> I do several runs through the data.
>>
>> The first specifies the complex survey design:
>>
>> . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1))
>> vce(jackknife) mse
>>
>> Stata output reports: Number of strata = 1; Population size = 1.39
>> million. All this makes sense.
>>
>> I then estimate proportions who consulted a psychologist for mental
>> health problems in the last 12-months (mhpsyco12: code 0/1) over the
>> two surveys (nsmhwb, 0 = 1997; 1 = 2007)
>>
>> . svy jackknife, nodots : proportion mhpsyco12, over(nsmhwb)
>>
>> These give estimates for the unadjusted populations:
>>
>> 1997 - 17.0% (SE 1.3%);
>> 2007 – 37.3% (SE 2.5%).
>>
>> All good so far.
>>
>> The second pass through the data declares the complex survey design
>> with poststratification specification strata and weights:
>>
>> . quietly svyset [pweight=mhsfinwt], poststrata(st_agesex) postweight(st_wt) ///
>> jkrweight(wpm*, multiplier(1)) vce(jackknife) mse
>>
>> Stata output reports Number of strata = 1 and N. of poststrata = 20,
>> and also a Population size of 1. I don’t understand the Population
>> size figure – why isn’t it 1.39 million as above?
>>
>> I then estimate proportions who consulted a psychologist for mental
>> health problems in the last 12-months adjusted for the age-sex stratum
>> factors
>>
>> . svy jackknife, nodots : proportion mhpsyco12, over(nsmhwb)
>>
>> These give estimates for the ‘adjusted’ age-sex standardized populations:
>>
>> 1997 - 15.7% (SE 1.4%);
>> 2007 – 37.1% (SE 2.6%).
>>
>> I expected the 1997 estimate to be reduced given age-sex adjustment –
>> this is the case. But I do not understand why the 2007 ‘adjusted’
>> estimates vary at all from the 2007 unadjusted estimates.
>>
>> Finally, to try and unravel this matter, I resorted to the original
>> complex survey design declaration:
>>
>> . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1))
>> vce(jackknife) mse
>>
>> Stata output reports Number of strata = 1; N. of std strata = 20; and
>> a Population size of 1.39 million – all of these make sense.
>>
>> I then tried to ‘directly standardize’ the proportions:
>>
>> . svy jackknife, nodots : proportion mhpsyco12, stdize(st_agesex)
>> stdweight(st_wt) over(nsmhwb)
>>
>> These give estimates for the ‘adjusted’ age-sex standardized populations:
>>
>> 1997 - 15.6% (SE 1.4%);
>> 2007 – 37.8% (SE 3.0%).
>>
>> So, I’m confused. I understand why the 1997 estimates vary given
>> age-sex adjustment (although I am a bit confused about why the results
>> differ between poststratification and direct standardization); I have
>> more trouble understanding the varying estimates for 2007.
>>
>> I’m struggling to understand all of this and welcome any ideas! It’s
>> likely I do not properly understand the poststratification process.
>>
>> I’m using Stata 11.0, born 21 October 2009.
>>
>> Any thoughts or ideas would be most gratefully received!
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>