Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Poststratification weighting, subpop, and missing values
From
Steve Samuels <[email protected]>
To
[email protected]
Subject
Re: st: Poststratification weighting, subpop, and missing values
Date
Thu, 27 Sep 2012 05:17:33 -0400
Ricky Ubee:
You saw an apparently paradoxical phenomenon: when you used a subpop()
option to exclude observations with missing values of your analysis variable,
the weighted population count and the number of observations reported by -svy: total-
increased, and the standard error changed.
This phenomenon is actually proper behavior. It has nothing to do
with post-stratification. It has more to do with the difference between
using an -if- qualifier and a subpop() option to subset analyses. Here is a
plain example.
. ***********CODE STARTS***************
. input y

             y
  1. .
  2. 1
  3. 3
  4. 5
  5. end

. svyset _n
  [results omitted]

. svy: total y    // (1) Ignore missing y

Number of strata =       1          Number of obs    =       3
Number of PSUs   =       3          Population size  =       3
                                    Design df        =       2

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           y |          9   3.464102     -5.904826    23.90483
--------------------------------------------------------------

. svy: total y if !missing(y)    // (2) -if- expression

Number of strata =       1          Number of obs    =       3
Number of PSUs   =       3          Population size  =       3
                                    Design df        =       2

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           y |          9   3.464102     -5.904826    23.90483
--------------------------------------------------------------

. svy, subpop(if !missing(y)): total y    // (3)

Number of strata =       1          Number of obs    =       4
Number of PSUs   =       4          Population size  =       4
                                    Subpop. no. obs  =       3
                                    Subpop. size     =       3
                                    Design df        =       3

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           y |          9   4.434712     -5.113231    23.11323
--------------------------------------------------------------

. ************CODE ENDS********************
.
In (1) & (2) the estimation results are identical, and the (weighted)
population and observation counts are equal to 3, the subpopulation
size. In (3), the standard error is larger and the population and
observation counts are equal to the total sample size: 4.
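
The two standard errors can be reproduced by hand. This is only a sketch, assuming the usual with-replacement linearization formula for a total in a single stratum with unit weights and no FPC, v = n/(n-1) * sum((z_i - zbar)^2), where z_i is the observation's contribution to the total:

```stata
* (1) and (2): only the 3 nonmissing observations enter.
* z = {1, 3, 5}, mean 3, n = 3
display sqrt((3/2) * ((1-3)^2 + (3-3)^2 + (5-3)^2))

* (3): all 4 observations enter; the one outside the subpopulation contributes 0.
* z = {0, 1, 3, 5}, mean 2.25, n = 4
display sqrt((4/3) * ((0-2.25)^2 + (1-2.25)^2 + (3-2.25)^2 + (5-2.25)^2))
```

The first line should display about 3.4641 and the second about 4.4347, matching the Std. Err. columns for (1)/(2) and (3).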
In (1), because your analysis variable is missing, Stata drops the observation.
The same happens in (2), where the -if- qualifier removes observations not in the
subpopulation before estimation.
In (3), the subpop() option tells Stata to keep observations *not* in
the subpopulation for purposes of computing standard errors. Thus the
entire sample contributes to the analysis. For details, see any sampling text,
e.g. Levy & Lemeshow (2008).
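
Another way to see what subpop() does for a total: it should give the same answer as totaling, over the full sample, a variable that equals y inside the subpopulation and 0 outside it. A minimal sketch with the toy data from the example, meant to be run as a do-file (the names insub and y0 are made up for illustration):

```stata
clear
input y
.
1
3
5
end
svyset _n

* insub flags the subpopulation; y0 is y inside it and 0 outside
* (both are hypothetical names, not part of the original example)
generate byte insub = !missing(y)
generate y0 = cond(insub, y, 0)

* Full-sample total of y0: all 4 observations enter the variance calculation
svy: total y0

* The same estimate through subpop(), as in (3)
svy, subpop(insub): total y
```

Both commands should report a total of 9 with a standard error of about 4.4347 and Number of obs = 4, as in (3).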
Notes:
1. I've never seen a recommendation to consider observations with non-missing
values as a subpopulation. The focus is more on non-response bias, and possible
solutions include non-response weighting and imputation (though not for the outcome).
2. Combining subpopulations with post-strata and ordinary strata
can lead to bad results. Stratified & post-stratified proportions are
designed to match those of the entire population, and may not apply to
the subpopulation. See Levy & Lemeshow (2008), Section 6.4, p. 148.
3. I use the clause "if !missing(y)" above, rather than "if y ~=.", because
the latter would not capture missing values like ".a".
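
A quick check of note 3 (a sketch; .a stands in for any extended missing value):

```stata
* missing() recognizes extended missing values: displays 1
display missing(.a)

* .a is not equal to the system missing value ".", so this also displays 1,
* meaning "if y ~= ." would keep an observation with y == .a
display (.a ~= .)

* All missing values sort at or above ".", so this displays 1 as well;
* "if y < ." is therefore another way to keep only nonmissing values
display (.a >= .)
```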
Reference: Levy, Paul S., and Stanley Lemeshow. 2008. Sampling of Populations: Methods and Applications. Wiley Series in Survey Methodology. Hoboken, NJ: Wiley.
Steve
> On Sep 26, 2012, at 9:25 AM, <[email protected]> <[email protected]> wrote:
>
> Hi everyone,
> I'm currently working on analyzing the results of a survey and have run into some strange results when using poststratification weights and the subpop modifier. An example is shown below, where we're simply totaling 2011 sales. The flag variable indicates the subpopulation we're interested in. When only limiting the population by flag, the command calculates the total over 2,624 PSUs, while when we try and further limit the population to those with flag equal to one and where total sales is not missing, it calculates over 2,639 PSUs. In the second command, STATA seems to be including the 15 missing values in its calculations. Also, the total for the more limited subpopulation is lower, which does not coincide with what we expect to happen when removing missing values and its effect on the background calculation of the adjusted weight.
>
> Could someone shed some light on why this is happening?
>
> Thank you,
> Ricky Ubee
>
>
>
>
> . svyset uniqueID [pweight=weight_prop], strata(strata2) singleunit(scaled) poststrata(type2) postweight(postwt4) fpc(N)
>
> pweight: weight_prop
> VCE: linearized
> Poststrata: type2
> Postweight: postwt4
> Single unit: scaled
> Strata 1: strata2
> SU 1: uniqueID
> FPC 1: N
>
>
> . svy, subpop(if flag==1): total TOT_SALES_11
> (running total on estimation sample)
>
> Survey: Total estimation
>
> Number of strata = 26 Number of obs = 2624
> Number of PSUs = 2624 Population size = 23794
> N. of poststrata = 16 Subpop. no. obs = 652
> Subpop. size = 5245.94
> Design df = 2598
>
> --------------------------------------------------------------
> | Linearized
> | Total Std. Err. [95% Conf. Interval]
> -------------+------------------------------------------------
> TOT_SALES_11 | 2.20e+12 2.77e+11 1.65e+12 2.74e+12
> --------------------------------------------------------------
> Note: 2 strata omitted because they contain no subpopulation
> members.
>
> . svy, subpop(if flag==1 & TOT_SALES_11~=.): total TOT_SALES_11
> (running total on estimation sample)
>
> Survey: Total estimation
>
> Number of strata = 26 Number of obs = 2639
> Number of PSUs = 2639 Population size = 23794
> N. of poststrata = 16 Subpop. no. obs = 652
> Subpop. size = 5222.38
> Design df = 2613
>
> --------------------------------------------------------------
> | Linearized
> | Total Std. Err. [95% Conf. Interval]
> -------------+------------------------------------------------
> TOT_SALES_11 | 2.18e+12 2.76e+11 1.64e+12 2.72e+12
> --------------------------------------------------------------
> Note: 2 strata omitted because they contain no subpopulation
> members.
>
>
> . count if flag==1 & TOT_SALES_11==.
> 15
>