If I've understood correctly, excluding individuals with missing
values from the calculation of variance estimates in options 1 and 3
is like, for example, dropping men when you need estimates only for
women (using 'if' instead of 'subpop'), which is incorrect in survey
analysis.
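
A minimal sketch of that analogy, using the auto data rather than my
survey. The assumptions here are mine: each car is treated as its own
PSU (svyset _n, as in Jeff's example) and the 0/1 variable foreign
stands in for the men/women split.

```stata
* Sketch only: auto data, one PSU per observation, foreign as the subgroup
sysuse auto, clear
svyset _n

* Subpopulation estimate: every car stays in the variance calculation
svy, subpop(foreign): mean mpg
estimates store subpop

* Conditioning with if: domestic cars are dropped before estimation
svy: mean mpg if foreign
estimates store withif

* Put the two sets of estimates and standard errors side by side
estimates table subpop withif, b se
```

The point estimates should agree. Any difference would show up in the
standard errors, and with such a simple design I would expect it to be
small.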
Therefore, when calculating variances for the valid values of a survey
variable (the rest being missing because the respondent didn't answer,
didn't know the answer, or the question was not applicable), option 2
(subpop(valid values)) should be used, because it takes all individuals
(with valid and missing values) into account in the calculation. Is that
correct?
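
One way to check which individuals enter the calculation is to look at
the results stored after estimation. This is only a sketch on the auto
data, where rep78 has some missing values. The variable name valid and
the choice of stored results to display are mine.

```stata
* Sketch only: compare what options 1 and 2 keep in the calculation
sysuse auto, clear
svyset _n
generate byte valid = !missing(rep78)

* Option 1: observations with missing rep78 are dropped
svy: proportion rep78
display "option 1: N = " e(N) ", PSUs = " e(N_psu)
display "          design df = " e(df_r)

* Option 2: observations with missing rep78 are kept as out-of-subpop
svy, subpop(valid): proportion rep78
display "option 2: N = " e(N) ", subpop N = " e(N_sub)
display "          PSUs = " e(N_psu) ", design df = " e(df_r)
```

If option 2 works as I understand it, it should report more
observations and PSUs than option 1, and so more design degrees of
freedom.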
Many thanks.
Ángel Rodríguez Laso
2008/8/20, Jeff Pitblado, StataCorp LP <[email protected]>:
> Ángel Rodríguez Laso <[email protected]> has a follow-up question regarding
> -svy: tabulate- with the -subpop()- option:
>
> > Following on from this, I have a query: If there are missing values in a
> > variable and SEs and CIs for the valid values are wanted, how should
> > one proceed? Are individuals with missing values dropped from the
> > calculations of SEs if subpop is not used? I see four possibilities:
> >
> > 1) svy:tab variable */intuitive option
> >
> > 2) svy, subpop (valid values): tab variable */probably most accurate
> >
> > 3) svy if variable==valid values: tab variable */not recommended for svy
> >
> > 4) svy: tab variable, missing */ but then you don't get proportions of
> > valid values after excluding missing values
> >
> > In an example with a dichotomous variable with 5.7% missing values, I
> > get exactly (up to three decimal figures) the same SEs, CIs and number
> > of observations (n=11500, degrees of freedom=1255) with options 1, 2
> > and 3, and slightly smaller SEs with option 4 (n=12190, df=1255).
>
> In reviewing Ángel's results, we noticed that -svy: tabulate- is incorrectly
> dropping out-of-subpop observations that contain missing values in the
> variables of the varlist (Option 2 should be different from options 1 and 3).
> This affects the variance values when primary sampling units are dropped
> because of missing values and could decrease the design degrees of freedom.
> Both of these effects are very slight and inversely related to the number of
> PSUs. We will correct this in the next Stata ado-file update.
>
> In light of this, we'll address Ángel's observations using -svy: proportion-,
> which is very similar to -svy: tabulate- and correctly deals with missing
> values in out-of-subpop observations.
>
> In the following we assume that the only variable with missing values is the
> one we are tabulating. Here is a simple example that illustrates the
> differences among the 4 options delineated by Ángel.
>
> . sysuse auto
> . svyset _n
> . * 1
> . svy: prop rep
> . est store noopts
> . * 2
> . gen valid = !missing(rep)
> . svy, subpop(valid): prop rep
> . est store subpop
> . * 3
> . svy: prop rep if valid
> . est store withif
> . * 4
> . svy: prop rep, missing
> . est store missing
> . est table _all, b se
>
> ***** BEGIN: final output from above illustrative example
> . est table _all, b se
>
> ------------------------------------------------------------------
>     Variable |      noopts       subpop       withif      missing
> -------------+----------------------------------------------------
>            1 |  .02898551    .02898551    .02898551    .02702703
>              |  .02034459    .02033449    .02034459    .01897965
>            2 |  .11594203    .11594203    .11594203    .10810811
>              |  .03882454    .03880527    .03882454    .03634325
>            3 |  .43478261    .43478261    .43478261    .40540541
>              |   .0601159    .06008606     .0601159    .05746373
>            4 |  .26086957    .26086957    .26086957    .24324324
>              |  .05324978    .05322334    .05324978    .05021542
>            5 |  .15942029    .15942029    .15942029    .14864865
>              |  .04439221    .04437017    .04439221    .04163643
>      _prop_6 |                                         .06756757
>              |                                         .02937761
> ------------------------------------------------------------------
> legend: b/se
> ***** END:
>
> Summary of options (illustrated by above example using the auto data):
>
> - Options 1 (noopts) and 3 (withif) are equivalent. Stata's -svy- commands
> drop within-subpop observations containing missing values. In this case,
> the "subpop" is the entire population, and option 3 merely explicitly
> excludes the observations that option 1 dropped because of missing values.
>
> - Option 2 (subpop) differs by treating the observations where the tabulated
> variables contain missing values as out-of-subpop. Thus we are defining the
> subpop as the collection of individuals in the population for which we are
> able to collect information on the tabulated variable. While this results
> in the same point estimates for any survey design, the variance estimates
> can vary depending upon the number of PSUs that are dropped by options 1 and
> 3.
>
> - Option 4 (missing) merely treats the missing values as a separate category,
> potentially biasing the point estimates and standard errors downward (toward
> zero). The -missing- option should only be used in cases where the missing
> values mean something like "not applicable" rather than "we couldn't get a
> value from the survey participant".
>
> The option to choose is largely dependent on the reason for missing values in
> the data.
>
> --Jeff
> [email protected]
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>