Ángel Rodríguez Laso <[email protected]> has a follow-up question regarding
-svy: tabulate- with the -subpop()- option:
> Following with this, I have a query: If there are missing values in a
> variable and SEs and CIs for the valid values are wanted, how should
> one proceed? Are individuals with missing values dropped from the
> calculations of SEs if subpop is not used? I see four possibilities:
>
> 1) svy:tab variable */intuitive option
>
> 2) svy, subpop (valid values): tab variable */probably most accurate
>
> 3) svy if variable==valid values: tab variable */not recommended for svy
>
> 4) svy: tab variable, missing */ but then you don't get proportions of
> valid values after excluding missing values
>
> In an example with a dichotomous variable with 5.7% missing values, I
> get exactly (up to three decimal figures) the same SEs, CIs and number
> of observations (n=11500, degrees of freedom=1255) with options 1, 2
> and 3, and slightly smaller SEs with option 4 (n=12190, df=1255).
In reviewing Ángel's results, we noticed that -svy: tabulate- is incorrectly
dropping out-of-subpop observations that contain missing values in the
variables of the varlist (Option 2 should be different from options 1 and 3).
This affects the variance values when primary sampling units are dropped
because of missing values and could decrease the design degrees of freedom.
Both of these effects are very slight and inversely related to the number of
PSUs. We will correct this in the next Stata ado-file update.
In light of this, we'll address Ángel's observations using -svy: proportion-,
which is very similar to -svy: tabulate- and correctly deals with missing
values in out-of-subpop observations.
In the following we assume that the only variable with missing values is the
one we are tabulating. Here is a simple example that illustrates the
differences among the 4 options delineated by Ángel.
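. sysuse auto
. * treat each observation as its own PSU (no strata, no weights)
. svyset _n
. * 1: no options (rep abbreviates rep78, which has missing values)
. svy: prop rep
. est store noopts
. * 2: subpop() defined by an indicator for nonmissing rep
. gen valid = !missing(rep)
. svy, subpop(valid): prop rep
. est store subpop
. * 3: -if- restriction to nonmissing rep
. svy: prop rep if valid
. est store withif
. * 4: missing values treated as a separate category
. svy: prop rep, missing
. est store missing
. * compare point estimates (b) and standard errors (se) side by side
. est table _all, b se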
***** BEGIN: final output from above illustrative example
. est table _all, b se
------------------------------------------------------------------
Variable | noopts subpop withif missing
-------------+----------------------------------------------------
1 | .02898551 .02898551 .02898551 .02702703
| .02034459 .02033449 .02034459 .01897965
2 | .11594203 .11594203 .11594203 .10810811
| .03882454 .03880527 .03882454 .03634325
3 | .43478261 .43478261 .43478261 .40540541
| .0601159 .06008606 .0601159 .05746373
4 | .26086957 .26086957 .26086957 .24324324
| .05324978 .05322334 .05324978 .05021542
5 | .15942029 .15942029 .15942029 .14864865
| .04439221 .04437017 .04439221 .04163643
_prop_6 | .06756757
| .02937761
------------------------------------------------------------------
legend: b/se
***** END:
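To see how many observations and PSUs each option actually used, one can
inspect the stored results. The following is only a sketch: it assumes the
estimates stored above are still in memory, and that -e(N)-, -e(N_sub)-,
-e(N_psu)-, and -e(df_r)- are among the results saved by -svy: proportion-
in your version of Stata (check -ereturn list- if unsure).
. * sketch: compare sample size, PSU count, and design df for options 1 and 2
. estimates restore noopts
. display "N = " e(N) "  PSUs = " e(N_psu) "  df = " e(df_r)
. estimates restore subpop
. display "N = " e(N) "  N_sub = " e(N_sub) "  PSUs = " e(N_psu) "  df = " e(df_r)
With the auto data and -svyset _n-, we would expect option 1 to report fewer
PSUs and design degrees of freedom than option 2, because the observations
with missing rep are dropped rather than kept as out-of-subpop.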
Summary of options (illustrated by above example using the auto data):
- Options 1 (noopts) and 3 (withif) are equivalent. Stata's -svy- commands
drop within-subpop observations containing missing values. In this case,
the "subpop" is the entire population, and option 3 merely explicitly
excludes the observations that option 1 dropped because of missing values.
- Option 2 (subpop) differs by treating the observations where the tabulated
variables contain missing values as out-of-subpop. Thus we are defining the
subpop as the collection of individuals in the population for which we are
able to collect information on the tabulated variable. While this results
in the same point estimates for any survey design, the variance estimates
can differ, depending on the number of PSUs that options 1 and 3 drop
because of missing values.
- Option 4 (missing) merely treats the missing values as a separate category,
potentially biasing the point estimates and standard errors downward (toward
zero). The -missing- option should only be used in cases where the missing
values mean something like "not applicable" rather than "we couldn't get a
value from the survey participant".
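The standard errors for category 1 in the table above can be reproduced by
hand, which makes the differences concrete. This is a back-of-the-envelope
sketch that assumes the design used in the example (each of the 74
observations is its own PSU, one stratum, no weights, no FPC); the formulas
in the comments are the linearized variance simplified for that special
case, not a general result. Here 2 is the number of cars in category 1 and
69 is the number of nonmissing values of rep.
. * options 1 and 3: only 69 PSUs remain, so Var = p(1-p)/(69-1), p = 2/69
. display sqrt((2/69)*(1 - 2/69)/68)
. * option 2: all 74 PSUs are kept, so Var = (74/73)*p(1-p)/69, p = 2/69
. display sqrt((74/73)*(2/69)*(1 - 2/69)/69)
. * option 4: missing is its own category, so Var = p(1-p)/(74-1), p = 2/74
. display sqrt((2/74)*(1 - 2/74)/73)
These should agree with the first row of standard errors in the table
(roughly .02034, .02033, and .01898). The only difference between options 1
and 2 is the factor n/(n-1) and the divisor, which is why the gap shrinks as
the number of PSUs grows.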
Which option to choose depends largely on the reason the values are missing
in the data.
--Jeff
[email protected]