Title | Missing standard error because of stratum with single sampling unit
Author | Mia Lv, StataCorp
By default, Stata's survey estimation commands report missing standard errors when they encounter a stratum with a singleton PSU. Here is an example:
    . use http://www.stata-press.com/data/r15/nhanes2b, clear

    . svyset psuid [pweight=finalwgt], strata(stratid)
      (output omitted)

    . svy: mean hdresult
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata = 31            Number of obs   =      8,720
    Number of PSUs   = 60            Population size = 98,725,345
                                     Design df       =         29

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   std. err.     [95% conf. interval]
    -------------+------------------------------------------------
        hdresult |   49.67141          .             .           .
    --------------------------------------------------------------
When a stratum contains only one PSU, there is insufficient information to estimate that stratum's variance. Consequently, the variance of an estimated parameter cannot be computed for a stratified clustered design in which any stratum has a singleton PSU.

There are two solutions. The first is to reassign each stratum with a singleton PSU to another appropriately chosen stratum. To use this method, we must first identify the strata with singleton PSUs.
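As background on why a single PSU leaves the variance undefined, here is a sketch of the usual linearized variance estimator for an estimated total under a stratified clustered design. The notation is standard textbook notation introduced here for illustration, not taken from the text above; see [SVY] Variance estimation for the exact formulas Stata uses.

```latex
% L       : number of strata
% n_h     : number of sampled PSUs in stratum h
% f_h     : first-stage sampling fraction in stratum h (0 when no FPC is set)
% y_{hi}  : weighted total of the variable for PSU i in stratum h
% \bar{y}_h = (1/n_h) \sum_i y_{hi} : mean of the PSU totals in stratum h
\[
  \widehat{V}(\widehat{Y})
  = \sum_{h=1}^{L} (1 - f_h)\,\frac{n_h}{n_h - 1}
    \sum_{i=1}^{n_h} \left( y_{hi} - \bar{y}_h \right)^2
\]
```

When n_h = 1, the sum of squared deviations is 0 and the factor n_h/(n_h - 1) divides by 0, so the contribution of that stratum is undefined and the overall standard error is reported as missing.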
After setting our survey characteristics with svyset, we can use the svydescribe command to identify the strata with singleton PSUs. Those strata will be marked with an asterisk in the output. Let's look at the following dataset:
    clear
    input stratid psuid age hdresult finalwgt
    1 1 68 40  9687
    1 1 54 53 36028
    2 1 26 35 26896
    2 1 24 48  8213
    2 2 68 43  3316
    2 2 61 65  8475
    3 1 25 80 10900
    3 1 27 93  7619
    3 2 24 38 22584
    3 2 64 72  2875
    end
    svyset psuid [pweight=finalwgt], strata(stratid)
    save data1.dta
We run svydescribe and get the following output:
    . svydescribe

    Survey: Describing stage 1 sampling units

    Sampling weights: finalwgt
                 VCE: linearized
         Single unit: missing
            Strata 1: stratid
     Sampling unit 1: psuid
               FPC 1: <zero>

                                        Number of obs per unit
     Stratum   # units     # obs       Min      Mean       Max
    ----------------------------------------------------------
           1        1*        2          2       2.0         2
           2        2         4          2       2.0         2
           3        2         4          2       2.0         2
    ----------------------------------------------------------
           3        5        10          2       2.0         2
Here we can see that stratum 1 has a singleton PSU.
When we perform an estimation with survey data, the problem of a stratum with a singleton PSU can arise even if all strata in the dataset have multiple PSUs. This happens when some observations are dropped from the estimation sample because of missing values.
Let us look at the following survey data. In this example, when we try to estimate the mean of variable hdresult, the standard errors are missing, and a note on the output tells us that this is caused by a stratum with a single PSU:
    clear
    input stratid psuid age hdresult finalwgt
    1 1 68 40  9687
    1 1 54 53 36028
    1 2 28  .  9356
    1 2 35  . 10265
    2 1 26 35 26896
    2 1 24 48  8213
    2 2 68 43  3316
    2 2 61 65  8475
    3 1 25 80 10900
    3 1 27 93  7619
    3 2 24 38 22584
    3 2 64 72  2875
    end
    svyset psuid [pweight=finalwgt], strata(stratid)
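    . svy: mean hdresult
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata = 3             Number of obs   =      10
    Number of PSUs   = 5             Population size = 136,593
                                     Design df       =       2

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   std. err.     [95% conf. interval]
    -------------+------------------------------------------------
        hdresult |   51.04046          .             .           .
    --------------------------------------------------------------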
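Running svydescribe on the whole dataset reports the following:

                                        Number of obs per unit
     Stratum   # units     # obs       Min      Mean       Max
    ----------------------------------------------------------
           1        2         4          2       2.0         2
           2        2         4          2       2.0         2
           3        2         4          2       2.0         2
    ----------------------------------------------------------
           3        6        12          2       2.0         2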
Here svydescribe does not detect any stratum with a singleton PSU because, by default, it checks the entire dataset. The appropriate approach is to add the if e(sample) qualifier so that svydescribe is restricted to the estimation sample used by svy: mean hdresult.
    . svydescribe if e(sample)

    Survey: Describing stage 1 sampling units

          pweight: finalwgt
              VCE: linearized
      Single unit: missing
         Strata 1: stratid
             SU 1: psuid
            FPC 1: <zero>

                                        Number of obs per unit
     Stratum   # units     # obs       Min      Mean       Max
    ----------------------------------------------------------
           1        1*        2          2       2.0         2
           2        2         4          2       2.0         2
           3        2         4          2       2.0         2
    ----------------------------------------------------------
           3        5        10          2       2.0         2
                              2 = #Obs with missing values in the
                         ------     survey characteristics
                             12
An alternative way to use svydescribe in this scenario is to write:
svydescribe hdresult
This command applies svydescribe to the subset of the data in which the variable hdresult is not missing.
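The detection step can also be narrowed or cross-checked by hand. The following is a minimal sketch, assuming the second example dataset is in memory and svy: mean hdresult has just been run. The single option of svydescribe restricts the listing to strata with one sampling unit. The variables psutag and npsu are hypothetical helper names introduced only for this illustration.

```stata
* list only the strata that have a single sampling unit in the estimation sample
svydescribe if e(sample), single

* manual cross-check: count the PSUs per stratum that have at least one
* nonmissing value of hdresult (psutag and npsu are illustrative names)
egen byte psutag = tag(stratid psuid) if !missing(hdresult)
egen npsu = total(psutag), by(stratid)

* strata left with exactly one usable PSU
tabulate stratid if npsu == 1
```

The manual count depends only on which observations have nonmissing hdresult, so it flags the same kind of problem before any estimation command is run.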
After detecting the strata with singleton PSUs, we reassign each such stratum to another appropriately chosen stratum. Let us return to the dataset data1.dta, saved earlier. We already know that only stratum 1 has a singleton PSU. Suppose we want to merge stratum 1 into stratum 2. We first generate a new PSU identifier variable, psu, and a new stratum identifier variable, strata, so that the original variables are left untouched. We assign distinct values of psu to all the sampling units in strata 1 and 2 so that each sampling unit can be told apart in the combined stratum. After that, we change the value of strata. Finally, we svyset the data again using the new variables psu and strata.
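    use data1, clear
    * number the PSUs in strata 1 and 2 uniquely within the combined stratum
    egen psu = group(stratid psuid) if inlist(stratid,1,2)
    * all other strata keep their original PSU identifiers
    replace psu = psuid if stratid>2
    * copy the stratum identifier, then fold stratum 1 into stratum 2
    generate strata = stratid
    replace strata = 2 if strata==1
    * declare the survey design again with the new identifiers
    svyset psu [pweight=finalwgt], strata(strata)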
Now, let us check again if there are any strata with singleton PSUs:
    . svydescribe

    Survey: Describing stage 1 sampling units

          pweight: finalwgt
              VCE: linearized
      Single unit: missing
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>

                                        Number of obs per unit
     Stratum   # units     # obs       Min      Mean       Max
    ----------------------------------------------------------
           2        3         6          2       2.0         2
           3        2         4          2       2.0         2
    ----------------------------------------------------------
           2        5        10          2       2.0         2
All the strata have multiple PSUs now. We can go ahead and run our svy estimation commands.
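As a minimal sketch of that last step, assuming the psu and strata variables created above are in memory and the data have been svyset with them, the mean can be reestimated and the design summary checked from the stored results. No output is shown here.

```stata
* rerun the estimation on the collapsed design
svy: mean hdresult

* inspect the design counts stored by the svy estimation command
display "strata = " e(N_strata) ", PSUs = " e(N_psu) ", design df = " e(df_r)
```

Because the design degrees of freedom equal the number of PSUs minus the number of strata, collapsing stratum 1 into stratum 2 should give 5 - 2 = 3 degrees of freedom for this dataset.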
An alternative solution for strata with singleton PSUs is to specify the singleunit() option when we svyset the data. The default, singleunit(missing), results in missing values for the standard errors. There are three other choices. The first, singleunit(certainty), treats strata with singleton PSUs as certainty units, so those strata contribute nothing to the standard error. The second, singleunit(scaled), is a scaled version of singleunit(certainty): the scaling factor comes from using the average of the variances from the strata with multiple sampling units for each stratum with a singleton PSU. The third, singleunit(centered), specifies that strata with singleton PSUs be centered at the grand mean instead of the stratum mean.
Here is an example using singleunit(certainty):
    . use http://www.stata-press.com/data/r15/nhanes2b, clear

    . svyset psuid [pweight=finalwgt], singleunit(certainty) strata(stratid)

    Sampling weights: finalwgt
                 VCE: linearized
         Single unit: certainty
            Strata 1: stratid
     Sampling unit 1: psuid
               FPC 1: <zero>

    . svy: mean hdresult
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata = 31            Number of obs   =      8,720
    Number of PSUs   = 60            Population size = 98,725,345
                                     Design df       =         29

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   std. err.     [95% conf. interval]
    -------------+------------------------------------------------
        hdresult |   49.67141   .3829811      48.88813     50.4547
    --------------------------------------------------------------
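To see how much the choice of singleunit() matters for a given analysis, the three nonmissing specifications can be fit in turn and their standard errors compared. The following is a minimal sketch using the same dataset. The stored-estimate names m_certainty, m_scaled, and m_centered are illustrative, and no output is shown because only the certainty result appears above.

```stata
use http://www.stata-press.com/data/r15/nhanes2b, clear

* fit the same mean under each handling of singleton strata
foreach su in certainty scaled centered {
    quietly svyset psuid [pweight=finalwgt], strata(stratid) singleunit(`su')
    quietly svy: mean hdresult
    estimates store m_`su'
}

* point estimates are identical; only the standard errors can differ
estimates table m_certainty m_scaled m_centered, se
```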
For more details about the methodology Stata uses to estimate variances for survey data, see [SVY] Variance estimation. Choose the singleunit() specification that matches the assumptions of your analysis.