Title | Using svyset for stratified multiple-stage designs |
Author | Jeffrey Pitblado, StataCorp |
Suppose you are faced with analyzing data from the following survey design:
The population was sampled by stratifying it first and then randomly selecting several clusters for each stratum. Within each cluster, subclusters were randomly selected, and then for each subcluster individuals were randomly selected.
Your first question when analyzing survey data should always be:
How do I identify the sampling design using svyset in Stata?
Starting in Stata 9, svyset has a syntax to deal with multiple stages of clustered sampling.
Let’s make up some variable names to represent survey design characteristics:
Variable | Description |
---|---|
pwt | sampling weights |
strata1 | stage 1 strata |
su1 | stage 1 sampling units (PSU) |
fpc1 | stage 1 finite population correction |
strata2 | stage 2 strata |
su2 | stage 2 sampling units (SSU) |
fpc2 | stage 2 finite population correction |
... you get the idea.
Given the description above, the svyset command should be structured as follows:
svyset su1 [pw=pwt], strata(strata1) fpc(fpc1) ///
    || su2, fpc(fpc2) || _n, fpc(fpc3)
(/// tells Stata to continue to the next line in ado- or do-files.)
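Once the design is declared, it is worth confirming that Stata recorded it as intended before estimating anything. The following is a minimal sketch, assuming a dataset in memory that contains the design variables named above; the outcome variable y is hypothetical and stands in for whatever you want to analyze.

```stata
* Assumes the design variables above exist in the data in memory;
* y is a hypothetical outcome variable used only for illustration.
svyset su1 [pw=pwt], strata(strata1) fpc(fpc1) ///
    || su2, fpc(fpc2) || _n, fpc(fpc3)

svyset          // with no arguments, redisplays the current survey settings
svydescribe     // describes strata and sampling units (stage 1 by default)
svy: mean y     // estimation commands prefixed with svy: use the declared design
```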
Before Stata 9, when svyset accepted only the first-stage design variables, one might have assumed that the svyset command should be as follows:
svyset [pweight=pwt], fpc(fpc1) psu(su1) strata(strata1)
When using only the first-stage design characteristics, you must be aware that specifying an FPC implies there was no sampling within the PSUs. If that is not true, specifying an FPC for the first stage will yield negatively biased standard errors; that is, the standard error estimates will be smaller than they should be. In this case, we recommend that you not svyset an FPC.
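To see where the bias comes from, consider the textbook two-stage case (an illustrative assumption, not the design described above): n of N equal-sized PSUs are drawn by simple random sampling without replacement, and then m of the M elements in each sampled PSU are drawn the same way. Write f_1 = n/N and f_2 = m/M for the sampling fractions, s_1^2 for the sample variance of the PSU means, and s_2^2 for the pooled within-PSU sample variance. The standard unbiased estimator of the variance of the sample mean is then

$$
v(\bar{y}) = \frac{1 - f_1}{n}\, s_1^2 + \frac{f_1 (1 - f_2)}{mn}\, s_2^2 .
$$

Declaring only the first stage together with an FPC keeps just the first term, so the nonnegative second-stage component is dropped. Without the FPC, the estimator in this simplified setting is roughly s_1^2 / n, which in expectation is not smaller than the true variance.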
If we remove the fpc() option, then
svyset [pweight=pwt], psu(su1) strata(strata1)
will produce appropriate variance estimates, even for multistage designs.
The same holds if you are using the modern svyset syntax but can specify only the first-stage characteristics. For example, some datasets come with stratification and sampling-unit information for the first stage only, even though they were collected via a multistage design. In that case, fpc() should not be used, for the reasons explained above.
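In the modern syntax, the first-stage-only declaration might look like the sketch below, using the made-up variable names from the table above. The PSU variable is given as the first argument instead of in a psu() option.

```stata
* Modern syntax with first-stage information only: the PSU variable comes first.
* fpc() is deliberately omitted because sampling also occurred within the PSUs.
svyset su1 [pweight=pwt], strata(strata1)
```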
In current versions of Stata (Stata 9 and later), you can specify the design variables for each stage, using || to delimit the stages.
Now suppose the design involved cluster sampling first, and then each cluster was stratified before the subclusters were sampled. Here we stratified in the second stage but not the first, so we should have a variable like strata2 instead of strata1:
svyset su1 [pw=pwt], fpc(fpc1) ///
    || su2, strata(strata2) fpc(fpc2) || _n, fpc(fpc3)
If our design involved stratified cluster sampling in both the first and second stages, the svyset command would be as follows:
svyset su1 [pw=pwt], strata(strata1) fpc(fpc1) ///
    || su2, strata(strata2) fpc(fpc2) || _n, fpc(fpc3)
In current versions of Stata, you need to know which stage each strata variable belongs to, because strata() is specified separately within each stage. See [SVY] svyset for more examples of how to svyset multistage designs.
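One way to check that each strata variable was attached to the intended stage is to describe the stages one at a time. This is a sketch only, assuming the made-up variable names above and a design stratified in both the first and second stages.

```stata
* Assumes the design variables named above exist in the data in memory.
svyset su1 [pw=pwt], strata(strata1) fpc(fpc1) ///
    || su2, strata(strata2) fpc(fpc2) || _n, fpc(fpc3)

svydescribe, stage(1)   // describe the first-stage strata and sampling units
svydescribe, stage(2)   // describe the second-stage strata and sampling units
```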
Prior to Stata 9, you would use the strata() option only if your design had stratification in the first stage.