Stata has a number of features designed to handle the special requirements of complex survey data. The survey features will handle probability sampling weights, multiple stages of cluster sampling, stage-level sampling weights, stratification, and poststratification.
Variance estimates are produced using one of five variance estimation techniques: balanced repeated replication, the bootstrap, the jackknife, successive difference replication, and Taylor linearization. See [SVY] variance estimation for an overview of these techniques.
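As a rough illustration of switching between variance estimators, the sketch below uses the nhanes2 example dataset that appears later on this page. The jackknife is chosen here only because it can be computed from the PSU and strata identifiers alone; the bootstrap and successive difference replication typically expect replicate-weight variables to be declared with svyset, which this dataset is not assumed to have.

```stata
* Sketch: compare the default linearized VCE with the jackknife
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)

* Taylor linearization is the default variance estimator
svy: mean weight

* Request jackknife variance estimation for a single command
svy jackknife: mean weight

* Or make the jackknife the default for later svy commands
svyset psu [pweight=finalwgt], strata(strata) vce(jackknife)
svy: mean weight
```

The point estimate is the same in each case; only the standard errors and confidence intervals depend on the variance estimator.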
Many different types of estimation can be performed using Stata's survey facilities:
Descriptive statistics
mean | Estimate means |
---|---|
proportion | Estimate proportions |
ratio | Estimate ratios |
tabulate (oneway) | One-way tables for survey data |
tabulate (twoway) | Two-way tables for survey data |
total | Estimate totals |
Linear regression models
churdle | Cragg hurdle regression |
---|---|
cnsreg | Constrained linear regression |
eintreg | Extended interval regression |
eregress | Extended linear regression |
etregress | Linear regression with endogenous treatment effects |
glm | Generalized linear models |
hetregress | Heteroskedastic linear regression |
intreg | Interval regression |
nl | Nonlinear least-squares estimation |
regress | Linear regression |
tobit | Tobit regression |
truncreg | Truncated regression |
Structural equation models
sem | Structural equation model estimation command |
---|---|
gsem | Generalized structural equation model estimation command |
Survival-data regression models
stcox | Cox proportional hazards model |
---|---|
stintreg | Parametric models for interval-censored survival-time data |
streg | Parametric survival models |
Binary-response regression models
biprobit | Bivariate probit regression |
---|---|
cloglog | Complementary log-log regression |
eprobit | Extended probit regression |
hetprobit | Heteroskedastic probit model |
logistic | Logistic regression, reporting odds ratios |
logit | Logistic regression, reporting coefficients |
probit | Probit regression |
scobit | Skewed logistic regression |
Discrete-response regression models
clogit | Conditional (fixed-effects) logistic regression |
---|---|
cmmixlogit | Mixed logit choice model |
cmxtmixlogit | Panel-data mixed logit choice model |
eoprobit | Extended ordered probit regression |
hetoprobit | Heteroskedastic ordered probit regression |
mlogit | Multinomial (polytomous) logistic regression |
mprobit | Multinomial probit regression |
ologit | Ordered logistic regression |
oprobit | Ordered probit regression |
slogit | Stereotype logistic regression |
ziologit | Zero-inflated ordered logit regression |
zioprobit | Zero-inflated ordered probit regression |
Fractional-response regression models
betareg | Beta regression |
---|---|
fracreg | Fractional response regression |
Poisson regression models
cpoisson | Censored Poisson regression |
---|---|
etpoisson | Poisson regression with endogenous treatment effects |
gnbreg | Generalized negative binomial regression in [R] nbreg |
nbreg | Negative binomial regression |
poisson | Poisson regression |
tnbreg | Truncated negative binomial regression |
tpoisson | Truncated Poisson regression |
zinb | Zero-inflated negative binomial regression |
zip | Zero-inflated Poisson regression |
Instrumental-variables regression models
ivprobit | Probit model with continuous endogenous covariates |
---|---|
ivregress | Single-equation instrumental-variables regression |
ivtobit | Tobit model with continuous endogenous covariates |
Regression models with selection
heckman | Heckman selection model |
---|---|
heckoprobit | Ordered probit model with sample selection |
heckpoisson | Poisson regression with sample selection |
heckprobit | Probit model with sample selection |
Longitudinal/panel-data regression models
xtmlogit | Fixed-effects and random-effects multinomial logit models |
---|---|
Multilevel mixed-effects models
mecloglog | Multilevel mixed-effects complementary log-log regression |
---|---|
meglm | Multilevel mixed-effects generalized linear model |
meintreg | Multilevel mixed-effects interval regression |
melogit | Multilevel mixed-effects logistic regression |
menbreg | Multilevel mixed-effects negative binomial regression |
meologit | Multilevel mixed-effects ordered logistic regression |
meoprobit | Multilevel mixed-effects ordered probit regression |
mepoisson | Multilevel mixed-effects Poisson regression |
meprobit | Multilevel mixed-effects probit regression |
mestreg | Multilevel mixed-effects parametric survival models |
metobit | Multilevel mixed-effects tobit regression |
Finite mixture models
fmm: betareg | Finite mixtures of beta regression models |
---|---|
fmm: cloglog | Finite mixtures of complementary log-log regression models |
fmm: glm | Finite mixtures of generalized linear regression models |
fmm: intreg | Finite mixtures of interval regression models |
fmm: ivregress | Finite mixtures of linear regression models with endogenous covariates |
fmm: logit | Finite mixtures of logistic regression models |
fmm: mlogit | Finite mixtures of multinomial (polytomous) logistic regression models |
fmm: nbreg | Finite mixtures of negative binomial regression models |
fmm: ologit | Finite mixtures of ordered logistic regression models |
fmm: oprobit | Finite mixtures of ordered probit regression models |
fmm: pointmass | Finite mixture models with a density mass at a single point |
fmm: poisson | Finite mixtures of Poisson regression models |
fmm: probit | Finite mixtures of probit regression models |
fmm: regress | Finite mixtures of linear regression models |
fmm: streg | Finite mixtures of parametric survival models |
fmm: tobit | Finite mixtures of tobit regression models |
fmm: tpoisson | Finite mixtures of truncated Poisson regression models |
fmm: truncreg | Finite mixtures of truncated linear regression models |
Item response theory
irt 1pl | One-parameter logistic model |
---|---|
irt 2pl | Two-parameter logistic model |
irt 3pl | Three-parameter logistic model |
irt grm | Graded response model |
irt nrm | Nominal response model |
irt pcm | Partial credit model |
irt rsm | Rating scale model |
irt hybrid | Hybrid IRT models |
Many other estimation features in Stata are suitable for certain limited survey designs. For example, Stata's competing-risks regression routine (stcrreg) handles sampling weights properly when they are specified, and it also handles clustering.
Stata's mixed command for fitting multilevel linear models allows for both sampling weights and clustering. Sampling weights may be specified at every level of a multilevel model, so weights are necessarily treated differently in mixed than in other estimation commands. Some caution on the part of the user is required; see the section Survey data in [ME] mixed for details. Also see the example of using mixed with survey data.
estat effects computes the design effects DEFF and DEFT, as well as misspecification effects MEFF and MEFT. test, used after svy, computes adjusted Wald tests and Bonferroni tests for linear hypotheses (single or joint).
Here is an example of the use of svy: mean:
. webuse nhanes2

. svyset psu [pw=finalwgt], strata(strata)

Sampling weights: finalwgt
VCE: linearized
Single unit: missing
Strata 1: strata
Sampling unit 1: psu
FPC 1:

. svy: mean weight
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs = 10,351
Number of PSUs = 62    Population size = 117,157,513
Design df = 31

Variable | Mean | Linearized std. err. | 95% CI lower | 95% CI upper |
---|---|---|---|---|
weight | 71.90064 | .1654434 | 71.56321 | 72.23806 |
svyset, illustrated above, allows you to set the variables that contain the sampling weights, strata, and any PSU identifiers at the outset. These variables are remembered for subsequent commands and do not have to be reentered.
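A minimal sketch of what "remembered" means in practice, assuming the nhanes2 dataset and design declared above: once svyset has been run, later svy commands need no design options, and typing svyset alone redisplays the current settings. The commented-out multistage line uses placeholder variable names (district, school, wt, region, n_districts, n_schools) that are hypothetical and do not exist in nhanes2; it is shown only to indicate the general shape of a two-stage declaration.

```stata
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)

* Redisplay the design settings currently attached to the data
svyset

* Later commands pick up weights, strata, and PSUs automatically
svy: mean height
svy: proportion race

* Hypothetical two-stage design with finite population corrections
* (placeholder variable names; not runnable against nhanes2):
* svyset district [pweight=wt], strata(region) fpc(n_districts) || school, fpc(n_schools)
```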
Estimating the difference between two subpopulation means can be done by running svy: mean with an over() option to produce subpopulation estimates and then running lincom:
. svy: mean weight, over(sex)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs = 10,351
Number of PSUs = 62    Population size = 117,157,513
Design df = 31

c.weight@sex | Mean | Linearized std. err. | 95% CI lower | 95% CI upper |
---|---|---|---|---|
Male | 78.62789 | .2097761 | 78.20004 | 79.05573 |
Female | 65.70701 | .266384 | 65.16372 | 66.25031 |
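The lincom step can be sketched as follows. This assumes that sex is coded 1 for Male and 2 for Female in nhanes2 and that the subpopulation means are stored under names of the form c.weight@#.sex, as in recent Stata releases; the coeflegend option displays the stored names, so substitute whatever it reports if your version differs.

```stata
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)

* coeflegend lists the names under which the subpopulation means are stored
svy: mean weight, over(sex) coeflegend

* Difference between the male and female mean weights, with a
* design-based standard error and confidence interval.
* Assumes sex==1 is Male and sex==2 is Female.
lincom _b[c.weight@1.sex] - _b[c.weight@2.sex]
```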
svy: mean, svy: proportion, svy: ratio, and svy: total can produce estimates for multiple subpopulations:
. svy: mean weight, over(sex race)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs = 10,351
Number of PSUs = 62    Population size = 117,157,513
Design df = 31

c.weight@sex#race | Mean | Linearized std. err. | 95% CI lower | 95% CI upper |
---|---|---|---|---|
Male#White | 78.98862 | .2125203 | 78.55518 | 79.42206 |
Male#Black | 78.324 | .8476215 | 76.59526 | 80.05273 |
Male#Other | 68.16404 | 1.811668 | 64.46912 | 71.85896 |
Female#White | 65.10844 | .2926873 | 64.5115 | 65.70538 |
Female#Black | 72.38252 | 1.059851 | 70.22094 | 74.5441 |
Female#Other | 59.56941 | 1.325068 | 56.86692 | 62.27191 |
Use estat effects to report DEFF and DEFT.
. estat effects

c.weight@sex#race | Mean | Linearized std. err. | DEFF | DEFT |
---|---|---|---|---|
Male#White | 78.98862 | .2125203 | 1.15287 | 1.07372 |
Male#Black | 78.324 | .8476215 | 1.34608 | 1.16021 |
Male#Other | 68.16404 | 1.811668 | 2.08964 | 1.44556 |
Female#White | 65.10844 | .2926873 | 2.09219 | 1.44644 |
Female#Black | 72.38252 | 1.059851 | 1.93387 | 1.39064 |
Female#Other | 59.56941 | 1.325068 | 1.55682 | 1.24772 |
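The misspecification effects mentioned earlier are not part of the default estat effects display. As a hedged sketch, assuming the same nhanes2 design and that the deff, deft, meff, and meft options of estat effects are available after svy: mean in your Stata version, all four can be requested together:

```stata
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)
svy: mean weight, over(sex race)

* Report design effects (DEFF, DEFT) together with
* misspecification effects (MEFF, MEFT)
estat effects, deff deft meff meft
```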
Use estat size to report the number of observations belonging to each subpopulation and estimates of the subpopulation size.
. estat size

c.weight@sex#race | Mean | Linearized std. err. | Obs | Size |
---|---|---|---|---|
Male#White | 78.98862 | .2125203 | 4,312 | 49,504,800 |
Male#Black | 78.324 | .8476215 | 500 | 5,096,044 |
Male#Other | 68.16404 | 1.811668 | 103 | 1,558,636 |
Female#White | 65.10844 | .2926873 | 4,753 | 53,494,749 |
Female#Black | 72.38252 | 1.059851 | 586 | 6,093,192 |
Female#Other | 59.56941 | 1.325068 | 97 | 1,410,092 |
You can fit a wide variety of models using svy estimators (see the tables above for a list of available commands). Shown below is an example of svy: logit, which fits logistic regressions for survey data.
. webuse nhanes2d

. svy: logit highbp height weight age c.age#c.age female black
(running logit on estimation sample)

Survey: Logistic regression

Number of strata = 31    Number of obs = 10,351
Number of PSUs = 62    Population size = 117,157,513
Design df = 31
F(6, 26) = 231.75
Prob > F = 0.0000

highbp | Coefficient | Linearized std. err. | t | P>\|t\| | 95% CI lower | 95% CI upper |
---|---|---|---|---|---|---|
height | -.0345643 | .0053121 | -6.51 | 0.000 | -.0453985 | -.0237301 |
weight | .051004 | .0025292 | 20.17 | 0.000 | .0458457 | .0561622 |
age | .0554544 | .0127859 | 4.34 | 0.000 | .0293774 | .0815314 |
c.age#c.age | -.0000676 | .0001385 | -0.49 | 0.629 | -.0003502 | .0002149 |
female | -.4758698 | .0561318 | -8.48 | 0.000 | -.5903513 | -.3613882 |
black | .338201 | .1075191 | 3.15 | 0.004 | .1189143 | .5574877 |
_cons | -.5140351 | .8747001 | -0.59 | 0.561 | -2.297998 | 1.269928 |
svy: logit can display estimates as coefficients or as odds ratios. Below we redisplay the previous model, requesting that the estimates be expressed as odds ratios.
. svy: logit, or

Survey: Logistic regression

Number of strata = 31    Number of obs = 10,351
Number of PSUs = 62    Population size = 117,157,513
Design df = 31
F(6, 26) = 231.75
Prob > F = 0.0000

highbp | Odds ratio | Linearized std. err. | t | P>\|t\| | 95% CI lower | 95% CI upper |
---|---|---|---|---|---|---|
height | .9660262 | .0051317 | -6.51 | 0.000 | .9556166 | .9765492 |
weight | 1.052327 | .0026615 | 20.17 | 0.000 | 1.046913 | 1.057769 |
age | 1.057021 | .013515 | 4.34 | 0.000 | 1.029813 | 1.084947 |
c.age#c.age | .9999324 | .0001385 | -0.49 | 0.629 | .9996499 | 1.000215 |
female | .6213444 | .0348772 | -8.48 | 0.000 | .5541326 | .6967085 |
black | 1.402422 | .1507872 | 3.15 | 0.004 | 1.126273 | 1.74628 |
_cons | .5980774 | .5231384 | -0.59 | 0.561 | .1004598 | 3.560595 |
After running a logistic regression, you can use lincom to compute odds ratios for any covariate group relative to another.
. lincom female + black, or

( 1) [highbp]female + [highbp]black = 0

highbp | Odds ratio | Std. err. | t | P>\|t\| | 95% CI lower | 95% CI upper |
---|---|---|---|---|---|---|
(1) | .8713873 | .1233177 | -0.97 | 0.338 | .6529215 | 1.162951 |
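The adjusted Wald and Bonferroni tests described earlier on this page can be run on the same model. The following is a sketch rather than a recommended testing strategy; it assumes nhanes2d is already svyset, as the example above implies, and the choice of female and black as the tested coefficients is only for illustration.

```stata
webuse nhanes2d, clear
svy: logit highbp height weight age c.age#c.age female black

* Joint test that the female and black coefficients are both zero
* (after svy estimation, test reports an adjusted Wald test by default)
test female black

* Same hypotheses, adding Bonferroni-adjusted p-values for each one
test female black, mtest(bonferroni)

* Unadjusted Wald test, for comparison
test female black, nosvyadjust
```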
You can also fit regression models for a subpopulation:
. svy, subpop(black): logistic highbp age female
(running logistic on estimation sample)

Survey: Logistic regression

Number of strata = 30    Number of obs = 10,013
Number of PSUs = 60    Population size = 113,415,086
Subpop. no. obs = 1,086
Subpop. size = 11,189,236
Design df = 30
F(2, 29) = 83.52
Prob > F = 0.0000

highbp | Odds ratio | Linearized std. err. | t | P>\|t\| | 95% CI lower | 95% CI upper |
---|---|---|---|---|---|---|
age | 1.060226 | .0047619 | 13.02 | 0.000 | 1.050546 | 1.069996 |
female | .8280475 | .1063299 | -1.47 | 0.152 | .6370331 | 1.076338 |
_cons | .0791591 | .0185411 | -10.83 | 0.000 | .0490631 | .1277163 |
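A subpopulation does not have to be an existing indicator variable; subpop() also accepts an if expression. The sketch below assumes the nhanes2d variables used above (female, age, weight), and the age cutoff of 50 is an arbitrary value chosen for illustration.

```stata
webuse nhanes2d, clear

* Subpopulation given by an indicator variable (nonzero = member)
svy, subpop(black): logistic highbp age female

* Subpopulation given by an if expression inside subpop()
svy, subpop(if female == 1 & age >= 50): mean weight
```

Defining the group inside subpop(), rather than dropping observations or using a plain if qualifier on the estimation command, keeps the full design information available for variance estimation.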
Survey data require some special data management. svydescribe can be used to examine the design structure of the dataset. It can also be used to see the number of missing and nonmissing observations per stratum (or optionally per stage) for one or more variables.
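. svydescribe hdresult

Survey: Describing stage 1 sampling units

Sampling weights: finalwgt
VCE: linearized
Single unit: missing
Strata 1: strata
Sampling unit 1: psu
FPC 1:

Stratum | Units included | Units omitted | Obs with complete data | Obs with missing data | Min obs per included unit | Mean obs per included unit | Max obs per included unit |
---|---|---|---|---|---|---|---|
1 | 1* | 1 | 114 | 266 | 114 | 114.0 | 114 |
2 | 1* | 1 | 98 | 87 | 98 | 98.0 | 98 |
3 | 2 | 0 | 277 | 71 | 116 | 138.5 | 161 |
4 | 2 | 0 | 340 | 120 | 160 | 170.0 | 180 |
5 | 2 | 0 | 173 | 79 | 81 | 86.5 | 92 |
6 | 2 | 0 | 255 | 43 | 116 | 127.5 | 139 |
7 | 2 | 0 | 409 | 67 | 191 | 204.5 | 218 |
8 | 2 | 0 | 299 | 39 | 129 | 149.5 | 170 |
9 | 2 | 0 | 218 | 26 | 85 | 109.0 | 133 |
10 | 2 | 0 | 233 | 29 | 103 | 116.5 | 130 |
11 | 2 | 0 | 238 | 37 | 97 | 119.0 | 141 |
12 | 2 | 0 | 275 | 39 | 121 | 137.5 | 154 |
13 | 2 | 0 | 297 | 45 | 123 | 148.5 | 174 |
14 | 2 | 0 | 355 | 50 | 167 | 177.5 | 188 |
15 | 2 | 0 | 329 | 51 | 151 | 164.5 | 178 |
16 | 2 | 0 | 280 | 56 | 134 | 140.0 | 146 |
17 | 2 | 0 | 352 | 41 | 155 | 176.0 | 197 |
18 | 2 | 0 | 335 | 24 | 135 | 167.5 | 200 |
20 | 2 | 0 | 240 | 45 | 95 | 120.0 | 145 |
21 | 2 | 0 | 198 | 16 | 91 | 99.0 | 107 |
22 | 2 | 0 | 263 | 38 | 116 | 131.5 | 147 |
23 | 2 | 0 | 304 | 37 | 143 | 152.0 | 161 |
24 | 2 | 0 | 388 | 50 | 182 | 194.0 | 206 |
25 | 2 | 0 | 239 | 17 | 106 | 119.5 | 133 |
26 | 2 | 0 | 240 | 21 | 119 | 120.0 | 121 |
27 | 2 | 0 | 259 | 24 | 127 | 129.5 | 132 |
28 | 2 | 0 | 284 | 15 | 131 | 142.0 | 153 |
29 | 2 | 0 | 440 | 63 | 193 | 220.0 | 247 |
30 | 2 | 0 | 326 | 39 | 147 | 163.0 | 179 |
31 | 2 | 0 | 279 | 29 | 121 | 139.5 | 158 |
32 | 2 | 0 | 383 | 67 | 180 | 191.5 | 203 |
31 (total) | 60 | 2 | 8,720 | 1,631 | 81 | 145.3 | 247 |

Total obs (complete + missing): 10,351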
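In the output above, strata 1 and 2 each retain only one sampling unit once observations with missing hdresult are set aside. A short sketch of how such strata might be isolated, assuming hdresult is present in nhanes2d with the same design settings and that the single option of svydescribe is available in your Stata version:

```stata
webuse nhanes2d, clear

* Full design description, counting missing values of hdresult
svydescribe hdresult

* Restrict the display to strata left with a single sampling unit
svydescribe hdresult, single
```

Strata flagged this way usually need attention before estimation, because variance estimates for a stratum with one sampling unit are reported as missing under the default singleunit(missing) setting shown in the svyset output.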