Title | How does dtable handle survey data? |
Author | Mia Lv, StataCorp |
If your survey data have already been svyset, generating a table of descriptive statistics is straightforward: specify the svy option with dtable. There is no need to respecify the survey weights with dtable. All the statistics are then calculated using the specified survey weights where applicable, and all the tests are calculated using the full survey settings, including clustering and stratification. In this FAQ, we discuss statistics and tests separately.
When you specify svy with dtable, the default sample frequency statistic is the sum of the weights (sumw). To report the unweighted frequency instead, specify the option sample( , statistics(frequency)). For example,
. webuse nhanes2l, clear
(Second National Health and Nutrition Examination Survey)
. svyset psu [pweight=finalwgt], strata(strata)
(output omitted)
. dtable, svy continuous(age , statistics(mean sd)) continuous(weight , statistics(p50)) factor(sex,statistics(fvfrequency fvproportion))
               Summary
N              117,157,513
Age (years)    42.253 (15.502)
Weight (kg)    70.420
Sex
  Male         56,159,480  0.479
  Female       60,998,033  0.521
. dtable, svy continuous(age , statistics(mean sd)) continuous(weight , statistics(p50)) factor(sex,statistics(fvfrequency fvproportion)) sample(Frequency,statistics(frequency))
               Summary
Frequency      10,351
Age (years)    42.253 (15.502)
Weight (kg)    70.420
Sex
  Male         56,159,480  0.479
  Female       60,998,033  0.521
We see that the first table reports the sum of the weights and the second one reports the sample size (frequency). In both tables, the frequencies reported for the levels of sex are weighted.
Statistics for continuous and factor variables are computed using the weights previously specified with svyset. This means that we can reproduce these statistics by specifying the weights with dtable and dropping the svy option.
. dtable [pweight=finalwgt], continuous(age , statistics(mean sd)) continuous(weight , statistics(p50)) factor(sex,statistics(fvfrequency fvproportion))
               Summary
N              117,157,513
Age (years)    42.253 (15.502)
Weight (kg)    70.420
Sex
  Male         56,159,480  0.479
  Female       60,998,033  0.521
To see the detailed formulas used to calculate statistics when weights are applied, see Methods and formulas in [R] table.
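As a rough cross-check of the point estimates above, the sketch below recomputes a few of them with standard estimation commands. It assumes the same nhanes2l dataset and svyset settings used in this FAQ. It is not how dtable computes its results internally, and the percentile definition used by _pctile is only assumed to correspond to the one dtable applies.

```stata
* Sketch: cross-check selected dtable figures (an illustration only,
* not dtable's internal method)
webuse nhanes2l, clear
svyset psu [pweight=finalwgt], strata(strata)

* Sum of the sampling weights, which dtable reports as N under svy
quietly summarize finalwgt
display %15.0fc r(sum)

* Weighted mean of age (compare the point estimate only)
svy: mean age

* Weighted proportion of each level of sex
svy: proportion sex

* Weighted median of weight; agreement with dtable's p50 is an assumption
_pctile weight [pweight=finalwgt], percentiles(50)
display r(r1)
```

Only the point estimates are comparable here. The standard errors shown by svy: mean and svy: proportion are not statistics that appear in the tables above.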
If instead your goal is to report descriptive statistics for a subpopulation, you need to specify both the svy and subpop() options with dtable. You can reproduce the reported statistics by specifying the weights and an if qualifier with dtable; the only exceptions are the variance and sd statistics, which use different formulas for subpopulation estimation.
For example, the following two commands will report identical results for all the statistics except variance and sd.
. dtable, svy subpop(if region==1) continuous(age, statistics(mean variance sd semean)) continuous(weight , statistics(p50)) factor(sex,statistics(fvfrequency fvproportion))
               Summary
N              24,237,893
Age (years)    43.185  239.608  (15.479)  0.355
Weight (kg)    70.420
Sex
  Male         11,880,038  0.490
  Female       12,357,855  0.510
. dtable [pweight=finalwgt] if region==1, continuous(age, statistics(mean variance sd semean)) continuous(weight , statistics(p50)) factor(sex,statistics(fvfrequency fvproportion))
               Summary
N              24,237,893
Age (years)    43.185  244.896  (15.649)  0.355
Weight (kg)    70.420
Sex
  Male         11,880,038  0.490
  Female       12,357,855  0.510
The formula for the subpopulation variance is documented in Methods and formulas in [R] dtable.
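To illustrate that the two approaches agree on point estimates such as the mean, the following sketch compares subpopulation estimation with weighted estimation restricted by an if qualifier. It assumes the nhanes2l data and svyset settings used above; the scalar names m_subpop and m_if are hypothetical names introduced only for this example.

```stata
* Sketch: the weighted mean of age in region 1, computed two ways
webuse nhanes2l, clear
svyset psu [pweight=finalwgt], strata(strata)

* Subpopulation estimation: all observations stay in the survey design
svy, subpop(if region==1): mean age
scalar m_subpop = _b[age]

* Weighted estimation on the observations selected by the if qualifier
mean age [pweight=finalwgt] if region==1
scalar m_if = _b[age]

* The two point estimates should agree up to numerical precision
display m_subpop - m_if
```

This comparison covers the mean only. It says nothing about the variance and sd statistics, which are the ones that differ between the two tables above.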
Please note that the svy option changes the list of tests supported by dtable. For continuous variables, the Kruskal–Wallis rank test (kwallis) is not allowed with svy. For factor variables, the following tests are not allowed with svy: Fisher's exact test (fisher), likelihood-ratio \(\chi^2\) test (lrchi2), Goodman and Kruskal's gamma (gamma), Kendall's \(\tau\) (kendall), and Cramér's V (cramer). Conversely, the survey-adjusted likelihood-ratio test (svylr), survey-adjusted Wald test (svywald), and survey-adjusted log-linear Wald test (svyllwald) are allowed only with svy.
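The sketch below shows one way to see this restriction in practice, using the same data and survey settings as the rest of this FAQ. It requests a survey-adjusted test first and then a test that is not allowed with svy. The exact error message and return code are not reproduced here because they are not documented in this FAQ.

```stata
* Sketch: a test allowed only with svy, then one that svy disallows
webuse nhanes2l, clear
svyset psu [pweight=finalwgt], strata(strata)

* Survey-adjusted likelihood-ratio test for sex across levels of race
dtable, svy factor(sex, test(svylr)) by(race, tests nototal)

* Fisher's exact test is not allowed with svy, so an error is expected;
* capture noisily shows the message without stopping a do-file
capture noisily dtable, svy factor(sex, test(fisher)) by(race, tests nototal)
display _rc
```

A nonzero value of _rc after the second command indicates that dtable rejected the request.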
When the svy or subpop() option is specified with dtable, the tests for continuous variables are computed using the prefix svy: or svy, subpop(): with regress, poisson, or gsem. For factor variables, the tests are computed using the prefix svy: or svy, subpop(): with tabulate twoway. Please refer to Methods and formulas in [R] dtable for details. Below, we demonstrate how to reproduce the test results for both continuous and factor variables.
. webuse nhanes2l, clear
(Second National Health and Nutrition Examination Survey)
. svyset psu [pweight=finalwgt], strata(strata)
Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>
. dtable, svy subpop(if region==1) continuous(age , test(regress)) continuous(weight, test(poisson)) factor(sex, test(svywald)) by(race,tests nototal)
note: using test regress across levels of race for age.
note: using test poisson across levels of race for weight.
note: using test svywald across levels of race for sex.
               Race
               White                Black              Other            Test
N              22,970,498 (94.8%)   1,112,539 (4.6%)   154,856 (0.6%)
Age (years)    43.285 (15.483)      41.626 (15.492)    39.625 (13.223)  0.617
Weight (kg)    71.494 (14.640)      75.437 (16.948)    56.621 (10.332)  0.010
Sex
  Male         11,314,500 (49.3%)   499,951 (44.9%)    65,587 (42.4%)   0.079
  Female       11,655,998 (50.7%)   612,588 (55.1%)    89,269 (57.6%)
          |  Race
Sex       |  White   Black   Other   Total
Male      |  .4668   .0206   .0027   .4901
Female    |  .4809   .0253   .0037   .5099
Total     |  .9477   .0459   .0064       1
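The commands below sketch how tests of this kind could be cross-checked by hand. They assume that dtable compares the groups through a joint Wald test on the indicators for race after the survey regression, and that the wald option of svy: tabulate corresponds to dtable's svywald test. Both are assumptions made for illustration; the authoritative description is Methods and formulas in [R] dtable.

```stata
* Sketch: fitting the models behind the tests directly (assumed mapping)
webuse nhanes2l, clear
svyset psu [pweight=finalwgt], strata(strata)

* Continuous variable with test(regress): linear regression on race,
* followed by a joint test that the race coefficients are zero
svy, subpop(if region==1): regress age i.race
testparm i.race

* Continuous variable with test(poisson): same idea with a Poisson model
svy, subpop(if region==1): poisson weight i.race
testparm i.race

* Factor variable with test(svywald): two-way table with a Wald test
svy, subpop(if region==1): tabulate sex race, wald
```

If the assumed mapping holds, the p-values from these commands should line up with the Test column of the dtable output above.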