Title:  Stata 6: Survey and robust estimators
Author: Bill Sribney, StataCorp
Here is Table 4 from the AmStat review [Cohen, S.B. (1997) “An Evaluation of Alternative PC-Based Software Packages Developed for the Analysis of Complex Survey Data.” American Statistician 51(3): 285-292.]:
Table 4. Approximate Execution Times for PC Software Packages to Produce Required Output
-------------------------------------------------------------------------------
Type of statistic        Stata          SUDAAN        WesVarPC
-------------------------------------------------------------------------------
Means  (n = 34,459)      18 minutes     <1 minute     15 minutes
Totals (n = 34,459)      20 minutes     a             a
Means  (n = 28,704)      13 minutes     <1 minute     12 minutes
Totals (n = 28,704)      14 minutes     a             a
Ratios (n = 34,459)      64 minutes     <1 minute     33 minutes
Totals (n = 34,459)      23 minutes     b             b
-------------------------------------------------------------------------------
(a) Included as output with mean estimates.
(b) Included as output with ratio estimates.
By default, the svy commands compute the covariance for all combinations of variables and subgroups. If there are several variables and a lot of subgroups, this can be a sizable computation.
If the commands are run differently (one variable at a time), one can get the same output in less time. I estimate that each of the runs in Table 4 of the article could have been done in 10 minutes or less (on the reviewer's computer), rather than the 13-64 minutes cited in the table.
One likely only wants the covariances of different subgroups for the same variable. One can get only these covariances by estimating for one variable per command. This is significantly faster when there are lots of subgroups (> ~10).
For example, consider the command:
. svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)
Age has 5 categories, race3 has 3, and smpsexr has 2. Thus there are a total of 5 x 3 x 2 = 30 subgroups. There are 3 variables, and the svymean command, in the course of estimating the 90 means, by default, also computes the 90 x 90 covariance matrix (90*91/2 = 4095 elements).
The covariances are useful if you then want to estimate the subgroup differences using the svylc command. However, it is unlikely that you want to estimate the difference of totalexp and totalsp1 in different subgroups; rather you only want to estimate the differences of totalexp (and separately totalsp1 and totalsp2) in different subgroups. Thus you can run the three commands
. svymean totalexp, by(age race3 smpsexr)
. svymean totalsp1, by(age race3 smpsexr)
. svymean totalsp2, by(age race3 smpsexr)
Since each of these commands computes only a 30 x 30 covariance matrix, it is faster (only 30*31/2 + 30*31/2 + 30*31/2 = 465 + 465 + 465 = 1395 elements are computed).
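The element counts above follow from the fact that a symmetric k x k covariance matrix has k*(k+1)/2 distinct elements. As a minimal sketch, the arithmetic can be checked with Stata's display command used as a calculator (display is not part of the original runs; it appears here only for illustration):

```stata
* Distinct elements of a symmetric k x k covariance matrix: k*(k+1)/2

* One command with 3 variables x 30 subgroups = 90 means
display 90*91/2
* displays 4095

* Three commands, each with 1 variable x 30 subgroups = 30 means
display 3*(30*31/2)
* displays 1395
```

The one-variable-per-command approach therefore computes roughly a third as many covariance elements in this example.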
If you don’t care about the covariances (and don’t intend to estimate subgroup differences), you can simply use the available option.
. svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr) available
The available option automatically does computations one variable at a time. It is equivalent to running the three commands above, with one variable per command (in fact, this is what the code actually does).
The ratio timings in the review were particularly slow since 6 ratios for each subgroup were computed in each command. For example, for the svyratio command run by(age race3 smpsexr), 6 x 30 = 180 ratios were computed, along with their 180 x 180 covariance matrix.
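The same k*(k+1)/2 count suggests why the ratio runs were so expensive. The sketch below assumes the six ratios are split into six one-ratio commands, each over the same 30 subgroups; the split counts are my arithmetic, not figures from the review:

```stata
* One svyratio command with 6 ratios x 30 subgroups = 180 ratios
display 180*181/2
* displays 16290

* Six svyratio commands, each with 1 ratio x 30 subgroups = 30 ratios
display 6*(30*31/2)
* displays 2790
```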
I have attempted to duplicate two of the timings from Table 4 (p. 289) of the AmStat review.
I duplicated the runs using the same commands the reviewer used, and then again using a set of commands that produced the same output much faster.
I used simulated data based on the description of the data in the review, so the number of observations and number of subgroups were the same. These are the main factors that affect the timings, so there should not be much difference due to the different data sets.
Table I: My timings compared to reviewer's timings
-------------------------------------------------------------------------------
                                       My timings(2)
                           ---------------------------------
                           Duplication of      Faster set
Statistic   Reviewer(1)    reviewer's runs     of commands     fast/slow
                 A                B                 C             C/B
-------------------------------------------------------------------------------
Means         18 min.       2.80 min.(3)       1.87 min.(4)      67 %
Ratios        64 min.       62.1 min.(5)       8.9  min.(6)      14 %
-------------------------------------------------------------------------------
Notes:
Comments:
I am baffled as to why the reviewer’s “ratios” run and my duplication of it had about the same timings, whereas our “means” runs were so different. I'd expect my runs to be about three times faster given my faster machine.
Table II: Some command-by-command comparisons
-------------------------------------------------------------------------------
Command                                                            Time
-------------------------------------------------------------------------------
svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)          66.5 sec.

svymean totalexp, by(age race3 smpsexr)                             8.8 sec.
svymean totalsp1, by(age race3 smpsexr)                             8.1 sec.
svymean totalsp2, by(age race3 smpsexr)                             8.0 sec.
                                                                   ---------
                                                            total  24.9 sec.
-------------------------------------------------------------------------------
svyratio totalsp1/totalexp totalsp2/totalexp totalsp3/totalexp
         totalsp4/totalexp totalsp5/totalexp totalsp6/totalexp,
         by(age race3 smpsexr)                                     35.6 min.

svyratio totalsp1/totalexp, by(age race3 smpsexr)                  0.49 min.
svyratio totalsp2/totalexp, by(age race3 smpsexr)                  0.49 min.
svyratio totalsp3/totalexp, by(age race3 smpsexr)                  0.48 min.
svyratio totalsp4/totalexp, by(age race3 smpsexr)                  0.48 min.
svyratio totalsp5/totalexp, by(age race3 smpsexr)                  0.48 min.
svyratio totalsp6/totalexp, by(age race3 smpsexr)                  0.53 min.
                                                                   ----------
                                                            total  2.95 min.
-------------------------------------------------------------------------------
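The text does not say how the command-by-command timings were collected. One possible way to gather comparable figures, offered here only as an assumption, is Stata's set rmsg on setting, which reports the elapsed time after each command. The sketch assumes the data are already in memory and the survey design variables have already been declared:

```stata
* Sketch only: "set rmsg on" is an assumed timing method, not taken
* from the review or from the timings reported in the table.
set rmsg on

* All three variables in one command
svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)

* One variable per command
svymean totalexp, by(age race3 smpsexr)
svymean totalsp1, by(age race3 smpsexr)
svymean totalsp2, by(age race3 smpsexr)

set rmsg off
```

Summing the three single-variable times and comparing the total with the combined run gives the kind of comparison shown in the table.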
Table III: Commands used by reviewer for means (from p. 288 of review)
-------------------------------------------------------------------------------
     Command                                                Number of subgroups
-------------------------------------------------------------------------------
 1.  svymean totalexp totalsp1 totalsp2                             0
 2.  svymean totalexp totalsp1 totalsp2, by(age)                    5
 3.  svymean totalexp totalsp1 totalsp2, by(smpsexr)                2
 4.  svymean totalexp totalsp1 totalsp2, by(race3)                  3
 5.  svymean totalexp totalsp1 totalsp2, by(povstal)                5
 6.  svymean totalexp totalsp1 totalsp2, by(ratehlth)               4
 7.  svymean totalexp totalsp1 totalsp2, by(ssmsa)                  4
 8.  svymean totalexp totalsp1 totalsp2, by(sregion)                4
 9.  svymean totalexp totalsp1 totalsp2, by(cendiv)                 9
10.  svymean totalexp totalsp1 totalsp2, by(povstal ratehlth)       5 x 4 = 20
11.  svymean totalexp totalsp1 totalsp2, by(age race3)              5 x 3 = 15
12.  svymean totalexp totalsp1 totalsp2, by(age smpsexr)            5 x 2 = 10
13.  svymean totalexp totalsp1 totalsp2, by(race3 smpsexr)          3 x 2 = 6
14.  svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)      5 x 3 x 2 = 30
-------------------------------------------------------------------------------
Table IV: Faster way to get the same output as Table III commands
-------------------------------------------------------------------------------
 1-9.   (use same commands as Table III)

10.     svymean totalexp, by(povstal ratehlth)
        svymean totalsp1, by(povstal ratehlth)
        svymean totalsp2, by(povstal ratehlth)

11.     svymean totalexp, by(age race3)
        svymean totalsp1, by(age race3)
        svymean totalsp2, by(age race3)

12-13.  (use same commands as Table III)

14.     svymean totalexp, by(age race3 smpsexr)
        svymean totalsp1, by(age race3 smpsexr)
        svymean totalsp2, by(age race3 smpsexr)
-------------------------------------------------------------------------------
Table V: Commands used by reviewer for ratios (as implied by text)
-------------------------------------------------------------------------------
 1.  svyratio totalsp1/totalexp totalsp2/totalexp totalsp3/totalexp
              totalsp4/totalexp totalsp5/totalexp totalsp6/totalexp
 2.  svyratio       "       , by(age)
 3.  svyratio       "       , by(smpsexr)
 4.  svyratio       "       , by(race3)
 5.  svyratio       "       , by(povstal)
 6.  svyratio       "       , by(ratehlth)
 7.  svyratio       "       , by(ssmsa)
 8.  svyratio       "       , by(sregion)
 9.  svyratio       "       , by(cendiv)
10.  svyratio       "       , by(povstal ratehlth)
11.  svyratio       "       , by(age race3)
12.  svyratio       "       , by(age smpsexr)
13.  svyratio       "       , by(race3 smpsexr)
14.  svyratio       "       , by(age race3 smpsexr)
-------------------------------------------------------------------------------
Table VI: Faster way to get the same output as Table V commands
-------------------------------------------------------------------------------
 1-8.  (use same commands as Table V)

 9.    svyratio totalsp1/totalexp, by(cendiv)
       svyratio totalsp2/totalexp, by(cendiv)
       svyratio totalsp3/totalexp, by(cendiv)
       svyratio totalsp4/totalexp, by(cendiv)
       svyratio totalsp5/totalexp, by(cendiv)
       svyratio totalsp6/totalexp, by(cendiv)

10.    svyratio totalsp1/totalexp, by(povstal ratehlth)
       svyratio totalsp2/totalexp, by(povstal ratehlth)
       ...

11.    svyratio totalsp1/totalexp, by(age race3)
       svyratio totalsp2/totalexp, by(age race3)
       ...

12.    svyratio totalsp1/totalexp, by(age smpsexr)
       svyratio totalsp2/totalexp, by(age smpsexr)
       ...

13.    svyratio totalsp1/totalexp, by(race3 smpsexr)
       svyratio totalsp2/totalexp, by(race3 smpsexr)
       ...

14.    svyratio totalsp1/totalexp, by(age race3 smpsexr)
       svyratio totalsp2/totalexp, by(age race3 smpsexr)
       ...
-------------------------------------------------------------------------------
Even if you run the commands as suggested above (Tables IV and VI), Stata’s svy commands are still slower than the equivalent SUDAAN runs. Unless there are dozens of subgroups, this difference should be only a few minutes.
This is because of the following:
Item (1) will be changed in the next release of Stata, so that the available option computes no covariances whatsoever.
I, not the reviewer, was responsible for the way the commands were run in the review. I told the reviewer to put more than one variable in each command, since he was counting the total number of commands as a measure of “ease of application”.
As shown above, putting several variables in one command is slightly more efficient when there are only a few subgroups (< ∼10), but not when there are many subgroups. I overlooked this fact when advising the reviewer.