Title | Combine multiple tables obtained with table or dtable using collect | |
Author |
Mia Lv, StataCorp Chris Cheng, StataCorp |
You may already be familiar with the new table command, but you are wondering how to combine multiple different tables obtained with table into one table (by appending them vertically or horizontally). For example, you may want to combine
. sysuse auto (1978 automobile data) . table (rep78), statistic(mean price) statistic(sd price)
Mean Standard deviation | ||
Repair record 1978 | ||
1 | 4564.5 522.5519 | |
2 | 5967.625 3579.357 | |
3 | 6429.233 3525.14 | |
4 | 6071.5 1709.608 | |
5 | 5913 2615.763 | |
Total | 6146.043 2912.44 | |
and
. table (foreign), statistic(mean price) statistic(sd price)
Mean Standard deviation | ||
Car origin | ||
Domestic | 6072.423 3097.104 | |
Foreign | 6384.682 2621.915 | |
Total | 6165.257 2949.496 | |
into one table:
Mean Standard deviation | ||
Repair record 1978 | ||
1 | 4564.5 522.5519 | |
2 | 5967.625 3579.357 | |
3 | 6429.233 3525.14 | |
4 | 6071.5 1709.608 | |
5 | 5913 2615.763 | |
Total | 6146.043 2912.44 | |
Car origin | ||
Domestic | 6072.423 3097.104 | |
Foreign | 6384.682 2621.915 | |
Total | 6165.257 2949.496 | |
Before we see the solution, I want to first show you a (commonly used) wrong way to combine two tables. Many users thought we could just use table with two variables in the row dimension because the final table has two row variables:
. table (foreign rep78), statistic(mean price) statistic(sd price)
However, this syntax will generate a different table using crossed categories between foreign and rep78 instead of including them independently. This syntax is not wrong; however, it just does not fit our purpose of appending two tables together.
Mean Standard deviation | ||
Car origin | ||
Domestic | ||
Repair record 1978 | ||
1 | 4564.5 522.5519 | |
2 | 5967.625 3579.357 | |
3 | 6607.074 3661.267 | |
4 | 5881.556 1592.019 | |
5 | 4204.5 311.8341 | |
Total | 6179.25 3188.969 | |
Foreign | ||
Repair record 1978 | ||
3 | 4828.667 1285.613 | |
4 | 6261.444 1896.092 | |
5 | 6292.667 2765.629 | |
Total | 6070.143 2220.984 | |
Total | ||
Repair record 1978 | ||
1 | 4564.5 522.5519 | |
2 | 5967.625 3579.357 | |
3 | 6429.233 3525.14 | |
4 | 6071.5 1709.608 | |
5 | 5913 2615.763 | |
Total | 6146.043 2912.44 | |
we can see that the rows are identified by two variables jointly while rep78 is nested within foreign.
Let’s continue with the example at the beginning of this FAQ, which appends two tables vertically. The appropriate way to generate the combined table is to specify the append option with each table command except the first one and then use collect layout to include all the categorical variables as the row tags.
Here is the code to generate the combined table that we are looking for:
. sysuse auto, clear (1978 automobile data) . collect clear . table (rep78), statistic(mean price) statistic(sd price) . table (foreign), statistic(mean price) statistic(sd price) append . collect layout (rep78 foreign) (result)
The output table looks like
Mean Standard deviation | ||
Repair record 1978 | ||
1 | 4564.5 522.5519 | |
2 | 5967.625 3579.357 | |
3 | 6429.233 3525.14 | |
4 | 6071.5 1709.608 | |
5 | 5913 2615.763 | |
Total | 6146.043 2912.44 | |
Car origin | ||
Domestic | 6072.423 3097.104 | |
Foreign | 6384.682 2621.915 | |
Total | 6165.257 2949.496 | |
Let us explain how this code works.
. collect clear
. table (rep78), statistic(mean price) statistic(sd price)
. table (foreign), statistic(mean price) statistic(sd price) append
After the second table command, the collect layout specification shows only the second table by default. We can type collect layout to double-check the current setting.
. collect layout Collection: Table Rows: foreign Columns: result Table 1: 4 x 2
Mean Standard deviation | ||
Car origin | ||
Domestic | 6072.423 3097.104 | |
Foreign | 6384.682 2621.915 | |
Total | 6165.257 2949.496 | |
We see the rows only include the levels of foreign. Now let’s add rep78 to the rows using the following syntax:
. collect layout (rep78 foreign) (result)
This above syntax allows rep78 (all the levels) and foreign (all the levels) to be used to identify the table rows. And we do not need to change other specifications such as columns or the collection name.
On the other hand, if we want to include only the result of certain levels of any categorical variable instead of all the levels in the combined table, we can do that by including the levels we want in the row specification. Here is an example:
. collect layout (rep78[2 3 4] foreign[1]) (result) Collection: Table Rows: rep78[2 3 4] foreign[1] Columns: result Table 1: 6 x 2
Mean Standard deviation | ||
Repair record 1978 | ||
2 | 5967.625 3579.357 | |
3 | 6429.233 3525.14 | |
4 | 6071.5 1709.608 | |
Car origin | ||
Foreign | 6384.682 2621.915 | |
If you want to hide the variable label Car origin
in the table row header for the variable foreign, you can use the following command:
. collect style header foreign, title(hide)
And here is the output table:
. collect layout Collection: Table Rows: rep78[2 3 4] foreign[1] Columns: result Table 1: 5 x 2
Mean Standard deviation | ||
Repair record 1978 | ||
2 | 5967.625 3579.357 | |
3 | 6429.233 3525.14 | |
4 | 6071.5 1709.608 | |
Foreign | 6384.682 2621.915 | |
This method also works if you are appending more than two tables into one. Here is an example:
. webuse nhanes2l, clear (Second National Health and Nutrition Examination Survey) . collect clear . local varlist highlead highbp agegrp hlthstat heartatk diabetes . foreach var of local varlist { table (`var'), statistic(frequency) statistic(percent) name(Table) append totals(`var') } . collect layout (`varlist') (result)
Frequency Percent | ||
High lead level | ||
lead<25 | 4,655 94.08 | |
lead>=25 | 293 5.92 | |
High blood pressure | ||
0 | 5,975 57.72 | |
1 | 4,376 42.28 | |
Age group | ||
20–29 | 2,320 22.41 | |
30–39 | 1,622 15.67 | |
40–49 | 1,272 12.29 | |
50–59 | 1,291 12.47 | |
60–69 | 2,860 27.63 | |
70+ | 986 9.53 | |
Health status | ||
Excellent | 2,407 23.29 | |
Very good | 2,591 25.07 | |
Good | 2,938 28.43 | |
Fair | 1,670 16.16 | |
Poor | 729 7.05 | |
Prior heart attack | ||
No heart attack | 9,873 95.40 | |
Had heart attack | 476 4.60 | |
Diabetes status | ||
Not diabetic | 9,850 95.18 | |
Diabetic | 499 4.82 | |
You may notice that in this example, append is specified with every table command, even the first. This is because we are using a for loop, and we make every table command have the same options. Although append is not necessary with the first table command, it will not affect anything because we start from an empty collection (we called collect clear before).
By the way, this example aims to show you how to combine multiple tables into one using a loop. However, this type of descriptive table may be created much more easily using one dtable command. For example,
. webuse nhanes2l, clear (Second National Health and Nutrition Examination Survey) . dtable, factor(highlead highbp agegrp hlthstat heartatk diabetes)
Summary | ||
N 10,351 High lead level lead<25 4,655 (94.1%) lead>=25 293 (5.9%) High blood pressure 0 5,975 (57.7%) 1 4,376 (42.3%) Age group 20–29 2,320 (22.4%) 30–39 1,622 (15.7%) 40–49 1,272 (12.3%) 50–59 1,291 (12.5%) 60–69 2,860 (27.6%) 70+ 986 (9.5%) Health status Excellent 2,407 (23.3%) Very good 2,591 (25.1%) Good 2,938 (28.4%) Fair 1,670 (16.2%) Poor 729 (7.1%) Prior heart attack No heart attack 9,873 (95.4%) Had heart attack 476 (4.6%) Diabetes status Not diabetic 9,850 (95.2%) Diabetic 499 (4.8%) | ||
Please note that dtable does not align each statistic like table does. Instead, it bundles all the statistics in the same cell and aligns the combined statistics as a whole. That is because dtable often deals with a mixture of different statistics to accommodate different variable types. However, if you want to change this style, you can further customize the table using the collect suite of commands after you call dtable. In this case, you can use the command
. collect layout (var) (result[fvfrequency fvpercent])
Let’s say you want to combine
Birthweight<2500g | ||
0 1 | ||
Race | ||
White | 73 23 | |
Black | 15 11 | |
Other | 42 25 | |
and
Smoked during pregnancy | ||
Nonsmoker Smoker | ||
Race | ||
White | 44 52 | |
Black | 16 10 | |
Other | 55 12 | |
into the following table:
Birthweight<2500g Smoked during pregnancy | ||
0 1 Nonsmoker Smoker | ||
Race | ||
White | 73 23 44 52 | |
Black | 15 11 16 10 | |
Other | 15 11 16 10 | |
It contains two side-by-side tables, which are combined horizontally. How can we do that?
You can use the similar method we adopted in example 1a. In the collect layout command, instead of specifying two row tags, you should specify two column tags. Here is the code to generate the above (combined) table:
. clear all . webuse lbw (Hosmer & Lemeshow data) . table race low, nototal . table race smoke, nototal append . collect layout (race) (low smoke)
As an alternative, you can also use dtable and collect to generate the same table:
. clear all . webuse lbw (Hosmer & Lemeshow data) . dtable, by(race, nototals) nosample factor(low smoke, statistics( fvfrequency)) style(table) . collect layout (race#result) (var)
dtable generates a descriptive table that is the transpose of our target table. Then we need to transpose that descriptive table by switching the order of row tags and column tags in collect layout to generate the table we want. For more detailed information about how to transpose a table, please see FAQ: How do you transpose a table generated with collect/table/dtable/etable?
By the way, dtable has a different default style than table. Here I specified the option style(table) with dtable to force it to have the same style as table so that the generated table will look exactly the same as the above combined table generated with table and collect.
Except for the above situations where we want to combine different tables with a joint row variable (horizontally), sometimes we want to combine tables with different row variables. For example, we want to combine these three one-way frequency tables:
. table var1
Frequency | ||
var1 | ||
1 | 7 | |
2 | 7 | |
3 | 5 | |
Total | 19 | |
. table var2
Frequency | ||
var2 | ||
1 | 8 | |
2 | 5 | |
3 | 2 | |
Total | 15 | |
. table var3
Frequency | ||
var3 | ||
1 | 2 | |
2 | 5 | |
3 | 6 | |
Total | 13 | |
into
Frequency | ||
var1 var2 var3 | ||
1 | 7 8 2 | |
2 | 7 5 5 | |
3 | 5 2 6 | |
Total | 19 15 13 | |
It makes sense to combine them because var1–var3 all have the same set of unique values (1,2,3).
When we have more than one command to collect items into a collection (we need to specify the append option to avoid clearing up the previously saved items), there is a dimension called cmdset tracking different items saved by different commands. We can list all of its levels by
. collect levelsof cmdset
Now we are thinking about how to utilize that dimension to define the columns in the final table. Our first try is
. clear all . set obs 20 . set seed 7011 . generate var1 = runiformint(1,3) if uniform()<0.8 . generate var2 = runiformint(1,3) if uniform()<0.8 . generate var3 = runiformint(1,3) if uniform()<0.8 . table var1, . table var2, append . table var3, append . collect layout (var1 var2 var3) (cmdset)
However, the output table looks like
Collection: Table Rows: var1 var2 var3 Columns: cmdset Table 1: 15 x 3
Command results index | ||
1 2 3 | ||
var1 | ||
1 | 7 | |
2 | 7 | |
3 | 5 | |
Total | 19 | |
var2 | ||
1 | 8 | |
2 | 5 | |
3 | 2 | |
Total | 15 | |
var3 | ||
1 | 2 | |
2 | 5 | |
3 | 6 | |
Total | 13 | |
Now we find that the table has nine rows instead of three rows as expected. This is because Stata sees var1[1], var2[1], and var3[1] as three different tags. Let’s fix that using collect remap and adjust the format as well.
. collect remap var2 = var1 . collect remap var3 = var1 . collect style header var1, title(hide) . collect style header cmdset, title(hide) . collect label levels cmdset 1 "var1", modify . collect label levels cmdset 2 "var2", modify . collect label levels cmdset 3 "var3", modify . collect layout (var1) (cmdset)
After collect remap, the dimensions var2 and var3 all become var1, so we can use var1 alone to define the rows for the table. The output table becomes
var1 var2 var3 | ||
1 | 7 8 2 | |
2 | 7 5 5 | |
3 | 5 2 6 | |
Total | 19 15 13 | |
That’s the exact table we want to generate.
The remap strategy in example 2b applies to circumstances where we want to combine tables with different row or column variables and those variables have the same set of values.
Now let’s go back to example 2a. What if we would like to put the two two-way tables
Birthweight<2500g | ||
0 1 | ||
Race | ||
White | 73 23 | |
Black | 15 11 | |
Other | 42 25 | |
and
Smoked during pregnancy | ||
Nonsmoker Smoker | ||
Race | ||
White | 44 52 | |
Black | 16 10 | |
Other | 55 12 | |
atop one another instead of side-by-side, given both low and smoke are binary? We can use the following code:
. clear all . webuse lbw (Hosmer & Lemeshow data) . table race low, nototal . table race smoke, nototal append . collect remap smoke = low . collect layout (cmdset#race) (low) . collect style header low cmdset, title(hide) . collect label levels cmdset 1Birthweight<2500g2Smoked during pregnancy. collect layout
This table has race nested within cmdset on rows, and low defines the columns (race is remapped to low).
The output table looks like
0 1 | ||
Birthweight<2500g | ||
Race | ||
White | 73 23 | |
Black | 15 11 | |
Other | 42 25 | |
Smoked during pregnancy | ||
Race | ||
White | 44 52 | |
Black | 16 10 | |
Other | 55 12 | |
In the following example, we generate three descriptive tables for the same variable in three different subpopulation groups using dtable.
. dtable i.foreign mpg if rep78==3
Summary | ||
N 30 | ||
Car origin | ||
Domestic 27 (90.0%) | ||
Foreign 3 (10.0%) | ||
Mileage (mpg) 19.433 (4.141) | ||
. dtable i.foreign mpg if rep78==4
Summary | ||
N 18 | ||
Car origin | ||
Domestic 9 (50.0%) | ||
Foreign 9 (50.0%) | ||
Mileage (mpg) 21.667 (4.935) | ||
. dtable i.foreign mpg if rep78==5
Summary | ||
N 11 | ||
Car origin | ||
Domestic 2 (18.2%) | ||
Foreign 9 (81.8%) | ||
Mileage (mpg) 27.364 (8.732) | ||
We want to combine them into one big table. However, we cannot use append because this option is not available with dtable. Instead, we can save them in different collections first (by specifying the option name() with dtable to overwrite the default collection name DTable) and then combine them using collect combine. When we run collect combine, Stata will create a dimension called collection in the combined collection to mark which collection each item is originally from. Then we can use the dimension collection to define the rows or columns in our table.
Here is the code:
. clear all . sysuse auto (1978 automobile data) . dtable i.foreign mpg if rep78==3, name(a1) . dtable i.foreign mpg if rep78==4, name(a2) . dtable i.foreign mpg if rep78==5, name(a3) . collect combine all = a1 a2 a3 . *change the display format for means and sds . collect style cell result[mean sd], nformat(%8.2fc) . collect layout (collection#var) (result)
Now we can specify the layout by having var nested in the three collections on rows, while result is put on the columns. In this way, we obtain a long table.
Summary | ||
a1 N 30 Car origin Domestic 27 (90.0%) Foreign 3 (10.0%) Mileage (mpg) 19.43 (4.14) a2 N 18 Car origin Domestic 9 (50.0%) Foreign 9 (50.0%) Mileage (mpg) 21.67 (4.93) a3 N 11 Car origin Domestic 2 (18.2%) Foreign 9 (81.8%) Mileage (mpg) 27.36 (8.73) | ||
Another possible layout is to have result nested within collection on the columns. Then we combine the three tables horizontally, and we obtain a wide table.
. collect layout (var) (collection#result) Collection: all Rows: var Columns: collection#result Table 1: 5 x 3
a1 a2 a3 | ||
Summary Summary Summary | ||
N 30 18 11 Car origin Domestic 27 (90.0%) 9 (50.0%) 2 (18.2%) Foreign 3 (10.0%) 9 (50.0%) 9 (81.8%) Mileage (mpg) 19.43 (4.14) 21.67 (4.93) 27.36 (8.73) | ||
Occasionally, we are going to need to combine tables generated by dtable and table. For example, we want to generate a descriptive table for a set of variables, including continuous variables and categorical variables. More specifically, we need to obtain means and sds for continuous variables and obtain frequencies and percentages by rural for all the categorical variables. In this case, dtable can take care of the statistics for continuous variables. However, dtable cannot obtain factor percentages by another variable, which can be done by table. In this situation, we need to run dtable for continuous variables and table for categorical variables and then combine two tables together. Here is the complete code:
. clear all . webuse nhanes2l (1978 automobile data) . * continuous variable part . dtable height weight bpsystol albumin vitaminc zinc copper, by(rural, nototal) . * factor variable part . local vlist highlead highbp agegrp hlthstat heartatk diabetes . foreach var of local vlist { table (`var') (rural), statistic(frequency) statistic(percent, across(rural)) name(DTable) append nototal } . collect style header highlead highbp agegrp hlthstat heartatk diabetes, title(label) . collect layout (var highlead highbp agegrp hlthstat heartatk diabetes) (rural#result[_dtable_stats])
The output table looks like
Rural | ||
Urban Rural | ||
N 6,548 (63.3%) 3,803 (36.7%) Height (cm) 167.457 (9.761) 167.984 (9.465) Weight (kg) 71.427 (15.363) 72.708 (15.314) Systolic blood pressure 130.460 (23.526) 131.607 (22.980) Serum albumin (g/dL) 4.665 (0.333) 4.677 (0.328) Serum vitamin C (mg/dL) 1.066 (0.592) 0.981 (0.558) Serum zinc (mcg/dL) 86.399 (14.506) 86.691 (14.432) Serum copper (mcg/dL) 126.054 (33.126) 124.860 (31.469) High lead level lead<25 2,902 (62.3%) 1,753 (37.7%) lead>=25 209 (71.3%) 84 (28.7%) High blood pressure 0 3,798 (63.6%) 2,177 (36.4%) 1 2,750 (62.8%) 1,626 (37.2%) Age group 20–29 1,631 (70.3%) 689 (29.7%) 30–39 981 (60.5%) 641 (39.5%) 40–49 777 (61.1%) 495 (38.9%) 50–59 810 (62.7%) 481 (37.3%) 60–69 1,743 (60.9%) 1,117 (39.1%) 70+ 606 (61.5%) 380 (38.5%) Health status Excellent 1,609 (66.8%) 798 (33.2%) Very good 1,713 (66.1%) 878 (33.9%) Good 1,878 (63.9%) 1,060 (36.1%) Fair 950 (56.9%) 720 (43.1%) Poor 389 (53.4%) 340 (46.6%) Prior heart attack No heart attack 6,272 (63.5%) 3,601 (36.5%) Had heart attack 275 (57.8%) 201 (42.2%) Diabetes status Not diabetic 6,233 (63.3%) 3,617 (36.7%) Diabetic 314 (62.9%) 185 (37.1%) | ||
In the above code, we have specified the option append with each table command to append the statistics we compute using table to the same collection that was created by the dtable command before (we must specify the collection name using name(DTable); otherwise, it will use the default name Table). If the append option is not specified, it will first clear the collection and then save its results.
You may wonder why we use the result tag result[_dtable_stats] in the final layout. This is because _dtable_stats is a composite result, which includes frequency, percent, mean, and sd, which are all the statistics involved in this table. On each row, these four statistics will not be all available together, but any available ones will be displayed. This composite result is generated by the first dtable command. You can check all the elements of this composite result by typing
. collect query composite _dtable_stats Composite definition Collection: DTable Composite: _dtable_stats Elements: frequency percent mean sd Delimiter: " " Trim: on Override: off
Read more about tables in the Customizable Tables and Collected Results Reference Manual.