Rita Luk <[email protected]> is a SAS user needing to use Stata. She wants to
create a dataset of tabulations, such as
OBS NAME1 VALUE1 NAME2 VALUE2 COUNT PERCENT1 PERCENT2
1 sex female . . 50 0.25 .
2 sex male . . 150 0.75 .
3 sex female age old 25 0.50 0.33
4 sex female age young 25 0.50 0.20
5 sex male age old 50 0.33 0.67
6 sex male age young 100 0.67 0.80
about which she writes,
> [...] this dataset would result from running two different tabulations. The
> first two observations [...] come from a single variable tabulation of sex.
> The subsequent 4 observations (3-6) [... result ...] from a cross
> tabulation of the dichotomous variables age and sex.
She adds,
> In stata by using [...an inventive but convoluted procedure...] I can get
> the percents, and with a little data manipulation the counts and names, but
> I cannot get the values [...] There must be a quicker way.
Problem is, Rita is still thinking SAS while using Stata.
Introduction, one-way tabulations
---------------------------------
To obtain a dataset containing the results of a one-way tabulation, here
is one recipe:
. sort <varname>
. by <varname>: gen count = _N
. by <varname>: keep if _n==_N
. keep <varname> count
Let's try it with the auto data:
. sysuse auto, clear
(1978 Automobile Data)
. sort rep78
. by rep78: gen count=_N
. by rep78: keep if _n==_N
(68 observations deleted)
. keep rep78 count
. list
rep78 count
1. 1 2
2. 2 8
3. 3 30
4. 4 18
5. 5 11
6. . 5
Note that we got the missing value. If we did not want that, we could
now type
. drop if missing(rep78)
We also do not have the percent. That's easy enough to get:
. gen sum = sum(count)
. gen percent = count/sum[_N]
Doing that with the dataset above (and leaving in the missings), we would
get:
. list
rep78 count sum percent
1. 1 2 2 .027027
2. 2 8 10 .1081081
3. 3 30 40 .4054054
4. 4 18 58 .2432432
5. 5 11 69 .1486486
6. . 5 74 .0675676
If we did not want variable sum, we could drop it by typing
. drop sum
So the entire solution is
. sort <varname>
. by <varname>: gen count = _N
. by <varname>: keep if _n==_N
. keep <varname> count
. gen sum = sum(count)
. generate percent = count/sum[_N]
. drop sum
Introduction, two-way tabulations
---------------------------------
Obtaining two-way tabulations is no more difficult that one-way tabulations:
. sort <varname1> <varname2>
. by <varname1> <varname2>: gen count = _N
. by <varname1> <varname2>: keep if _n==_N
. keep <varname1> <varname2> count
The solution is the same, we just -sort- and -by- on more variables. Getting
the two percentages is a little more work then in the one-way case:
. by <varname1>: gen sum = sum(count)
. by <varname1>: gen percent1 = count/sum[_N]
. drop sum
. sort <varname2>
. by <varname2>: gen sum = sum(count)
. by <varname2>: gen percent2 = count/sum[_N]
. drop sum
I follow the above procedure using the auto data, substituting rep78 for
<varname1> and foreign for <varname2>. Before listing the data, I
-sort rep78 foreign-:
. list, clean
rep78 foreign count percent1 percent2
1. 1 Domestic 2 1 .0384615
2. 2 Domestic 8 1 .1538462
3. 3 Domestic 27 .9 .5192308
4. 3 Foreign 3 .1 .1363636
5. 4 Domestic 9 .5 .1730769
6. 4 Foreign 9 .5 .4090909
7. 5 Domestic 2 .1818182 .0384615
8. 5 Foreign 9 .8181818 .4090909
9. . Domestic 4 .8 .0769231
10. . Foreign 1 .2 .0454545
Combining results
-----------------
We now need to rename and append to get the results in the form Rita wants.
Let's pretend I saved the results of each of the above steps as result1.dta
and result2.dta:
. use result1
. gen str name1 = "rep78"
. rename rep78 value1
. rename percent percent1
. save result1, replace
. use result2
. gen str name1="rep78"
. gen str name2="foreign"
. rename rep78 value1
. rename foreign value2
. save result2, replace
. use result1
. append using result2
Oh yes, and Rita wanted the variables in the order name1, value1, name2,
value2, count, percent1, percent2:
. order name1 value1 name2 value2 count percent1 percent2
I did the above using my example and obtained:
. list, clean
name1 value1 name2 value2 count percent1 percent2
1. rep78 1 . 2 .027027 .
2. rep78 2 . 8 .1081081 .
3. rep78 3 . 30 .4054054 .
4. rep78 4 . 18 .2432432 .
5. rep78 5 . 11 .1486486 .
6. rep78 . . 5 .0675676 .
7. rep78 1 foreign Domestic 2 1 .0384615
8. rep78 2 foreign Domestic 8 1 .1538462
9. rep78 3 foreign Domestic 27 .9 .5192308
10. rep78 3 foreign Foreign 3 .1 .1363636
11. rep78 4 foreign Domestic 9 .5 .1730769
12. rep78 4 foreign Foreign 9 .5 .4090909
13. rep78 5 foreign Domestic 2 .1818182 .0384615
14. rep78 5 foreign Foreign 9 .8181818 .4090909
15. rep78 . foreign Domestic 4 .8 .0769231
16. rep78 . foreign Foreign 1 .2 .0454545
That was the desired result.
Automation
----------
Rita wrote,
> I plan on doing A LOT of single tabulations and cross tabulations and want
> to produce an output dataset that might look like:
so now let's automate it:
program forrita
syntax varlist(min=2 max=2)
local v1 : word 1 of `varlist'
local v2 : word 2 of `varlist'
preserve
sort `v1'
by `v1': gen count = _N
by `v1': keep if _n==_N
keep `v1' count
gen sum = sum(count)
gen percent1 = count/sum[_N]
drop sum
save result1, replace
restore, preserve
sort `v1' `v2'
by `v1' `v2': gen count = _N
by `v1' `v2': keep if _n==_N
keep `v1' `v2' count
by `v1': gen sum = sum(count)
by `v1': gen percent1 = count/sum[_N]
drop sum
sort `v2'
by `v2': gen sum = sum(count)
by `v2': gen percent2 = count/sum[_N]
drop sum
sort `v1' `v2'
save result2, replace
use result1
gen str name1 = "`v1'"
rename `v1' value1
save result1, replace
use result2
gen str name1 = "`v1'"
gen str name2 = "`v2'"
rename `v1' value1
rename `v2' value2
save result2, replace
use result1
append using result2
order name1 value1 name2 value2 count percent1 percent1
restore, not
erase result1.dta
erase result2.dta
end
With the above program, I can type
. sysuse auto, clear
. forrita rep78 foreign
and obtain the desired result.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/