|
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
st: text manipulation of tabulate output
Suppose you have a list of categorical (qualitative) variables that are in your data; each variable has some arbitrary number of categories; and you want to produce a report, in text,
- with one row for each variable and
- a list of the percents in each category for each variable.
The code below produces the following display for a set of variables:
Paired_Biopsy_: number of categories=2; total nonmissing= 250; 81.2%, 18.8%
ALTcode: number of categories=3; total nonmissing= 250; 41.2%, 56%, 2.8%
Alcohol: number of categories=2; total nonmissing= 250; 75.2%, 24.8%
CDC_class: number of categories=8; total nonmissing= 161; 26.1%, 23.6%, 11.2%, 8.7%, 4.3%, .6%, 1.9%, 23.6%
BEGIN CODE
local QualitVars ///
Paired_Biopsy_ ///
ALTcode ///
Alcohol ///
CDC_class ///
display "`QualitVars'"
tabulate CDC_class
generate DUMMYjunk=0
foreach THISVAR of varlist `QualitVars' ///
{
display " "
display "`THISVAR'" ": " _continue
drop DUMMY*
quietly: tabulate `THISVAR', generate (DUMMY)
scalar nCategories=r(r)
scalar denominator=r(N)
display "number of categories=" nCategories "; total nonmissing= " denominator "; " _continue
local index=0
while `index' < nCategories {
local index=`index' + 1
quietly: summarize DUMMY`index'
scalar thispercent= round( 100* r(sum)/denominator, 0.1)
display thispercent "%" _continue
if `index' < nCategories {
display ", " _continue
}
}
}
END CODE
Question Number One: Am I reinventing the wheel? Is there an easier way to do this?
Question Number Two: Is there a way to get the labels for the categories for each variable?
For instance, the labels for CDC_class are:
. tabulate CDC_class
CDC_class | Freq. Percent Cum.
------------+-----------------------------------
A1 | 42 26.09 26.09
A2 | 38 23.60 49.69
A3 | 18 11.18 60.87
B2 | 14 8.70 69.57
B3 | 7 4.35 73.91
C1 | 1 0.62 74.53
C2 | 3 1.86 76.40
C3 | 38 23.60 100.00
------------+-----------------------------------
Total | 161 100.00
so that the output should really look like this:
CDC_class: number of categories=8; total nonmissing= 161; A1: 26.1%, A2: 23.6%, A3: 11.2%, B2: 8.7%, B3: 4.3%, C1: .6%, C2: 1.9%, C3: 23.6%
The format of the output of the tabulate command above, suggests that fancy text manipulation (using perl, for instance) of that output would be a way to eliminate the fancy loop above *and* to get the category labels. But is there a more direct way?
Thank you for any pointers
Jake
Jacob A. Wegelin
[email protected]
Assistant Professor
Department of Biostatistics
Virginia Commonwealth University
730 East Broad Street Room 3006
P. O. Box 980032
Richmond VA 23298-0032
U.S.A.
http://www.people.vcu.edu/~jwegelin