Thanks to Kit Baum, the -egenmore- package on SSC has been
updated. This consists of (you've guessed it) more -egen-
functions. Most require no more than Stata 6, but some
require Stata 7, as is flagged in the package description
and the collective help -egenmore-. (Other user-written
-egen- functions can be located with -findit-.)
To get a listing of function names, type
. ssc desc egenmore
To get more details, type
. ssc type egenmore.hlp
To install, use
. ssc inst egenmore
or
. ssc inst egenmore, replace
as appropriate.
If your Stata is not up-to-date enough to include
either -findit- or -ssc-, please see the first URL under
my signature for advice.
The update consists of a single new function
-egroup()-. Its nonce name -egroup()- is intended
merely to flag a small _e_xtension to the official
Stata egen function -group()-. The extension is
that the -label- option may specify a list of
variables to use in the value labels of the new
variable. The use of this is best shown by an
example. Suppose as a small variation on examples
with the auto data, we strip off the first word
of -make-
. egen manuf = head(make)
and ask for a simple table showing frequencies:
. tab manuf
      manuf |      Freq.     Percent        Cum.
------------+-----------------------------------
        AMC |          3        4.05        4.05
       Audi |          2        2.70        6.76
        BMW |          1        1.35        8.11
      Buick |          7        9.46       17.57
       Cad. |          3        4.05       21.62
      Chev. |          6        8.11       29.73
     Datsun |          4        5.41       35.14
      Dodge |          4        5.41       40.54
       Fiat |          1        1.35       41.89
       Ford |          2        2.70       44.59
      Honda |          2        2.70       47.30
      Linc. |          3        4.05       51.35
      Mazda |          1        1.35       52.70
      Merc. |          6        8.11       60.81
       Olds |          7        9.46       70.27
    Peugeot |          1        1.35       71.62
      Plym. |          5        6.76       78.38
      Pont. |          6        8.11       86.49
    Renault |          1        1.35       87.84
     Subaru |          1        1.35       89.19
     Toyota |          3        4.05       93.24
         VW |          4        5.41       98.65
      Volvo |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00
This shows a familiar feature: with string variables
(and also with numeric variables with value labels
-encode-d alphabetically), we get alphabetic (strictly,
alphanumeric) order, which is great for look-up, but
often lousy for identifying patterns or interesting
features. A more useful table would be ordered on
frequency, and highest first, or so I suggest.
As it happens, there is a kludged solution to this
particular problem with -tabulate-, a program
called -tabsort-, but it is of more interest to identify
a general approach to a solution, because the same
irritation can arise with other tabular and graphical output.
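For this one table only, there is also a shortcut in more recent versions of official Stata than this post assumes: -tabulate- has a -sort- option that orders rows by descending frequency. A minimal sketch, assuming the auto data shipped with Stata, and using the function -word()- merely as a stand-in for -egen, head()-:

```stata
* Sketch only: assumes a Stata recent enough to have -sysuse- and -tabulate, sort-
sysuse auto, clear

* first word of make; a stand-in for -egen manuf = head(make)-
gen manuf = word(make, 1)

* one-way table with rows in descending order of frequency
tabulate manuf, sort
```

That helps with -tabulate- alone, so the case for a general approach is unchanged.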
We can get most of the way there in two lines of
official Stata. Calculate the frequencies ourselves,
. bysort manuf : gen freq = -_N
(remembering to negate values to get the desired
sort order), and use -egen, group() label- to
get an equivalent categorical variable.
. egen Manuf = group(freq manuf) , label
. tab Manuf
 group(freq |
     manuf) |      Freq.     Percent        Cum.
------------+-----------------------------------
   -7 Buick |          7        9.46        9.46
    -7 Olds |          7        9.46       18.92
   -6 Chev. |          6        8.11       27.03
   -6 Merc. |          6        8.11       35.14
   -6 Pont. |          6        8.11       43.24
   -5 Plym. |          5        6.76       50.00
  -4 Datsun |          4        5.41       55.41
   -4 Dodge |          4        5.41       60.81
      -4 VW |          4        5.41       66.22
     -3 AMC |          3        4.05       70.27
    -3 Cad. |          3        4.05       74.32
   -3 Linc. |          3        4.05       78.38
  -3 Toyota |          3        4.05       82.43
    -2 Audi |          2        2.70       85.14
    -2 Ford |          2        2.70       87.84
   -2 Honda |          2        2.70       90.54
     -1 BMW |          1        1.35       91.89
    -1 Fiat |          1        1.35       93.24
   -1 Mazda |          1        1.35       94.59
 -1 Peugeot |          1        1.35       95.95
 -1 Renault |          1        1.35       97.30
  -1 Subaru |          1        1.35       98.65
   -1 Volvo |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00
The nuisance remaining is that we have the
negated frequencies cluttering up the value labels.
(Ask for a value label, and -egen, group()- uses
all the variables mentioned.) Hence the need
for a new option, which is the only thing added
in -egroup()-:
. egen Manuf2 = egroup(freq manuf) , label(manuf)
. tab Manuf2
group(manuf |
          ) |      Freq.     Percent        Cum.
------------+-----------------------------------
      Buick |          7        9.46        9.46
       Olds |          7        9.46       18.92
      Chev. |          6        8.11       27.03
      Merc. |          6        8.11       35.14
< it's OK >
    Peugeot |          1        1.35       95.95
    Renault |          1        1.35       97.30
     Subaru |          1        1.35       98.65
      Volvo |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00
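To show what the -label()- option saves you, here is a rough equivalent in official Stata alone. It is only a sketch and not how -egroup()- is coded: the variable and value label name Manuf4 is made up for illustration, -word()- stands in for -egen, head()-, and the syntax assumes a Stata recent enough for -sysuse-, // comments and inline `=exp' expansion.

```stata
* Sketch only: official Stata commands, no -egroup()- needed.
* The names freq and Manuf4 are illustrative.
sysuse auto, clear
gen manuf = word(make, 1)          // stand-in for -egen manuf = head(make)-
bysort manuf : gen freq = -_N      // negated, so the biggest groups sort first
egen Manuf4 = group(freq manuf)    // integer codes 1, 2, ... in the desired order

* attach the text of manuf alone to each code
* (every observation is visited, so each code is simply defined repeatedly)
forvalues i = 1/`=_N' {
    label define Manuf4 `=Manuf4[`i']' `"`=manuf[`i']'"', modify
}
label values Manuf4 Manuf4

tabulate Manuf4
```

The loop is the bookkeeping that -label(manuf)- does for you in one option.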
This approach can be extended to other requests,
standard or bizarre. Suppose we want a table
ordered on maximum mpg:
. bysort manuf : egen maxmpg = min(-mpg)
(a little hand-waving shows why: the minimum of the negated
values is the negated maximum, so the highest mpg sorts first)
. egen Manuf3 = egroup(maxmpg manuf) , label(manuf)
. tabstat mpg , by(Manuf3) s(max)
Summary for variables: mpg
     by categories of: Manuf3 (group(manuf))
 Manuf3 |       max
--------+----------
     VW |        41
 Datsun |        35
 Subaru |        35
  Plym. |        34
<it's OK too >
   Fiat |        21
  Volvo |        17
  Linc. |        14
Peugeot |        14
--------+----------
  Total |        41
-------------------
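(Why can't we simply type
. egen Manuf3 = egroup(maxmpg), label(manuf)
instead? Because we need to break ties on maxmpg: Datsun and
Subaru, for example, share a maximum of 35 and would
otherwise land in the same group.)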
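The point about ties can be checked with official -egen, group()- alone. This is a sketch under the same assumptions as before (auto data, -word()- standing in for -egen, head()-); the names byboth and bymax are made up for illustration.

```stata
* Sketch only: compare grouping on (maxmpg manuf) with grouping on maxmpg alone.
sysuse auto, clear
gen manuf = word(make, 1)                 // stand-in for -egen, head()-
bysort manuf : egen maxmpg = min(-mpg)    // negated maximum mpg per manufacturer

egen byboth = group(maxmpg manuf)         // one code per manufacturer
egen bymax  = group(maxmpg)               // one code per distinct maximum

* manufacturers tied on a maximum of 35 share a single code in bymax
tabulate manuf if maxmpg == -35

* the maximum of each variable is its number of groups
summarize byboth bymax
```

With bymax, one code covers more than one manufacturer, so there is no single value of manuf to use as its label.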
A lot of detail explaining one little option, but it may
be useful.
Nick
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/