Rodrigo Brice�o
> I have another question that maybe is related with the same
> egen function:
>
> I have a hospital discharges database. I need to obtain
> the first 10 diagnoses for each service of the hospital.
> How can I do that? My variables are:
>
> Servicio (cirugia, obstetricia, etc.)
> Diagnostico
This looks exactly like Rodrigo's previous question of 12 July.
Two approaches were given in my posts of 14 and 15 July.
Once more, I'll answer using the auto data so that
this is of maximum benefit to others. There is a translation at
the end to what I conjecture is an answer to his problem.
Let's generalise:
1. I have a criterion: it might be just magnitude, or group frequency,
or
group mean, or group maximum, or anything else computable.
2. I want to see a table in which entries are ordered by
that criterion. (Possibly, I want not the whole list of entries,
but the medallists, or the top ten, or the top k.)
3. I may want to do that separately for groups defined by
one or more variables.
This does not seem a particularly exotic class of problems,
but official Stata is short of solutions, to the best of
my knowledge. Specific members of this class have
been tackled by user-written programs, which is great whenever
the right one exists for your problem, but useless otherwise.
Another approach is to go back to first principles.
1. I need to have the criterion in memory. Either it already
exists, or I will need to use -generate-, -egen-, ....
2. I need to feed that in the right order to a table command.
* Ordering means -sort-.
* The more low level that table command is, the more you can exercise
control. Thus, I commend -tabdisp-. Don't be put off by the fact
that it is billed, sometimes, as a programmer's command.
3. Sooner or later, I will want to do this separately for
different groups. This means -by varlist:-, and a lot of
users find that tricky. (One tutorial is in Stata Journal 2(1),
86-102 (2002).) Fortunately, once you know the basic
idea, the extension from one-way classifications to two-way to
multi-way is immediate.
Example 1
=========
I want to see a table of the top ten -makes- on -mpg-. Clearly,
highest -mpg- is "top".
1. The values of -mpg- already exist.
2. A trick: -sort- sorts lowest values first, so I need to negate
-mpg- to get the right -sort- order.
. gen nmpg = -mpg
. sort nmpg
. gen order = _n
. tabdisp order if order <= 10, c(make mpg)
------------------------------------------
order | Make and Model Mileage (mpg)
----------+-------------------------------
1 | VW Diesel 41
2 | Datsun 210 35
3 | Subaru 35
4 | Plym. Champ 34
5 | Toyota Corolla 31
6 | Dodge Colt 30
7 | Mazda GLC 30
8 | Chev. Chevette 29
9 | Honda Civic 28
10 | Ford Fiesta 28
------------------------------------------
Comments:
* You won't need to bother with negation if you want lowest first.
* If you know -gsort-, you could have done this in one fewer
line. It turns out that this doesn't help us much for more
complicated problems.
* -tabdisp <rowvar>- shows rows ordered by the values of <rowvar>.
If any value of <rowvar> occurs more than once, it uses cell values
for the first observation in memory with that value. That's
fine for us, as each value of -order- is unique.
* The ability to use -if- gives you the functionality to
select the top k.
* We haven't done anything special about tied values.
Example 2
=========
I want to see a table of the top ten -make-s on -mpg-.
This time, I want separate listings for values of -foreign-.
1. The values of -mpg- already exist.
2. Once again, I negate -mpg- to get the right -sort- order.
. gen nmpg = -mpg
Now the most Stataish bit:
. bysort foreign (nmpg) : gen order = _n
Unpacking that,
-sort- by foreign;
and -sort- within -foreign- by -nmpg-;
and separately for each group defined by -foreign,
calculate -order-.
Under -by:-, _n is interpreted within group.
. by foreign: tabdisp order if order <= 10, c(make mpg)
______________________________________________________________________
_________
-> foreign = Domestic
------------------------------------------
order | Make and Model Mileage (mpg)
----------+-------------------------------
1 | Plym. Champ 34
2 | Dodge Colt 30
3 | Chev. Chevette 29
4 | Ford Fiesta 28
5 | Plym. Arrow 28
6 | Buick Opel 26
7 | Plym. Sapporo 26
8 | Plym. Horizon 25
9 | Olds Starfire 24
10 | Chev. Monza 24
------------------------------------------
______________________________________________________________________
_________
-> foreign = Foreign
------------------------------------------
order | Make and Model Mileage (mpg)
----------+-------------------------------
1 | VW Diesel 41
2 | Subaru 35
3 | Datsun 210 35
4 | Toyota Corolla 31
5 | Mazda GLC 30
6 | Honda Civic 28
7 | Renault Le Car 26
8 | VW Scirocco 25
9 | BMW 320i 25
10 | VW Rabbit 25
------------------------------------------
Comments:
* We do need, for this, a variable with negative mpg.
. bysort foreign (- mpg) : gen order = _n
would be neat if it were legal syntax, but it isn't.
* Each value of -order- is no longer guaranteed unique,
but -by:- keeps them apart.
* We still haven't done anything special about tied values.
Example 3
=========
I want to see a table of the top ten -make-s on -mpg-.
This time, I want separate listings for classes defined
by -foreign- and -rep78-.
. gen nmpg = -mpg
. bysort foreign rep78 (nmpg) : gen order = _n
. by foreign rep78: tabdisp order if order <= 10, c(make mpg)
(It does work.)
The principle extends not only to this two-way problem, but from
there to multi-way problems.
Example 4
=========
I want to see a table of the most frequent manufacturers,
defined as the first word of -make-. We have to get
this ourselves, but that's just to set up an example:
. egen manuf = ends(make), head
1. Frequency must be calculated. And we'll need
to negate it again as we'll want highest frequency first.
. bysort manuf : gen freq = -_N
Under -by:-, _N is interpreted within group.
Another way to do this is with -egen-:
. bysort manuf: egen freq = sum(-1)
2. Now each observation with the same value of
-manuf- has been assigned the same value of -freq-. Each
also will need to be assigned the same
value of -order-:
. bysort freq manuf : gen order = _n == 1
. replace order = sum(order)
Unpacking this again, we want
the -manuf- with the highest frequency to
be assigned 1, etc., so we need to -sort-
on -freq-. But it is possible that
different manufacturers could have the
same frequency, so we would need to split
any ties. Hence the first part,
. bysort freq manuf:
Now we tag the first value within
each group by 1 (_n == 1 is true (numerically
1) whenever _n is 1), and get the
cumulative sum. This gives us blocks
of 1s, of 2s, etc.
We have to reverse the negation
. replace freq = - freq
and then we are home and dry
. tabdisp order, c(manuf freq)
Comments:
* Each value is no longer unique, but
we are quite safe, as -manuf- and -freq-
do not vary within -order-.
* Ties!
Example 5 (and last)
====================
I want to see a table of the most frequent manufacturers,
defined as the first word of -make-, but separately
by -foreign-. This is the analogue of Rodrigo's problem.
1. Frequency must be calculated. And we'll need
to negate it again as we'll want highest frequency first.
. bysort foreign manuf : gen freq = -_N
Under -by:-, _N is interpreted within group.
2. Now each observation within groups of
-foreign manuf- has been
assigned the same value of -freq-. Each
also will need to be assigned the same
value of -order-:
. bysort foreign freq manuf : gen order = _n == 1
. by foreign: replace order = sum(order)
Unpacking this again, we want our highest
level to be -foreign-, as we want separate
tables for separate values of -foreign-.
Within that level, we want
the -manuf- with the highest frequency to
be assigned 1, etc., so we need to -sort-
on -freq-. But it is possible that
different manufacturers could have the
same frequency, so we would need to split
any ties on -manuf-.
Then we tag the first value within
each group by 1 (_n == 1 is true (numerically
1) whenever _n is 1), and get the
cumulative sum. This gives us blocks
of 1s, of 2s, etc.
We have to reverse the negation once more
. replace freq = - freq
and then we are home and dry
. bysort foreign: tabdisp order, c(manuf freq)
Comments:
* See comments for Example 4.
* Ties!
* Extension to multi-way is immediate.
* Lots of problems still remain, including handling weights
and -if-/-in- restrictions.
P.S. for Rodrigo. modulo some translation into Spanish
(Frequencia???):
. bysort Servicio Diagnostico : gen freq = _N
. bysort Servicio freq Diagnostico : gen order = _n == 1
. by Servicio : replace order = sum(order)
. replace freq = -freq
. by Servicio : tabdisp order, c(Diagnostico freq)
Nick
[email protected]
<<attachment: winmail.dat>>