[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
st: RE: Tables showing ordered lists [was: New question]

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	st: RE: Tables showing ordered lists [was: New question]
Date	Fri, 9 Aug 2002 14:16:29 +0100
Rodrigo Brice�o

> I have another question that maybe is related with the same 
> egen function:
> 
> I have a hospital discharges database.  I need to obtain 
> the first 10 diagnoses for each service of the hospital. 
> How can I do that? My variables are:
> 
> Servicio (cirugia, obstetricia, etc.)
> Diagnostico

This looks exactly like Rodrigo's previous question of 12 July. 
Two approaches were given in my posts of 14 and 15 July. 

Once more, I'll answer using the auto data so that 
this is of maximum benefit to others. There is a translation at 
the end to what I conjecture is an answer to his problem. 

Let's generalise: 

1. I have a criterion: it might be just magnitude, or group frequency,
or 
group mean, or group maximum, or anything else computable. 

2. I want to see a table in which entries are ordered by 
that criterion. (Possibly, I want not the whole list of entries, 
but the medallists, or the top ten, or the top k.) 

3. I may want to do that separately for groups defined by 
one or more variables. 

This does not seem a particularly exotic class of problems, 
but official Stata is short of solutions, to the best of 
my knowledge. Specific members of this class have 
been tackled by user-written programs, which is great whenever 
the right one exists for your problem, but useless otherwise. 

Another approach is to go back to first principles. 

1. I need to have the criterion in memory. Either it already 
exists, or I will need to use -generate-, -egen-, .... 

2. I need to feed that in the right order to a table command. 

* Ordering means -sort-. 

* The more low level that table command is, the more you can exercise 
control. Thus, I commend -tabdisp-. Don't be put off by the fact
that it is billed, sometimes, as a programmer's command. 

3. Sooner or later, I will want to do this separately for 
different groups. This means -by varlist:-, and a lot of 
users find that tricky. (One tutorial is in Stata Journal 2(1), 
86-102 (2002).) Fortunately, once you know the basic 
idea, the extension from one-way classifications to two-way to 
multi-way is immediate. 

Example 1
=========

I want to see a table of the top ten -makes- on -mpg-. Clearly, 
highest -mpg- is "top".  

1. The values of -mpg- already exist. 

2. A trick: -sort- sorts lowest values first, so I need to negate 
-mpg- to get the right -sort- order. 

. gen nmpg = -mpg 
. sort nmpg 
. gen order = _n 
. tabdisp order if order <= 10, c(make mpg) 

------------------------------------------
    order | Make and Model   Mileage (mpg)
----------+-------------------------------
        1 |      VW Diesel              41
        2 |     Datsun 210              35
        3 |         Subaru              35
        4 |    Plym. Champ              34
        5 | Toyota Corolla              31
        6 |     Dodge Colt              30
        7 |      Mazda GLC              30
        8 | Chev. Chevette              29
        9 |    Honda Civic              28
       10 |    Ford Fiesta              28
------------------------------------------

Comments: 

* You won't need to bother with negation if you want lowest first. 

* If you know -gsort-, you could have done this in one fewer 
line. It turns out that this doesn't help us much for more 
complicated problems. 

* -tabdisp <rowvar>- shows rows ordered by the values of <rowvar>. 
If any value of <rowvar> occurs more than once, it uses cell values
for the first observation in memory with that value. That's 
fine for us, as each value of -order- is unique. 

* The ability to use -if- gives you the functionality to 
select the top k. 

* We haven't done anything special about tied values. 

Example 2
=========

I want to see a table of the top ten -make-s on -mpg-. 
This time, I want separate listings for values of -foreign-. 

1. The values of -mpg- already exist. 

2. Once again, I negate -mpg- to get the right -sort- order. 

. gen nmpg = -mpg 

Now the most Stataish bit: 

. bysort foreign (nmpg) : gen order = _n 

Unpacking that, 

-sort- by foreign; 
and -sort- within -foreign- by -nmpg-; 
and separately for each group defined by -foreign, 
calculate -order-. 
Under -by:-, _n is interpreted within group. 

. by foreign: tabdisp order if order <= 10, c(make mpg) 

______________________________________________________________________
_________
-> foreign = Domestic

------------------------------------------
    order | Make and Model   Mileage (mpg)
----------+-------------------------------
        1 |    Plym. Champ              34
        2 |     Dodge Colt              30
        3 | Chev. Chevette              29
        4 |    Ford Fiesta              28
        5 |    Plym. Arrow              28
        6 |     Buick Opel              26
        7 |  Plym. Sapporo              26
        8 |  Plym. Horizon              25
        9 |  Olds Starfire              24
       10 |    Chev. Monza              24
------------------------------------------

______________________________________________________________________
_________
-> foreign = Foreign

------------------------------------------
    order | Make and Model   Mileage (mpg)
----------+-------------------------------
        1 |      VW Diesel              41
        2 |         Subaru              35
        3 |     Datsun 210              35
        4 | Toyota Corolla              31
        5 |      Mazda GLC              30
        6 |    Honda Civic              28
        7 | Renault Le Car              26
        8 |    VW Scirocco              25
        9 |       BMW 320i              25
       10 |      VW Rabbit              25
------------------------------------------

Comments: 

* We do need, for this, a variable with negative mpg. 

. bysort foreign (- mpg) : gen order = _n 

would be neat if it were legal syntax, but it isn't. 

* Each value of -order- is no longer guaranteed unique, 
but -by:- keeps them apart. 

* We still haven't done anything special about tied values. 

Example 3
=========

I want to see a table of the top ten -make-s on -mpg-. 
This time, I want separate listings for classes defined 
by -foreign- and -rep78-. 

. gen nmpg = -mpg 
. bysort foreign rep78 (nmpg) : gen order = _n 
. by foreign rep78: tabdisp order if order <= 10, c(make mpg) 

(It does work.) 

The principle extends not only to this two-way problem, but from 
there to multi-way problems. 

Example 4
=========

I want to see a table of the most frequent manufacturers, 
defined as the first word of -make-. We have to get 
this ourselves, but that's just to set up an example: 

. egen manuf = ends(make), head 

1. Frequency must be calculated. And we'll need 
to negate it again as we'll want highest frequency first. 

. bysort manuf : gen freq = -_N 

Under -by:-, _N is interpreted within group. 

Another way to do this is with -egen-: 

. bysort manuf: egen freq = sum(-1)

2. Now each observation with the same value of 
-manuf- has been assigned the same value of -freq-. Each 
also will need to be assigned the same
value of -order-:  

. bysort freq manuf : gen order = _n == 1 
. replace order = sum(order) 

Unpacking this again, we want 
the -manuf- with the highest frequency to 
be assigned 1, etc., so we need to -sort- 
on -freq-. But it is possible that 
different manufacturers could have the 
same frequency, so we would need to split 
any ties. Hence the first part, 

. bysort freq manuf: 

Now we tag the first value within 
each group by 1 (_n == 1 is true (numerically 
1) whenever _n is 1), and get the 
cumulative sum. This gives us blocks 
of 1s, of 2s, etc. 

We have to reverse the negation 

. replace freq = - freq 

and then we are home and dry 

. tabdisp order, c(manuf freq) 

Comments: 

* Each value is no longer unique, but 
we are quite safe, as -manuf- and -freq- 
do not vary within -order-. 

* Ties! 

Example 5 (and last)
====================

I want to see a table of the most frequent manufacturers, 
defined as the first word of -make-, but separately 
by -foreign-. This is the analogue of Rodrigo's problem. 

1. Frequency must be calculated. And we'll need 
to negate it again as we'll want highest frequency first. 

. bysort foreign manuf : gen freq = -_N 

Under -by:-, _N is interpreted within group. 

2. Now each observation within groups of 
-foreign manuf- has been 
assigned the same value of -freq-. Each 
also will need to be assigned the same
value of -order-:  

. bysort foreign freq manuf : gen order = _n == 1 
. by foreign: replace order = sum(order) 

Unpacking this again, we want our highest
level to be -foreign-, as we want separate 
tables for separate values of -foreign-. 
Within that level, we want 
the -manuf- with the highest frequency to 
be assigned 1, etc., so we need to -sort- 
on -freq-. But it is possible that 
different manufacturers could have the 
same frequency, so we would need to split 
any ties on -manuf-. 

Then we tag the first value within 
each group by 1 (_n == 1 is true (numerically 
1) whenever _n is 1), and get the 
cumulative sum. This gives us blocks 
of 1s, of 2s, etc. 

We have to reverse the negation once more 

. replace freq = - freq 

and then we are home and dry 

. bysort foreign: tabdisp order, c(manuf freq) 

Comments: 

* See comments for Example 4. 

* Ties! 

* Extension to multi-way is immediate. 

* Lots of problems still remain, including handling weights
and -if-/-in- restrictions.

P.S. for Rodrigo. modulo some translation into Spanish
(Frequencia???): 
 
. bysort Servicio Diagnostico : gen freq = _N 
. bysort Servicio freq Diagnostico : gen order = _n == 1
. by Servicio : replace order = sum(order) 
. replace freq = -freq 
. by Servicio : tabdisp order, c(Diagnostico freq) 
 

Nick 
[email protected]
<<attachment: winmail.dat>>
References:
- st: New question
  - From: "Rodrigo Brice�o" <[email protected]>
Prev by Date: st: drop exact name only
Next by Date: st: RE: How SHOULD cut behave
Previous by thread: st: New question
Next by thread: st: update to update to cut
Index(es):
- Date
- Thread