Statalist



st: RE: Fractional Median


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: Fractional Median
Date   Sun, 18 Jan 2009 18:55:21 -0000

Like Maarten, I struggled a bit with the word descriptions here. I'd
rather see algebra or code! 

I've tried to reconstruct a more general version of this problem for
myself. I don't find the term "fractional median" especially
transparent. I prefer "interpolated median". (The result need not be
fractional, although it usually will be.) 

What Taggert wants is evidently not -_pctile, altdef-, which produces 4
for his toy dataset 

value freq 
2      2
3      9
4      8
5      8 

For the same toy data -hdquantile- from SSC gives 3.8491074 for the
median, but although it produces a sensible answer it uses a quite
different logic that would be a challenge to explain to non-technical
audiences. 

A relatively simple procedure to define (although not necessarily to
implement) is interpolation within the cumulative distribution function
or equivalently the quantile function. Although I'll plump for linear
interpolation -- there is room for discussion about that -- quite what
definition of either function to use is a little tricky. The usual
definition of cumulative probability 

proportion <= this value 

is asymmetric, and I think it would be better practice to use a
symmetric definition 

(proportion < this value) + (1/2) (proportion at this value) 

The difference can be substantial for highly lumped distributions, as
are being considered here. For other Stata uses of this definition, see
the -ridit()- function in -egenmore- from SSC or the -midpoint- option
in -distplot- from the Stata Journal files. 

If we fire up Mata as a calculator, we can first enter the data as
values and frequencies: 

: y = 2, 3, 4, 5

: f = 2, 9, 8, 8

Then work out the cumulative frequencies 

: runningsum(f)
        1    2    3    4
    +---------------------+
  1 |   2   11   19   27  |
    +---------------------+

-- and subtract half the frequencies, then divide by the total of 27, to
get the cumulative proportions, symmetrically considered. 

: runningsum(f) :- f/2
         1     2     3     4
    +-------------------------+
  1 |    1   6.5    15    23  |
    +-------------------------+

: (runningsum(f) :- f/2) / 27
                 1             2             3             4
    +---------------------------------------------------------+
  1 |   .037037037   .2407407407   .5555555556   .8518518519  |
    +---------------------------------------------------------+

: cup = (runningsum(f) :- f/2) / 27
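
For comparison, the usual definition (proportion <= this value) can be
computed the same way. This is only a sketch for the toy data; the names
n and cuu are made up for the illustration.

```mata
mata:
f = (2, 9, 8, 8)                       // frequencies of the values 2, 3, 4, 5
n = sum(f)                             // 27 observations in all
cuu = runningsum(f) / n                // usual: proportion <= this value
cup = (runningsum(f) :- f/2) / n       // symmetric version, as above
cuu \ cup                              // one row for each definition
cuu - cup                              // the gap is f/(2n) at each value
end
```

The two definitions differ most at the value 3, by 4.5/27 or about .167,
which is a large shift when there are only four distinct values.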

We see that we need to interpolate between the 2nd and 3rd values. 

: y[2] + (0.5 - cup[2]) / (cup[3] - cup[2])
  3.823529412

where (y[3] - y[2]) is omitted as it equals 1. 
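
When the values are not one unit apart, the factor for the gap between
adjacent values has to stay in. Here is a minimal sketch of the same
interpolation for any proportion p. It assumes that y is sorted in
ascending order and that p lies strictly between the first and last
elements of cup (otherwise j or j + 1 is not a valid subscript). The
names p, j and q are made up for the illustration, and this is not the
code of any released program.

```mata
mata:
y = (2, 3, 4, 5)                       // distinct values, ascending
f = (2, 9, 8, 8)                       // their frequencies
p = 0.5                                // proportion wanted; 0.5 is the median
cup = (runningsum(f) :- f/2) / sum(f)  // symmetric cumulative proportions
j = sum(cup :< p)                      // last value whose cup is below p
// linear interpolation between y[j] and y[j+1], gap factor included
q = y[j] + (p - cup[j]) / (cup[j+1] - cup[j]) * (y[j+1] - y[j])
q
end
```

With p = 0.5 this picks j = 2 and reproduces 3.823529412.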

Taggert's result is similar, but his logic is not identical. He gets
3.8125.  
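
For comparison, Taggert's recipe, as read from his description quoted
below, can be sketched in Mata as well. The sketch assumes cells of
width 1 centred on each value, as in his example; the names half, cum,
j, below and fm are made up for the illustration.

```mata
mata:
y = (2, 3, 4, 5)                       // the scores
f = (2, 9, 8, 8)                       // their frequencies
half = sum(f) / 2                      // 13.5, the "thirteen and a half value"
cum = runningsum(f)                    // 2, 11, 19, 27
j = sum(cum :< half) + 1               // cell holding the regular median
below = cum[j] - f[j]                  // 11 observations lie below that cell
// lower cell boundary plus the fraction of the way into the cell
fm = (y[j] - 0.5) + (half - below) / f[j]
fm
end
```

This gives 3.5 + 2.5/8 = 3.8125. On this reading, his method amounts to
linear interpolation of the usual cumulative proportions (proportion <=
this value) placed at the upper cell boundaries 2.5, 3.5, 4.5 and 5.5.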

Not surprisingly I prefer my way of thinking about it! I don't assume
that the data are adjacent integers or even integers. The little details
about cells and insisting that the fractional median belongs in the same
cell as the regular median strike me as rather arbitrary and artificial.


I think my procedure deserves a program and I've written one. The
generalisations desirable seem to be to support 

1. One or more quantiles, not just medians. 
2. One or more variables. 
3. One or more groups.  
4. Weights. 
5. Saved results. 

-- among some other details. Here is some sample output: 

. sysuse auto

. iquantile mpg, p(25(25)75) by(rep78) format(%2.1f)

  +----------------------------+
  | rep78    25%    50%    75% |
  |----------------------------|
  |     1   18.0   21.0   24.0 |
  |     2   16.5   18.0   22.7 |
  |     3   17.0   19.3   21.4 |
  |     4   17.3   22.3   25.0 |
  |     5   17.9   30.0   34.5 |
  +----------------------------+

. iquantile turn trunk, p(25(25)75) by(rep78) format(%2.1f)

  +----------------------------------------------------------------------------+
  | rep78   25% turn   50% turn   75% turn   25% trunk   50% trunk   75% trunk |
  |----------------------------------------------------------------------------|
  |     1       40.0       41.0       42.0         7.0         8.5        10.0 |
  |     2       41.0       43.5       45.5        10.6        16.2        17.0 |
  |     3       38.2       42.0       43.4        12.4        15.8        17.1 |
  |     4       34.4       37.3       42.6         8.8        14.0        18.0 |
  |     5       35.1       35.9       36.6         9.3        11.0        14.1 |
  +----------------------------------------------------------------------------+

. iquantile mpg

  +--------+
  |    50% |
  |--------|
  | 20.125 |
  +--------+

. ret li

scalars:
   r(mpg_50_1_epolate) =  0
           r(mpg_50_1) =  20.125

(The first returned result is a flag of whether extrapolation was used.
-iquantile- can be no better than anything else at estimating extreme
quantiles without detail in the tails. It warns about any
extrapolations.) 
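
Why extrapolation can arise is easy to see from the symmetric
definition: the smallest cumulative proportion is above 0 and the
largest is below 1, so sufficiently extreme quantiles fall outside the
range over which interpolation is possible. The following Mata sketch
shows one way such a flag could be computed for the toy data above. It
is an assumption about what the flag means, not the logic of
-iquantile- itself, and the names p and flag are made up.

```mata
mata:
y = (2, 3, 4, 5)                       // toy values
f = (2, 9, 8, 8)                       // toy frequencies
cup = (runningsum(f) :- f/2) / sum(f)  // runs from 1/27 to 23/27
p = (0.02, 0.5, 0.9)                   // proportions to check
// 1 where p lies outside [cup[1], cup[last]], so no bracketing pair exists
flag = (p :< cup[1]) :| (p :> cup[cols(cup)])
flag
end
```

Here the first and last proportions are flagged, as 0.02 is below 1/27
and 0.9 is above 23/27.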

. iquantile mpg [w=price]
(frequency weights assumed)

  +---------+
  |     50% |
  |---------|
  | 18.9035 |
  +---------+

By the way, one test of such a procedure is that 

p% quantile for y = -((100 - p)% quantile for (-y)) 

. iquantile mpg, p(25 50 75)

  +------------------------------+
  |      25%      50%        75% |
  |------------------------------|
  | 17.38461   20.125   24.55556 |
  +------------------------------+

. gen nmpg = -mpg

. iquantile nmpg, p(25 50 75)

  +---------------------------------+
  |       25%       50%         75% |
  |---------------------------------|
  | -24.55556   -20.125   -17.38461 |
  +---------------------------------+
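
The same check can be sketched in Mata on the toy data used earlier,
here for the 25% quantile. This is only an illustration of the test, not
of -iquantile-; it assumes values sorted in ascending order, so the
negated values are reversed before interpolating, and all names are made
up.

```mata
mata:
y = (2, 3, 4, 5)
f = (2, 9, 8, 8)
p = 0.25

// p quantile of y
cup = (runningsum(f) :- f/2) / sum(f)
j = sum(cup :< p)
q = y[j] + (p - cup[j]) / (cup[j+1] - cup[j]) * (y[j+1] - y[j])

// (1 - p) quantile of -y: negate and reverse so that values ascend again
yn = -y[(cols(y)..1)]
fn = f[(cols(f)..1)]
cupn = (runningsum(fn) :- fn/2) / sum(fn)
k = sum(cupn :< (1 - p))
qn = yn[k] + ((1 - p) - cupn[k]) / (cupn[k+1] - cupn[k]) * (yn[k+1] - yn[k])

q, -qn                                 // the two should agree
end
```

Both come out at 3 + 0.25/8.5, about 3.0294. The symmetric definition of
cumulative probability is what makes this work; the reflection test need
not hold under the asymmetric definition.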

I'll make the program available when I can. 

I am not sure where this leaves Taggert. I have not solved his problem,
preferring a different formulation and thus a different solution. I
think he does need a Stata program for what he wants, as it doesn't seem
to correspond to anything programmed. He might be able to steal some of
my code when it becomes public. 

Nick 
[email protected] 

Maarten Buis
============

I still don't understand. Where does the 8 in your example come from?
Where do the boundaries of your bins come from; are they assumptions or
is the question asked as a set of ranges? (e.g. do you earn less than x$,
between xx$ and xxx$, etc.?)

Taggert J. Brooks, PhD
======================

Thanks to Steve and Maarten for the suggestions. I might be missing
something but it doesn't seem as though there is an easy
implementation.

For clarification here is what is meant by fractional median.

Explanation of Fractional Median

Using the median becomes problematic when the data set contains large
numbers of repeated values. In the case of SEI scores, the student can
only choose 1 of 5 values, and so, by design, the set of SEI scores of
most classes contains large groups of the same value. And so, if the
regular median were calculated, there would only be a few possible
results for all classes/instructors. The fractional median is used to
provide a wider range of possible results, while still maintaining some
of the desirable properties of the regular median. In terms of the
mathematics, the fractional median provides a more continuous range of
outcomes instead of the discrete set possible with the regular median.

Here is the basic idea (and an example): To arrive at a continuous set
of outcomes, one assumes that each data value is the center of the true
set of values that could have been measured. For example, when a student
selects a 3 instead of a 2 or a 4, one can assume that if the student
were allowed to choose any numerical value from the real line, they
would have selected something between 2.5 and 3.5, and since they were
not allowed to list their exact observation, they selected the nearest
choice, in this case a 3.

We call this range of values associated with each observable measurement
(choice) a bin or a cell. The cell for the choice 1 is .5 to 1.5, for a
2 it is 1.5 to 2.5, etc. We now want to calculate the fractional median,
which estimates what the median would have been if the student could
have selected any real value (not just the 5 choices given). First we
determine what cell the standard median lives in. The fractional median
will be a value from the cell that contains the standard median. We then
determine how far into the cell the median actually is (again assuming
they could have selected any value in the cell). This gives the
fractional median.

Example Data Set: two 2's, nine 3's, eight 4's, and eight 5's. (I picked
an odd number of values because it is a little tricky; the even number
case is a bit easier.)

This is a total of 27 measurements (student scores). Half of 27 is 13.5,
and so if we look for the thirteen and a half value, we end up looking
between two 4's. So the median is a 4, which comes from the cell ranging
from 3.5 to 4.5, and so the fractional median will be between 3.5 and
4.5.

The fractional median in this case will be 3.5 plus the percentage of
the distance into the cell the middle value represents. So if it were
the case that the true median was the middle 4, then the percentage of
the distance into the cell would be 50%= .5, thus the fractional median
would be a 3.5+.5=4 (so the median equals the fractional median in this
situation). In our example, the 4 that represents the true median would
be between the 2nd and 3rd four (the 2.5th four, let's say) of the eight
4's in the cell, which is 2.5/8 ths of the way into the cell. Now, since
2.5/8=.3125, the fractional median for this example would be
3.5+.3125=3.8125. Note: if more of the 3's were 4's, then the "middle 4"
would be a greater distance into the cell, resulting in a higher
fractional median (but the regular median would still be 4).

Steven Samuels
==============

-pctile- with the -altdef- option gives an interpolated percentile
different from that in -sum-.  See if that's what you need.

Maarten Buis
============ 

I am not quite sure what you mean by fractional median, but it sounds
similar to what Nick Cox has done with -hdquantile-, see: 
-ssc describe hdquantile- and Nick's talk at the 2007 Nordic and Baltic
Stata Users' Group Meetings: 
http://ideas.repec.org/p/boc/nsug07/1.html

Taggert J. Brooks, PhD
======================

My university uses the fractional median when calculating scores from
student evaluation of instructors. I'm helping us move electronically
and in the process was using Stata to do some of the statistics.
Strangely I can't find an easy method (i.e. a pre-built ado) to calculate
the fractional median, nor can I find a reference anywhere. Am I missing
something? I thought I would check here before I venture off to try and
write my own.

