Statalist



st: RE: RE: RE: Fractional Median


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: RE: RE: Fractional Median
Date   Tue, 20 Jan 2009 13:16:47 -0000

Whoever Dennis Miller is, his fame has not reached here. Is he a Stata
user? 

I don't vote for spurious precision, either. It's down there with mouldy
apple pie. Gauss and Aristotle both said so too, but more elegantly. 

But there is still an issue between the purists and the pragmatists. 

Some of the purists were traumatised by a course in measurement theory
in graduate school. That can have a lifelong effect, akin to the first
really rigorous calculus course, which leaves students shaking and
convinced that up to now they have just been waving arms and making
coarse animal noises. Fortunately I did no such course (in either case).
"You must pay attention to how the data were produced! These are ordinal
scores and should be treated as such!" 

Well, yes, except that the pragmatists have a case too. Consider these
scores from recent Mata courses by different instructors: 

Gould: 3 4 4 4 4 4 5 5 
Cox:   3 3 3 4 4 4 4 5 

Dr Gould gave a tough course. Dr Cox gave a tough course too. In fact,
if we summarize these scores using "legitimate" summary measures, we
see that these instructors had the same median (4) and mode (4). Dr
Foobar down the hall got all 5s and gets the promotion. 

The pragmatist (meaning, in my book, the good data analyst) looks at
these data and sees information there that should be dug out. There is a
systematic difference between these score sets. The data analyst should
want information-rich summaries! 

In fact, my university happily takes means of such data, and, whatever
anyone says, that usually works fine. If you do that you get 

Gould 4.125 
Cox   3.75

Cox is the one to receive more counselling. What is the problem with his
teaching? 
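
For anyone who wants to check these summaries outside Stata, here is a minimal sketch in Python using only the standard library. The dictionary name `scores` is just for illustration; the numbers are the fictional scores given above.

```python
from statistics import mean, median, mode

# The two (fictional) sets of course scores quoted above.
scores = {
    "Gould": [3, 4, 4, 4, 4, 4, 5, 5],
    "Cox":   [3, 3, 3, 4, 4, 4, 4, 5],
}

# Median and mode cannot separate the instructors; the mean can.
summary = {name: (median(s), mode(s), mean(s)) for name, s in scores.items()}

for name, (med, mod, avg) in summary.items():
    print(f"{name:5s}  median={med}  mode={mod}  mean={avg}")
```

Both instructors come out with median 4 and mode 4, while the means are 4.125 and 3.75.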

I've written a fractional median program too. Most of the code is just
the same as -iquantile-. It needs a little care for special cases.
First, if the median comes out as the average of two different integers,
as it can do if the number of values is even, I take that to be the end
of the story. Second, if all the values are the same, that's the median
too. 

With these data (pure fiction, perhaps I should say): 

The fractional medians are 4.1 and 3.75. 

The interpolated medians are 4.143 and 3.714 (3 d.p.). 

No surprises there, naturally. 
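
The fractional medians quoted above can be reproduced with a short sketch. It is written in Python rather than Stata, it is only one reading of the cell-based definition quoted further down the thread plus the two special cases just described, and it is not the code being sent to Taggert. The function name `fractional_median` and the `width` argument are hypothetical.

```python
from collections import Counter

def fractional_median(values, width=1.0):
    """Cell-based fractional median; value v is assumed to own the cell
    from v - width/2 to v + width/2."""
    x = sorted(values)
    n = len(x)
    lo, hi = x[(n - 1) // 2], x[n // 2]     # the one or two middle values
    if lo != hi:
        # Special case 1: the median is the average of two different values.
        return (lo + hi) / 2
    if x[0] == x[-1]:
        # Special case 2: all values are the same, so that is the median.
        return float(lo)
    below = sum(v < lo for v in x)           # observations below the median cell
    in_cell = Counter(x)[lo]                 # observations inside the median cell
    return lo - width / 2 + width * (n / 2 - below) / in_cell

gould = fractional_median([3, 4, 4, 4, 4, 4, 5, 5])
cox = fractional_median([3, 3, 3, 4, 4, 4, 4, 5])
print(round(gould, 3), round(cox, 3))
```

Under these assumptions the two results agree with the 4.1 and 3.75 reported above.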

I'll send the code to Taggert. 

Nick 
[email protected] 

Taggert J. Brooks
=================

Nick-

As Dennis Miller might say "you sir are a mensch". You might not have
solved my problem, but you've answered my question. There is no
pre-existing ado.

I should say that was not my explanation, nor is it my preferred
statistic. I've always felt it was a bit much to impart precision where
it did not exist. To clarify a bit more, the data come from the
following questions (assigning 5 for strongly agree, then 4, 3, 2, and 1
for strongly disagree):

The instructor was helpful to students.
    Strongly agree   5   4   3   2   1   Strongly disagree
The instructor was well prepared.
    Strongly agree   5   4   3   2   1   Strongly disagree
The instructor communicated the subject matter clearly.
    Strongly agree   5   4   3   2   1   Strongly disagree
I learned a great deal from this instructor.
    Strongly agree   5   4   3   2   1   Strongly disagree
Overall, this instructor was excellent.
    Strongly agree   5   4   3   2   1   Strongly disagree

A composite of the questions is created, and a single "fractional
median" is produced for use in tenure, retention and promotion
decisions. 

Funny thing is, if you search Google for "fractional median", our
university's document is the first hit. Anyhow, since they actually aggregate
responses it is not a composite of questions, but a composite of
responses, an important distinction if some of the questions above
suffer from item non-response.

Anyhow, I appreciate the effort. I'll take a stab at altering your
-iquantile- code and post the results.

Nick Cox
========

Like Maarten, I struggled a bit with the word descriptions here. I'd
rather see algebra or code! 

I've tried to reconstruct a more general version of this problem for
myself. I don't find the term "fractional median" especially
transparent. I prefer "interpolated median". (The result need not be
fractional, although it usually will be.) 

What Taggert wants is evidently not -_pctile, altdef-, which produces 4
for his toy dataset 

value freq 
2      2
3      9
4      8
5      8 

For the same toy data -hdquantile- from SSC gives 3.8491074 for the
median, but although it produces a sensible answer it uses a quite
different logic that would be a challenge to explain to non-technical
audiences. 

A relatively simple procedure to define (although not necessarily to
implement) is interpolation within the cumulative distribution function
or equivalently the quantile function. Although I'll plump for linear
interpolation -- there is room for discussion about that -- quite what
definition of either function to use is a little tricky. The usual
definition of cumulative probability 

proportion <= this value 

is asymmetric, and I think it would be better practice to use a
symmetric definition 

(proportion < this value) + (1/2) (proportion at this value) 

The difference can be substantial for highly lumped distributions, as
are being considered here. For other Stata uses of this definition, see
the -ridit()- function in -egenmore- from SSC or the -midpoint- option
in -distplot- from the Stata Journal files. 
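
To see how far apart the two definitions are for lumpy data, here is a small sketch in Python (standard library only) that applies both to the toy dataset above. The names `p_le` and `p_mid` are made up for this illustration.

```python
from itertools import accumulate

y = [2, 3, 4, 5]          # distinct values of the toy dataset
f = [2, 9, 8, 8]          # their frequencies
n = sum(f)                # 27 observations
cum = list(accumulate(f)) # cumulative frequencies: 2, 11, 19, 27

# Usual, asymmetric definition: proportion <= this value.
p_le = [c / n for c in cum]

# Symmetric definition: (proportion < this value) + (1/2)(proportion at this value).
p_mid = [(c - fi / 2) / n for c, fi in zip(cum, f)]

for value, a, b in zip(y, p_le, p_mid):
    print(f"{value}   <= : {a:.4f}   midpoint: {b:.4f}")
```

At the value 3, for example, the usual definition gives 11/27 while the symmetric one gives 6.5/27.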

If we fire up Mata as a calculator, we can first enter the data as
values and frequencies: 

: y = 2, 3, 4, 5

: f = 2, 9, 8, 8

Then work out the cumulative frequencies 

: runningsum(f)
        1    2    3    4
    +---------------------+
  1 |   2   11   19   27  |
    +---------------------+

-- and subtract half the frequencies, then divide by the total (27), to
get the cumulative proportions, symmetrically considered. 

: runningsum(f) :- f/2
         1     2     3     4
    +-------------------------+
  1 |    1   6.5    15    23  |
    +-------------------------+

: (runningsum(f) :- f/2) / 27
                 1             2             3             4
    +---------------------------------------------------------+
  1 |   .037037037   .2407407407   .5555555556   .8518518519  |
    +---------------------------------------------------------+

: cup = (runningsum(f) :- f/2) / 27

We see that we need to interpolate between the 2nd and 3rd values. 

: y[2] + (0.5 - cup[2]) / (cup[3] - cup[2])
  3.823529412

where (y[3] - y[2]) is omitted as it equals 1. 
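
The same calculation can be written as a small general function that keeps the (y[3] - y[2]) factor, so that the values need not be adjacent integers. This is a sketch in Python, not the program discussed in this thread; the name `interpolated_quantile` is hypothetical, the values are assumed to be distinct and sorted with positive frequencies, and extrapolation beyond the first or last cumulative proportion is deliberately not handled.

```python
from itertools import accumulate

def interpolated_quantile(values, freqs, p=0.5):
    """Linear interpolation in the symmetric (midpoint) cumulative
    proportions. Assumes distinct values in ascending order."""
    n = sum(freqs)
    cup = [(c - f / 2) / n for c, f in zip(accumulate(freqs), freqs)]
    for i in range(len(values) - 1):
        if cup[i] <= p <= cup[i + 1]:
            frac = (p - cup[i]) / (cup[i + 1] - cup[i])
            return values[i] + frac * (values[i + 1] - values[i])
    raise ValueError("p is outside the range of the cumulative proportions")

toy_median = interpolated_quantile([2, 3, 4, 5], [2, 9, 8, 8])
print(round(toy_median, 9))   # 3.823529412
```

Applied to the two sets of course scores given earlier (frequencies 1, 5, 2 and 3, 4, 1 for the values 3, 4, 5), the same function returns 4.143 and 3.714 to 3 d.p.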

Taggert's result is similar, but his logic is not identical. He gets
3.8125.  

Not surprisingly I prefer my way of thinking about it! I don't assume
that the data are adjacent integers or even integers. The little details
about cells and insisting that the fractional median belongs in the same
cell as the regular median strike me as rather arbitrary and artificial.


I think my procedure deserves a program and I've written one. The
generalisations desirable seem to be to support 

1. One or more quantiles, not just medians. 
2. One or more variables. 
3. One or more groups.  
4. Weights. 
5. Saved results. 

-- among some other details. Here is some sample output: 

. sysuse auto

. iquantile mpg, p(25(25)75) by(rep78) format(%2.1f)

  +----------------------------+
  | rep78    25%    50%    75% |
  |----------------------------|
  |     1   18.0   21.0   24.0 |
  |     2   16.5   18.0   22.7 |
  |     3   17.0   19.3   21.4 |
  |     4   17.3   22.3   25.0 |
  |     5   17.9   30.0   34.5 |
  +----------------------------+

. iquantile turn trunk, p(25(25)75) by(rep78) format(%2.1f)

  +----------------------------------------------------------------------------+
  | rep78   25% turn   50% turn   75% turn   25% trunk   50% trunk   75% trunk |
  |----------------------------------------------------------------------------|
  |     1       40.0       41.0       42.0         7.0         8.5        10.0 |
  |     2       41.0       43.5       45.5        10.6        16.2        17.0 |
  |     3       38.2       42.0       43.4        12.4        15.8        17.1 |
  |     4       34.4       37.3       42.6         8.8        14.0        18.0 |
  |     5       35.1       35.9       36.6         9.3        11.0        14.1 |
  +----------------------------------------------------------------------------+

. iquantile mpg

  +--------+
  |    50% |
  |--------|
  | 20.125 |
  +--------+

. ret li

scalars:
   r(mpg_50_1_epolate) =  0
           r(mpg_50_1) =  20.125

(The first returned result is a flag of whether extrapolation was used.
-iquantile- can be no better than anything else at estimating extreme
quantiles without detail in the tails. It warns about any
extrapolations.) 

. iquantile mpg [w=price]
(frequency weights assumed)

  +---------+
  |     50% |
  |---------|
  | 18.9035 |
  +---------+

By the way, one test of such a procedure is that 

p% quantile for y = -((100 - p)% quantile for (-y)) 

. iquantile mpg, p(25 50 75)

  +------------------------------+
  |      25%      50%        75% |
  |------------------------------|
  | 17.38461   20.125   24.55556 |
  +------------------------------+

. gen nmpg = -mpg

. iquantile nmpg, p(25 50 75)

  +---------------------------------+
  |       25%       50%         75% |
  |---------------------------------|
  | -24.55556   -20.125   -17.38461 |
  +---------------------------------+
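
The same symmetry test can be tried on any implementation of the idea. Below is a sketch in Python that applies it to the toy dataset from earlier in this message rather than to the auto data. It is not -iquantile- itself: the function name `interpolated_quantile` is hypothetical, it takes raw observations, and it does not extrapolate.

```python
from collections import Counter
from itertools import accumulate

def interpolated_quantile(data, p):
    """Quantile by linear interpolation in midpoint cumulative proportions."""
    counts = sorted(Counter(data).items())       # (value, frequency), ascending
    values = [v for v, _ in counts]
    freqs = [f for _, f in counts]
    n = len(data)
    cup = [(c - f / 2) / n for c, f in zip(accumulate(freqs), freqs)]
    for i in range(len(values) - 1):
        if cup[i] <= p <= cup[i + 1]:
            frac = (p - cup[i]) / (cup[i + 1] - cup[i])
            return values[i] + frac * (values[i + 1] - values[i])
    raise ValueError("p is outside the range of the cumulative proportions")

data = [2] * 2 + [3] * 9 + [4] * 8 + [5] * 8
negated = [-v for v in data]

# For each p, compare the p quantile of y with minus the (1 - p) quantile of -y.
checks = {
    p: (interpolated_quantile(data, p), -interpolated_quantile(negated, 1 - p))
    for p in (0.25, 0.5, 0.75)
}
symmetric = all(abs(a - b) < 1e-12 for a, b in checks.values())
print(symmetric)
```

The symmetric definition of cumulative proportion is what makes this work; with "proportion <= this value" the two sides would generally differ.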

I'll make the program available when I can. 

I am not sure where this leaves Taggert. I have not solved his problem,
preferring a different formulation and thus a different solution. I
think he does need a Stata program for what he wants, as it doesn't seem
to correspond to anything programmed. He might be able to steal some of
my code when it becomes public. 

Nick 
[email protected] 

Maarten Buis
============

I still don't understand. Where does the 8 in your example come from?
Where do the boundaries of your bins come from; are they assumptions or
is the question asked as a set of ranges? (e.g. do you earn less than
x$, between xx$ and xxx$, etc.?)

Taggert J. Brooks, PhD
======================

Thanks to Steve and Maarten for the suggestions. I might be missing
something, but it doesn't seem as though there is an easy
implementation.

For clarification here is what is meant by fractional median.

Explanation of Fractional Median

Using the median becomes problematic when the data set contains large
numbers of repeated values. In the case of SEI scores, the student can
only choose 1 of 5 values, and so, by design, the set of SEI scores of
most classes contains large groups of the same value. And so, if the
regular median were calculated, there would only be a few possible
results for all classes/instructors. The fractional median is used to
provide a wider range of possible results, while still maintaining some
of the desirable properties of the regular median. In terms of the
mathematics, the fractional median provides a more continuous range of
outcomes instead of the discrete set possible with the regular median.

Here is the basic idea (and an example): To arrive at a continuous set
of outcomes, one assumes that each data value is the center of the true
set of values that could have been measured. For example, when a student
selects a 3 instead of a 2 or a 4, one can assume that if the student
were allowed to choose any numerical value from the real line, they
would have selected something between 2.5 and 3.5, and since they were
not allowed to list their exact observation, they selected the nearest
choice, in this case a 3.

We call this range of values associated with each observable measurement
(choice) a bin or a cell. The cell for the choice 1 is .5 to 1.5, for a
2 it is 1.5 to 2.5, etc. We now want to calculate the fractional median,
which estimates what the median would have been if the student could
have selected any real value (not just the 5 choices given). First we
determine what cell the standard median lives in. The fractional median
will be a value from the cell that contains the standard median. We then
determine how far into the cell the median actually is (again assuming
they could have selected any value in the cell). This gives the
fractional median.

Example Data Set: two 2's, nine 3's, eight 4's, and eight 5's. (I picked
an odd number of values because it is a little tricky; the even number
case is a bit easier.)

This is a total of 27 measurements (student scores). Half of 27 is 13.5,
and so if we look for the thirteen and a half value, we end up looking
between two 4's. So the median is a 4, which comes from the cell ranging
from 3.5 to 4.5, and so the fractional median will be between 3.5 and
4.5.

The fractional median in this case will be 3.5 plus the percentage of
the distance into the cell the middle value represents. So if it were
the case that the true median was the middle 4, then the percentage of
the distance into the cell would be 50%= .5, thus the fractional median
would be a 3.5+.5=4 (so the median equals the fractional median in this
situation). In our example, the 4 that represents the true median would
be between the 2nd and 3rd four (the 2.5th four, let's say) of the eight
4's in the cell, which is 2.5/8 ths of the way into the cell. Now, since
2.5/8=.3125, the fractional median for this example would be
3.5+.3125=3.8125. Note: if more of the 3's were 4's, then the "middle 4"
would be a greater distance into the cell, resulting in a higher
fractional median (but the regular median would still be 4).
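
The arithmetic of this example can be traced step by step. The following is a sketch in Python under the assumptions stated above (cells of width 1 centred on each choice); the variable names are invented for the illustration and this is not an existing Stata routine.

```python
# Example data: two 2's, nine 3's, eight 4's, and eight 5's.
freq = {2: 2, 3: 9, 4: 8, 5: 8}

n = sum(freq.values())        # 27 scores
target = n / 2                # 13.5, the position of the median

# Walk up the values until the cell containing position 13.5 is reached.
running = 0                   # number of scores below the current cell
for value in sorted(freq):
    if running + freq[value] >= target:
        break
    running += freq[value]

cell_lower = value - 0.5                      # the cell for 4 starts at 3.5
depth = (target - running) / freq[value]      # 2.5 / 8 = 0.3125 into the cell
fractional_median = cell_lower + depth

print(value, running, depth, fractional_median)
```

This lands in the cell for 4 with 11 scores below it, giving 3.5 + 0.3125 = 3.8125 as in the text.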

Steven Samuels
==============

-pctile- with the -altdef- option gives an interpolated percentile
different from that in -sum-.  See if that's what you need.

Maarten Buis
============ 

I am not quite sure what you mean with fractional median, but it sounds
similar to what Nick Cox has done with -hdquantile-, see: 
-ssc describe hdquantile- and Nick's talk at the 2007 Nordic and Baltic
Stata Users' Group Meetings: 
http://ideas.repec.org/p/boc/nsug07/1.html

Taggert J. Brooks, PhD
======================

My university uses the fractional median when calculating scores from
student evaluation of instructors. I'm helping us move electronically
and in the process was using Stata to do some of the statistics.
Strangely I can't find an easy method (i.e. a pre-built ado) to calculate
the fractional median, nor can I find a reference anywhere. Am I missing
something? I thought I would check here before I venture off to try and
write my own.
