st: "logistic scores"
My questions come at the end.
It's a habit of mine to revisit my favourite books.
Looking again at
Mosteller, F. and Tukey, J.W. 1977. Data analysis and regression.
Reading, MA: Addison-Wesley. Chs 5F, 5H, 11F, 11G.
I found a very Tukeyish way of mapping the frequencies
of a set of ordered categories (grades) to numerical scores.
Each category is treated as a slice from a standard
logistic distribution and what is returned is a centre
of gravity for that slice. The recipe is first to
calculate the cumulative probability p of being below
each grade and the cumulative probability P of being
at or below each grade and then, defining
phi(p) = p ln p + (1 - p) ln (1 - p),
to calculate scores that are
(phi(P) - phi(p)) / (P - p).
(I've not re-created the derivation for myself.)
I call these "logistic scores".
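One way to see why the recipe yields a centre of gravity is sketched below. It is a reconstruction, not taken from Mosteller and Tukey, and it assumes the standard logistic with distribution function F(x) = 1 / (1 + exp(-x)) and quantile function Q(u) = ln(u / (1 - u)).

```latex
% Mean of a standard logistic variable X within the slice whose
% cumulative probabilities run from p to P, substituting u = F(x):
\[
E\bigl[X \mid Q(p) < X \le Q(P)\bigr]
  = \frac{1}{P - p} \int_{p}^{P} \ln\frac{u}{1 - u}\, du .
\]
% With \phi(u) = u \ln u + (1 - u) \ln(1 - u), differentiation gives
\[
\phi'(u) = (\ln u + 1) - (\ln(1 - u) + 1) = \ln\frac{u}{1 - u},
\]
% so \phi is an antiderivative of the quantile function and
\[
\frac{1}{P - p} \int_{p}^{P} \ln\frac{u}{1 - u}\, du
  = \frac{\phi(P) - \phi(p)}{P - p} .
\]
```

On this reading, each slice runs from ln(p / (1 - p)) to ln(P / (1 - P)) on the logistic scale, and the score is the mean of the logistic within it.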
The logistic is justified by Mosteller and Tukey
as convenient to work with, and as giving similar
results to Gaussian and Cauchy alternatives anyway.
Computational ease is naturally less compelling in
2007 than it was in 1977, but simple and useful
still wins every time in the absence of better
alternatives.
This kind of thing goes nicely in Mata and here
is a function to do it:
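mata:
// NJC 16 March 2007
// cf. Mosteller, F. and Tukey, J.W. 1977. Data analysis and regression.
// Reading, MA: Addison-Wesley. Chs 5F, 5H, 11F, 11G.
real colvector logistic_scores(real colvector freq)
{
    real colvector P, p, zero, z
    real scalar i, k

    // P = cumulative proportions at or below each grade
    k = rows(freq)
    P = freq
    for (i = 2; i <= k; i++) {
        P[i] = P[i - 1] + P[i]
    }
    P = P / P[k]

    // phi() is never positive, so rowmin() against zero returns phi()
    // itself, or 0 where a term like 0 * ln(0) evaluates to missing
    zero = J(k, 1, 0)
    z = rowmin((zero, P :* ln(P) + (1 :- P) :* ln(1 :- P)))

    // p = cumulative proportions below each grade
    p = 0 \ P[1..k-1]
    z = z - rowmin((zero, p :* ln(p) + (1 :- p) :* ln(1 :- p)))

    // centre of gravity of each slice
    z = z :/ (P - p)
    return(z)
}
end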
A detail that requires care is handling terms like p ln p when p is zero,
so that ln p is undefined (missing in Mata), and likewise
(1 - P) ln (1 - P) when P is one. It is natural
mathematically to regard each such product as zero, its limiting value, but you have
to spell that out to Mata. The ? : construct seems less useful here
than comparing directly with a vector of zeros.
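The zero handling can be looked at in isolation. What follows is a minimal sketch, assuming only built-in Mata functions (ln() and editmissing()); the helper name phi_safe() is hypothetical and is not part of the function above.

```mata
mata:
// phi(p) = p ln p + (1 - p) ln(1 - p), with 0 * ln(0) treated as 0
// phi_safe() is a hypothetical helper name, for illustration only
real colvector phi_safe(real colvector p)
{
    real colvector raw

    // ln(0) is missing in Mata, so raw is missing wherever p is 0 or 1
    raw = p :* ln(p) + (1 :- p) :* ln(1 :- p)

    // phi is never positive, so replacing missings by 0 has the same
    // effect as rowmin() against a column of zeros in logistic_scores()
    return(editmissing(raw, 0))
}

// the end points give 0; the middle value is phi(0.5) = ln(0.5)
phi_safe((0 \ 0.5 \ 1))
end
```

Both approaches rest on the same convention. editmissing() states it more directly, while the rowmin() comparison avoids a temporary vector.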
Anyway, using the example in Mosteller and Tukey (1977, p.106)
of grades A .. E, we type in a vector of frequencies and
get scores:
: freq = (127\497\3243\231\74)
: logistic_scores(freq)
                  1
    +----------------+
  1 |  -4.476586375  |
  2 |   -2.39817005  |
  3 |    .206295676  |
  4 |   3.115523631  |
  5 |   5.023164169  |
    +----------------+
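One quick sanity check on these scores is offered here as a sketch, not as anything from Mosteller and Tukey. Each P becomes the next grade's p, so the numerators phi(P) - phi(p) telescope to phi(1) - phi(0), which is zero under the convention that 0 ln 0 = 0. The frequency-weighted mean of the scores should therefore be zero up to rounding error. This assumes logistic_scores() has been defined as above.

```mata
mata:
freq = (127\497\3243\231\74)
z = logistic_scores(freq)

// frequency-weighted mean of the scores; expected to be
// zero apart from floating-point rounding
sum(freq :* z) / sum(freq)
end
```

A result far from zero would suggest a slip in the cumulative proportions or in the zero handling.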
My questions:
1. My impression is that there is a tenuous connection
here with what ordered logit does, but I don't think
the latter is quite equivalent, even indirectly, because
it works with cutpoints between grades, not the grades
themselves. Someone well into that and similar models may care
to comment.
By the way, I am pretty clear (perhaps wrongly) that I
am not asking about correspondence analysis here, which
I think requires a two-way table to do its magic. I
am only interested for the moment in recipes for single variables.
2. I have a hard time finding examples of this
device of Mosteller and Tukey ever being used, apart from a
couple of instances in educational statistics. Others may exist, and
I may simply be looking in the wrong places. If anyone, especially on the
biostatistical side, recognises this as a standard tool, or can
say what people do instead, please say so.
Nick