Thank you all, Steven, Nick and Austin, for your generous help. I finally get it.
best,
Cindy
----- Original Message ----
From: Austin Nichols <[email protected]>
To: [email protected]
Sent: Tuesday, 2 December, 2008 22:58:06
Subject: Re: st: ranking with weights
Cindy Gao--
I think the only error in Nick's code is one he already flagged
himself, i.e. the average of 1 and 2 is not 1 but 1.5. So the rank
need not start at 1 and need not end at the sum of weights (except in
the special case where the first/last obs has weight one and has no
ties). Perhaps the point is clearer in this example:
clear
input exp freq
1000 1
1000 1
2000 5999
2000 9000
3000 8000
3000 4000
3000 4000
10000 1000
end
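* total weight within each group of tied exp values
bysort exp: gen tf = sum(freq)
by exp: replace tf = tf[_N]
* flag the first obs of each group so each group total is counted once
by exp: gen first = _n == 1
* running total of weights up to and including the current group
gen rank = sum(tf * first)
* step back to the middle of the group's ranks: the average rank of the ties
replace rank = rank - (tf - 1)/2
* p: rank as a fraction of the total weight
gen p = sum(freq)
replace p = rank/p[_N]
* optional shift by half of the first value of p (discussed below)
local adj = p[1]/2
replace p = p - `adj'
list, noobs clean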
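
As a cross-check on the rank logic, here is a minimal sketch that compares
the weighted rank with the brute-force route of -expand- followed by
-egen, rank()-, whose default gives tied values their average rank. The
small dataset and the variable name check are made up for illustration
and are not part of the example above.

```stata
clear
input exp freq
1000 1
1000 1
2000 3
2000 2
3000 4
end
* weighted average rank, computed as in the example above
bysort exp: gen tf = sum(freq)
by exp: replace tf = tf[_N]
by exp: gen first = _n == 1
gen rank = sum(tf * first)
replace rank = rank - (tf - 1)/2
* brute force: one row per unit of weight, then the default egen rank()
expand freq
egen check = rank(exp)
* stops with an error if the two ranks ever differ
assert rank == check
```

Like the example itself, this check only makes sense when every freq is a
positive integer.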
The above assumes no missing values or zero weights; real data may
have missing or zero freq, or missing exp, and the code would then need
modification depending on what you hope to achieve. (E.g. should a
person with zero weight get the rank of tied cases or a missing rank?
What if there are no tied cases?)
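
One possible answer to those questions, sketched under the assumption
that observations with missing exp, missing freq or nonpositive freq
should simply get a missing rank and contribute nothing to the ranks of
others. The marker variable touse and the small dataset are hypothetical;
other choices are equally defensible.

```stata
clear
input exp freq
1000 1
1000 0
2000 3
.    5
3000 .
3000 2
end
* touse = 1 only if exp is nonmissing and freq is positive
gen byte touse = !missing(exp, freq) & freq > 0
* group totals computed separately for usable and unusable observations
bysort touse exp: gen tf = sum(freq)
by touse exp: replace tf = tf[_N]
by touse exp: gen byte first = _n == 1
* unusable observations sort first and add zero to the running total
gen rank = sum(tf * first * touse)
replace rank = rank - (tf - 1)/2
* unusable observations get a missing rank
replace rank = . if !touse
list, noobs clean
```

With weights 1, 3 and 2 on the three usable observations, the ranks are
1, 3 and 5.5.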
The variable p rescales the rank to lie between 0 and 1. The odd-looking
`adj' step is about whether you want p to range from w[1]>0 to 1 or to
range from w[1]/2 to 1-w[1]/2, where w[1] is evidently the first
unadjusted value of p (adj is p[1]/2). Some find the second range more
intuitively appealing, and it is also handy if you want to apply various
transformations to p that require it to be strictly between 0 and 1.
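
To see what the `adj' step changes, here is a small sketch for the
special case in which the last observation has weight one and no ties,
so that the unadjusted p reaches exactly 1. The data and the names W,
p_raw and p_adj are invented for illustration; exp has no ties here, so
the group total is just freq.

```stata
clear
input exp freq
1 2
2 3
3 1
end
* data are already sorted by exp and have no ties, so tf would equal freq
gen rank = sum(freq) - (freq - 1)/2
gen W = sum(freq)
* unadjusted: rank as a fraction of the total weight
gen p_raw = rank/W[_N]
* adjusted: shift down by half of the first unadjusted value
local adj = p_raw[1]/2
gen p_adj = p_raw - `adj'
list, noobs clean
```

Here p_raw ends at exactly 1, while p_adj stays strictly inside (0, 1),
which matters for transformations such as the logit or the inverse
normal.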
On Tue, Dec 2, 2008 at 4:59 PM, Nick Cox <[email protected]> wrote:
> My bias is to believe my code to be good until you show that it isn't. You haven't done that so far as I can see.
>
> The principle I am using is that used generally throughout statistics, that the rank applied to a bunch of tied values is (a) the same for all those tied values (b) the average of the ranks that would have been applied had those values all been distinct but otherwise still lower than all higher values and higher than all lower values. Thus, the average rank for the lowest 18000 "observations" is 9000 (or so; subject to the detail mentioned in my previous post). This is equivalent to what you would have got with your hypothetical route starting with -expand-.
>
> The example dataset deliberately included cases in which particular expenditures occurred once and also cases in which other particular expenditures occurred more than once. As far as I can see, the "ranks" check out regardless.
>
> Naturally, if you want another definition of ranks, you need different code. As Steve Samuels has I think implied from a different but not contradictory viewpoint, your use of ranks in this context is a bit iffy in the presence of (massively) tied data and you can't expect to keep all the properties of ranks that you might desire or expect.
>
> See also the discussion of ranks in the manual entry for -egen-. Some years ago (~1999) when programming what became the -track- and -field- options of -egen, rank()- I came up with those names because I couldn't find any (statistical or other) literature discussion of alternative ranking conventions, although it was evident from sports that they exist. I still haven't seen any despite continually twitching antennae. Names apart, the manual entry does give details on various different reasonable interpretations of ranks.
>
> Nick
> [email protected]
>
> Cindy Gao
>
> Thank you very much, this helps a lot. However, I wonder whether there is a small "error" or whether I am just misunderstanding. Should your last line of code (replace rank = rank - 0.5 * totalfreq) perhaps apply only to observations that are tied (i.e. that have the same expenditure as other observations)? Otherwise, for example, the first observation in your example, which is not tied, is ranked 9000 instead of its weight of 18000. I therefore tried a small modification to your code (by expenditure: replace rank = rank - 0.5 * totalfreq if _N != 1). With this change, the rank of the last observation (which is not tied) equals the sum of all the weights, whereas with your original code it is less than the sum of all the weights (by half the weight of the last observation). I am not sure whether to use my modification, or whether I am just confused and should stick with Nick's original suggestion.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/