We want to agree on a method for producing quantiles so we are all
working to the same algorithm. I was intrigued by the way Stata does it
and wondered where this came from and what the justification is.
If we have 150 observations to be grouped into quintiles, this is easy:
30 in each group. But what if we had 151, 152, 153 or 154 observations?
This is how Stata 9 does it using -xtile- (one column per sample size,
with the total shown in the bottom row):
xtile newvar=rank, nquantiles(5)
----------------------------
q1      31   31   31   31
q2      30   30   31   31
q3      30   31   30   31
q4      30   30   31   31
q5      30   30   30   30
----------------------------
All    151  152  153  154
and using the -cut- function from -egen- :
egen q2=cut(rank), group(5)
----------------------------
q0      30   30   30   30
q1      30   30   31   31
q2      30   31   30   31
q3      30   30   31   31
q4      31   31   31   31
----------------------------
All    151  152  153  154
So the two methods work in opposite directions, but are otherwise
consistent in where they place the 'extra' 1 to 4 observations.
I am quite happy to adopt the Stata approach, but some of my colleagues
do not use Stata, so I would like to describe how the Stata algorithm
works, and why Stata does it this way as opposed to any other way. Is
this a general convention, or easier to justify statistically or
otherwise, or just a case of finding a way that works and sticking with
it?
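
For colleagues without Stata, the counts in both tables can be reproduced with the short Python sketch below. It is only a sketch under two assumptions that I have inferred from the tables, not confirmed from the Stata source: that -xtile- cuts at percentiles found at position P = N*i/k in the sorted data (averaging x[P] and x[P+1] when P is a whole number, otherwise taking x[floor(P)+1]), and that cut() with group(k) starts group i at sorted position int(i*N/k)+1. The function names xtile_sizes and cut_sizes are made up for illustration.

```python
from collections import Counter


def xtile_sizes(n, k):
    """Group sizes for ranks 1..n under the assumed xtile-style rule."""
    ranks = list(range(1, n + 1))
    cuts = []
    for i in range(1, k):
        p = n * i // k  # floor of the position P = n*i/k
        if (n * i) % k == 0:
            # P is a whole number: average x[P] and x[P+1] (1-based).
            cuts.append((ranks[p - 1] + ranks[p]) / 2)
        else:
            # P is fractional: take x[floor(P)+1] (1-based).
            cuts.append(ranks[p])
    # A value's group is 1 plus the number of cut points strictly below it.
    groups = Counter(1 + sum(c < r for c in cuts) for r in ranks)
    return [groups[g] for g in range(1, k + 1)]


def cut_sizes(n, k):
    """Group sizes for ranks 1..n under the assumed cut(), group(k) rule."""
    ranks = list(range(1, n + 1))
    # Assumed rule: group i (0-based) starts at sorted position int(i*n/k)+1.
    starts = [ranks[i * n // k] for i in range(k)]
    # A value's group is the last group whose start it has reached.
    groups = Counter(sum(s <= r for s in starts) - 1 for r in ranks)
    return [groups[g] for g in range(k)]


if __name__ == "__main__":
    for n in range(150, 155):
        print(n, xtile_sizes(n, 5), cut_sizes(n, 5))
```

Under these assumptions the asymmetry comes only from how a fractional position is rounded: the first rule keeps the boundary observation in the lower group, while the second makes it the first observation of the upper group.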
Many thanks
Daniel Dedman
Public Health Information Analyst/Project Manager
North West Public Health Observatory