Title | Calculating percentile ranks or plotting positions | |
Author | Nicholas J. Cox, Durham University, UK |
Some kinds of data are often reported as percentile ranks. A test score may be reported as a percentile rank of 95% if 95% of scores are less than or equal to that score. A newborn baby’s weight may be reported in the same way.
The idea of a plotting position is essentially similar, except that conventionally plotting positions are reported as proportions rather than percentages. Their name stems from their use in probability plots. Given a set of ordered values, proportions are assigned recording the fraction of values at or below each value. These proportions may then be used directly or indirectly in plots. For example, a quantile–quantile plot for testing normality of distribution compares observed quantiles with quantiles for a sample of the same size from a normal distribution, determined by evaluating the quantile (inverse distribution) function for the normal at the plotting positions.
Official Stata has no functions for these calculations, although user-defined functions may be found. They may, however, be performed easily using results from other functions. The lack of built-in functions has one advantage: it obliges users to select the precise procedure to be used, given that several slightly different rules exist.
The calculation of the empirical cumulative distribution, at least as implemented in Stata, is a related but subtly different problem. In official Stata, this calculation may be performed using cumul. cumul by default appears to be most closely geared to preparing a plot of cumulative probabilities (y axis) versus observed values (x axis). When tied values occur, cumul will not (again, by default) assign the same cumulative probability to each tied value. This is immaterial and indeed even desirable for graphical purposes: vertical line segments will be produced whatever the convention about ties, and when visible plot symbols are used, it is often helpful to get an impression of the frequency of ties. In contrast, this FAQ is focused mainly on problems in which identical values should all be assigned the same proportion or percentage. Incidentally, an equal option was added to cumul on 17 June 2003 to override the default and ensure that this condition is satisfied.
To assign ranks, we might sort on a variable (say, mpg from the auto dataset) and then use the observation number:
. sysuse auto . sort mpg . gen rank = _n
As sorting in Stata puts lowest values first, the lowest value—or more precisely, the value sorted to the top of the dataset—would be assigned rank 1. This is also the most common statistical convention. However, this code is too simple to work perfectly except in the simplest situation with no tied values and no missing values. For example, looking at data for mpg, we see that the first two values after sorting are tied at 12, yet these are assigned ranks 1 and 2 by the generate command above. In statistics it is common to assign the same rank to each of several tied values such that the sum of the ranks is preserved. Ranks 1 and 2 give a sum of 3 to be preserved, so each would become rank 1.5 instead. We could carry out this reassignment for all tied values ourselves, but it is convenient that it is encapsulated in egen’s rank() function:
. egen rank = rank(mpg)
This function also takes care of any missing values. The sort command would sort any (numeric) missing values to the end of the dataset, so that gen rank = _n would then assign them the highest ranks. Usually, however, a missing value should be matched by a missing rank, and this is what is done by egen, rank(). If you are curious to see how the reassignment for ties is done, have a look inside the code with your favorite text editor (Stata’s own doedit would be fine). Typing
. which _grank
will tell you where the code is on your system. (The prefix _g is generic to all egen functions.)
egen’s rank() function allows separate calculations of ranks for each of several groups defined by a classifying variable:
. bysort foreign: egen rank = rank(mpg)
For more on bysort or, more generally, on by, see the online manual entry or the tutorial in Cox (2002).
Yet another nuance is being able to rank in reverse. Suppose we preferred to assign the highest (and best) value of mpg rank 1. To do this, we exploit the fact that egen, rank() feeds on an expression (denoted exp in the egen syntax diagram), which can be more complicated than a variable name, so we can work with (in particular) negated values. Thus,
. egen rank = rank(-mpg)
would reverse the ranks for us. Finally, before leaving ranks as such, note that egen, rank() also has field, track, and unique options. The name track was suggested by track events such as running in which not only does the lowest time win but two values that tie for first would also both be ranked first equal. (In sports, no one talks about rank 1.5.) Similarly, the name field was suggested by field events such as jumping or throwing in which the greatest distance or height wins, and there is a corresponding rule for ties. unique arbitrarily assigns unique ranks to each of several tied values and is of less concern here.
Coming now to the heart of the matter, consider a sample of seven values, here ordered for convenience:
0 0.57722 1 1.61803 2.71828 3.14159 10
Evidently, the median, the middle of the ordered values, is 1.61803 and should be assigned a percentile rank of 50%, or a plotting position of 0.5, as the value halfway through the distribution. The ranks corresponding to these values are clearly, with the default Stata (and statistical) convention of lowest first, 1 2 3 4 5 6 7. Using i and n to denote rank and number of values, a fraction of i/n would produce plotting positions for this sample of 1/7 ... 7/7, but 4/7 for the middle value. Nor does such a rule treat the tails symmetrically. Similarly, a rule of (i − 1)/n would produce plotting positions 0/7 ... 6/7, but 3/7 for the middle value, and does not treat the tails symmetrically either. From this, one obvious compromise is to split the difference at (i − 0.5)/n. This rule was used by the British scientist Francis Galton (1822–1911), famous for many contributions to statistics, most notably his work on correlation and regression. In 1914 it was introduced into statistical hydrology by Allen Hazen (1869–1930), an American civil engineer in private practice, otherwise best known for his work on urban water supplies.
There is a little literature on choice of plotting positions, focusing variously on their use for estimation of parameters or on testing of hypotheses given specific distributions. Many of the positions commonly recommended are members of a family (i − a)/(n − 2a + 1). Thus, a = 0.5 yields Hazen’s rule; a = 0.375 is a rule suggested by Gunnar Blom (1920–2003), especially for normal probability plots; a = 0 is a rule advocated by Waloddi Weibull (1887–1979) and Emil J. Gumbel (1891–1966). All of these rules yield 0.5 for the single middle value when the sample size is odd. Hazen’s rule is wired into the official Stata command quantile, whereas the Weibull–Gumbel rule is wired into the official Stata commands pnorm, qnorm, pchi, and qchi. For more discussion, see Barnett (1975), Cunnane (1978), or Harter (1984).
Another rule, which for example is often used in spreadsheet software, is a = 1, yielding (i − 1)/(n − 1). This also yields 0.5 for the single middle value when sample size is odd and treats the tails symmetrically. However, the results of 0 and 1 for sample extremes may not be universally suitable. In particular, for probability plots for the normal and other distributions on the whole of the real line quantile functions (inverse distribution functions) do not have finite values for arguments of 0 and 1; thus, sample extremes are not plottable with this rule. This rule in practice is often used with the track rule for ties.
The choice of precise rule, from the possibilities mentioned or from any others that may appeal, is up to the user. All that remains is to calculate the sample size, n. In the simplest circumstance, this is just the built-in _N. To take proper account of missing values, and to be able to work with multiple groups, it is better to use egen, count() to count the numbers of (nonmissing) values.
Suppose that we want Hazen plotting positions for varname:
. egen n = count(varname) . egen i = rank(varname) . gen hazen = (i - 0.5) / n
Or, we want Weibull plotting positions, but separately by byvar:
. by byvar, sort: egen n = count(varname) . by byvar: egen i = rank(varname) . gen weibull = i / (n + 1)
Or, we want spreadsheet-style percent rank:
. by byvar, sort: egen n = count(varname) . by byvar: egen i = rank(varname), track . gen pcrank = (i - 1) / (n - 1)
All of these examples return results between 0 and 1; for percentages, just multiply by 100.
In sum, despite a mass of small details to be thought about, these calculations boil down to the application of two egen functions.