Title: Calculate percentiles with survey data
Author: Nini Zang, StataCorp
With survey data, we can still use pctile or _pctile to obtain percentiles. Survey design characteristics other than pweights, such as strata and PSUs, affect only the variance estimation. The point estimate of a percentile can therefore be obtained with pctile or _pctile and pweights.
I will start with an example of how _pctile works with survey data.
    . sysuse auto
    (1978 Automobile Data)

    . rename mpg psu

    . rename length strata

    . keep price psu strata weight

    . keep in 1/4
    (70 observations deleted)

    . svyset psu [pweight=weight], strata(strata)

          pweight: weight
              VCE: linearized
      Single unit: missing
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>

    . _pctile price [pweight=weight], p(10)

    . return list

    scalars:
                     r(r1) =  3799
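The same call extends to several percentiles at once, and pctile stores the results in a variable rather than in r(). The following is a minimal sketch on the same four observations; the variable names q and pct are illustrative and not part of the original example.

```stata
* Sketch only: q and pct are hypothetical variable names
sysuse auto, clear
keep in 1/4

* Several percentiles in one call; results are returned in r(r1), r(r2), r(r3)
_pctile price [pweight=weight], percentiles(10 50 90)
return list

* pctile writes the quantiles into a new variable instead of r();
* nquantiles(4) requests the three quartiles, genp() records which
* percentage each stored value corresponds to
pctile double q = price [pweight=weight], nquantiles(4) genp(pct)
list q pct in 1/3
```

_pctile is convenient inside programs because it leaves results in r(), whereas pctile is convenient when the quantiles are needed as data.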
Recall that a percentile is the value of a variable below which a given percentage of observations fall, so the 10th percentile is the value below which 10% of the observations lie. Although the data have a survey structure (strata, PSUs, and pweights), the percentiles are affected only by the pweights. Let’s look at the formula that pctile and _pctile use in Stata.
Let $x_{(j)}$ refer to the x in ascending order for $j = 1, 2, \ldots, n$. Let $w_{(j)}$ refer to the corresponding weights of $x_{(j)}$; if there are no weights, $w_{(j)} = 1$. Let $N = \sum_{j=1}^{n} w_{(j)}$. To obtain the pth percentile, which we will denote as $x_{[p]}$, we need to find the first index $i$ such that $W_{(i)} > P$, where $P = N \cdot p/100$ and $W_{(i)} = \sum_{j=1}^{i} w_{(j)}$.
The pth percentile is then
$$
x_{[p]} =
\begin{cases}
\dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } W_{(i-1)} = P \\[1ex]
x_{(i)} & \text{otherwise}
\end{cases}
$$
From the formula, we can see that the calculation of a percentile depends only on the observed values and their weights.
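To see when the averaging branch of the formula applies, consider a small worked case that is an illustration only and not part of the original example: four observations with no weights, so every weight equals 1, and p = 50.

```latex
\begin{align*}
N &= \sum_{j=1}^{4} 1 = 4, \qquad P = N \cdot \frac{50}{100} = 2, \\
W_{(1)} &= 1, \quad W_{(2)} = 2, \quad W_{(3)} = 3 > P
  \;\Rightarrow\; i = 3, \\
W_{(i-1)} &= W_{(2)} = 2 = P
  \;\Rightarrow\; x_{[50]} = \frac{x_{(2)} + x_{(3)}}{2}.
\end{align*}
```

This is the familiar median of an even number of observations. With p = 10 instead, P = 0.4, the first index with a cumulative weight above P is i = 1, no cumulative weight equals P, and the formula returns the smallest observation.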
Let’s manually calculate the percentile obtained above with _pctile. We first sort the data:
    . sort price

    . list

         +-------------------------------+
         | price   psu   weight   strata |
         |-------------------------------|
      1. | 3,799    22    2,640      168 |
      2. | 4,099    22    2,930      186 |
      3. | 4,749    17    3,350      173 |
      4. | 4,816    20    3,250      196 |
         +-------------------------------+
Let
price(j) = the variable price in ascending order for j = 1, 2, 3, 4
weight(j) = the corresponding weights
price[10] = 10th percentile of price
We generate the variable w, the cumulative sum of weight:
    . generate w=sum(weight)

    . list

         +---------------------------------------+
         | price   psu   weight   strata       w |
         |---------------------------------------|
      1. | 3,799    22    2,640      168    2640 |
      2. | 4,099    22    2,930      186    5570 |
      3. | 4,749    17    3,350      173    8920 |
      4. | 4,816    20    3,250      196   12170 |
         +---------------------------------------+
Then $N = \sum_{j=1}^{4} \mathrm{weight}(j) = 2640 + 2930 + 3350 + 3250 = 12170$ and $P = N \cdot p/100 = (12170 \cdot 10)/100 = 1217$. To obtain the 10th percentile, we must find the first index i such that W(i) > 1217. When i = 1, W(1) = 2640, which is greater than 1217. Thus the 10th percentile price[10] is equal to price(1); that is, price[10] = 3799.
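The hand calculation can also be scripted and checked against _pctile. This is a minimal sketch of the rule described above, not StataCorp's implementation; the names W, idx, Ntot, Pcut, and xp are illustrative.

```stata
* Sketch only: W, idx, Ntot, Pcut, and xp are hypothetical names
sysuse auto, clear
keep in 1/4
local p 10                                  // percentile to compute

sort price
generate double W = sum(weight)             // cumulative weight W(i)
scalar Ntot = W[_N]                         // N: total of the weights
scalar Pcut = Ntot * `p' / 100              // P = N * p/100

* First index i such that W(i) > P
generate long idx = _n if W > scalar(Pcut)
summarize idx, meanonly
local i = r(min)

* Average with the previous value only when W(i-1) equals P exactly
if `i' > 1 & W[`i' - 1] == scalar(Pcut) {
    scalar xp = (price[`i' - 1] + price[`i']) / 2
}
else {
    scalar xp = price[`i']
}
display "manual percentile `p' of price = " scalar(xp)

* Compare with the built-in command
_pctile price [pweight=weight], p(`p')
assert r(r1) == scalar(xp)
```

Changing the local p to another value between 0 and 100 repeats the check for a different percentile.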
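We can also estimate percentiles, including the median, from survey data by using summarize with aweights. summarize does not accept pweights, but aweights give the same point estimates of the percentiles.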
    . sysuse auto, clear
    (1978 Automobile Data)

    . rename mpg psu

    . rename length strata

    . keep price psu strata weight

    . keep in 1/4
    (70 observations deleted)

    . svyset psu [pweight=weight], strata(strata)

          pweight: weight
              VCE: linearized
      Single unit: missing
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>

    . summarize price [aweight=weight], detail

                                Price
    -------------------------------------------------------------
          Percentiles      Smallest
     1%         3799           3799
     5%         3799           4099
    10%         3799           4749       Obs                   4
    25%         4099           4816       Sum of Wgt.       12170

    50%         4749                      Mean            4404.32
                            Largest       Std. Dev.      489.7492
    75%         4816           3799
    90%         4816           4099       Variance       239854.3
    95%         4816           4749       Skewness      -.3284718
    99%         4816           4816       Kurtosis       1.321737
From above, we can see that the median of price is equal to 4749. The 10th percentile of price is equal to 3799, which is the same result that we obtained with _pctile and pweights.
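The agreement between the two commands can be checked programmatically rather than by reading the output. The sketch below relies on the r(p10) and r(p50) results that summarize leaves behind when the detail option is specified; the scalar names s10 and s50 are illustrative.

```stata
* Sketch only: s10 and s50 are hypothetical scalar names
sysuse auto, clear
keep in 1/4

* Percentiles from summarize with aweights
quietly summarize price [aweight=weight], detail
scalar s10 = r(p10)
scalar s50 = r(p50)

* Percentiles from _pctile with pweights
_pctile price [pweight=weight], percentiles(10 50)
display "p10: summarize = " scalar(s10) ",  _pctile = " r(r1)
display "p50: summarize = " scalar(s50) ",  _pctile = " r(r2)

* Stops with an error if the two approaches disagree
assert scalar(s10) == r(r1) & scalar(s50) == r(r2)
```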