Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: how to group variables into equal number groups
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: how to group variables into equal number groups
Date: Tue, 26 Mar 2013 15:25:56 +0000
Thanks to Marcello for the mention, but I think at best that kind of
graph will illustrate the problem, not solve it.
However, the problem is, as I understand it, at root insoluble. There
is a longer discussion in
    Cox, N. J. 2012. Speaking Stata: Matrices as look-up tables.
    Stata Journal 12(4): 748--758 (SJ-12-4 pr0054; no commands),
which illustrates the use of matrices as look-up tables,
but the nub of the matter is a single word: ties!
Here is the example from my paper above. If you want to get the
executive summary now, my advice is
1. Don't use this lousy method. It entails discarding information.
2. If you ignore #1, it is possible that you might improve on -xtile-
by using a different criterion at bin boundaries.
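To see why ties are the crux before turning to real data, here is a minimal toy sketch. It is not from the paper: the data and the variable names x and g are made up for illustration. With ten observations but only three distinct values, no rule can form five equal-sized groups without putting identical values into different groups.

```stata
* hypothetical toy data, for illustration only
clear
set obs 10
* six 1s, three 2s, one 3
gen x = cond(_n <= 6, 1, cond(_n <= 9, 2, 3))
xtile g = x, nq(5)
* each distinct value of x falls in exactly one group, so sizes are unequal
tab x g
* check: within each value of x, the group is constant
bysort x (g): assert g[1] == g[_N]
```

The final -assert- should pass silently, because -xtile- assigns groups from the values alone and so never separates tied observations.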
First, we use a moderately large dataset as an example, so no one can
dismiss the phenomenon as characteristic of small datasets.
. webuse nlswork, clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
We use 10 groups.
. xtile q_age=age, nq(10)
We first show that -xtile- is using the same results as -_pctile-:
. _pctile age, nq(10)
. ret li
scalars:
                 r(r1) =  21
                 r(r2) =  23
                 r(r3) =  24
                 r(r4) =  26
                 r(r5) =  28
                 r(r6) =  31
                 r(r7) =  33
                 r(r8) =  36
                 r(r9) =  38
and put these in a matrix:
. matrix q = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9)
What did -xtile- do? This is a long way from equal frequencies! But
clearly if someone is (say) 24, they must be in the same group as
everybody else of the same age.
. tab q_age
         10 |
  quantiles |
     of age |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      4,122       14.46       14.46
          2 |      3,062       10.74       25.20
          3 |      1,636        5.74       30.94
          4 |      2,980       10.45       41.39
          5 |      2,567        9.00       50.39
          6 |      3,614       12.68       63.07
          7 |      2,357        8.27       71.34
          8 |      3,543       12.43       83.76
          9 |      1,824        6.40       90.16
         10 |      2,805        9.84      100.00
------------+-----------------------------------
      Total |     28,510      100.00
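One way to see what each group contains is to list the range of ages within it. This is only a sketch of a check I would find useful here, not something from the paper; it assumes q_age has just been created as above.

```stata
* age range and count within each -xtile- group
tabstat age, by(q_age) statistics(min max n)
```

The ranges of adjacent groups should not overlap: every woman of a given age is in the same group, which is exactly why the group sizes cannot be equalized.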
We can reproduce that using the results of -_pctile-.
. gen q_age2 = 10 if age < .
(24 missing values generated)
. quietly forval i = 9(-1)1 {
        replace q_age2 = `i' if age <= q[1, `i']
  }
. assert q_age == q_age2
No news is good news here.
I have _one_ suggestion here (apart from not using this lousy method).
Try a different criterion at the boundary: assign on < rather than <=
the cutpoint.
. gen q_age3 = 10 if age < .
(24 missing values generated)
. quietly forval i = 9(-1)1 {
        replace q_age3 = `i' if age < q[1, `i']
  }
We now have a different classification.
. tab q_age3
     q_age3 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,805        9.84        9.84
          2 |      2,775        9.73       19.57
          3 |      1,604        5.63       25.20
          4 |      3,202       11.23       36.43
          5 |      2,731        9.58       46.01
          6 |      3,662       12.84       58.85
          7 |      2,314        8.12       66.97
          8 |      3,677       12.90       79.87
          9 |      2,067        7.25       87.12
         10 |      3,673       12.88      100.00
------------+-----------------------------------
      Total |     28,510      100.00
But many people changed decile groups!
. count if q_age != q_age3
11647
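A cross-tabulation shows where those observations went. The sketch below assumes q_age and q_age3 exist as created above; the -assert- relies on the cutpoints being distinct, as they are in this example.

```stata
* rows: -xtile- groups (<= at cutpoints); columns: alternative groups (<)
tabulate q_age q_age3
* anyone whose age equals a cutpoint moves up exactly one group;
* everyone else stays put
assert inlist(q_age3 - q_age, 0, 1) if age < .
```

So the people who changed groups are precisely those whose age coincides with one of the nine cutpoints, which is the ties problem again.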
. qui tab q_age, matcell(freq)
. qui tab q_age3, matcell(freq3)
. gen freq = freq[_n,1]
(28524 missing values generated)
. gen freq3 = freq3[_n,1]
(28524 missing values generated)
. su freq*
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        freq |        10        2851    788.4865       1636       4122
       freq3 |        10        2851    716.4114       1604       3677
At best, we see that the second classification has groups of rather
more equal size, as measured by the SD of group frequency (about 716
rather than 788).
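If you would rather not go through matrices, the same comparison can be sketched with -contract-, which reduces the data to one observation per group. This is an alternative I am assuming is acceptable here, not the method used above; n_grp is a made-up variable name.

```stata
* SD of group sizes for each classification, via -contract-
foreach v in q_age q_age3 {
    preserve
    * one observation per group; n_grp (hypothetical name) holds its frequency
    contract `v', freq(n_grp) nomiss
    summarize n_grp
    restore
}
```

The -preserve- and -restore- pair keeps the original dataset intact, since -contract- replaces the data in memory.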
Here is the code in one:
webuse nlswork, clear

* decile groups from -xtile-, and the cutpoints it uses from -_pctile-
xtile q_age=age, nq(10)
_pctile age, nq(10)
ret li
matrix q = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9)
tab q_age

* reproduce -xtile-: assign on <= cutpoint
gen q_age2 = 10 if age < .
quietly forval i = 9(-1)1 {
    replace q_age2 = `i' if age <= q[1, `i']
}
assert q_age == q_age2

* alternative: assign on < cutpoint
gen q_age3 = 10 if age < .
quietly forval i = 9(-1)1 {
    replace q_age3 = `i' if age < q[1, `i']
}
tab q_age3
count if q_age != q_age3

* compare variability of group frequencies
qui tab q_age, matcell(freq)
qui tab q_age3, matcell(freq3)
gen freq = freq[_n,1]
gen freq3 = freq3[_n,1]
su freq*
On Tue, Mar 26, 2013 at 2:43 PM, Marcello Pagano
<[email protected]> wrote:
> Try
>
> findit eqprhistogram
>
> it will lead you to Nick Cox's plot of what you are looking for.
>
> m.p.
>
>
>
> On 3/26/2013 10:30 AM, Xixi Lin wrote:
>> I am trying to make independent variables into decile groups, and I
>> used xtile decile=x1 if Period==`z', nq(10); however, it turns out
>> that xtile does not make equal number of the 10 groups, is there any
>> way to force stata to divide them into equal number of obs or almost
>> equal number of obs?