I have comments on two levels.
First, on how to do this. As always, it is easiest for list members to
see code in terms of datasets everyone can use.
Your first bit seems rather indirect. I would use -centile- instead.
Individual percentiles are left behind in memory as r-class results by
-centile-. Thus you need not put them into a variable and then take them
out again, or create any variables you only need for one purpose.
. sysuse auto
. centile weight, centile(70)
. gen byte weight_group = weight > r(c_1) if weight < .
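If you want to check what -centile- has left behind, -return list- shows the r-class results currently in memory. A small sketch in do-file form, using the same dataset; I show no output here, so run it to see the values:

```stata
sysuse auto, clear
centile weight, centile(70)

* list everything -centile- saved in r()
return list

* the 70th percentile itself, as used in the -generate- statement
display r(c_1)
```

With more than one percentile requested, the results are numbered in the order asked for, so r(c_1) is the first and r(c_2) the second.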
Then you can proceed directly to something like
. egen mpg_group = xtile(mpg), by(weight_group) nq(3)
. egen both_group = group(mpg_group weight_group), label
Remember the request to explain where non-official commands you use come
from. Thus -egen, xtile()- is a user-written function (by Ulrich Kohler)
in the -egenmore- package on SSC.
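If -egenmore- is not already installed, it can be fetched from within Stata, assuming a working internet connection:

```stata
* install the package from SSC, then look at its help
ssc install egenmore
help egenmore
```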
Extending this to two percentiles:
. centile weight, centile(30 70)
. gen byte weight_group = cond(weight < r(c_1), 1, cond(weight < r(c_2), 2, 3)) if weight < .
and you can proceed as before
. egen mpg_group = xtile(mpg), by(weight_group) nq(3)
. egen both_group = group(mpg_group weight_group), label
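If the aim is instead, as in the original question, to cut the second variable at its own 30th and 70th percentiles within each group of the first, the same r-class results can be used inside a loop. This is only a minimal sketch in do-file form, assuming that is what is wanted; the variable names are just for illustration, and the weight groups are coded 1 and 2 here (not 0 and 1) so that the loop can run over them:

```stata
sysuse auto, clear

* two groups of weight, split at its 70th percentile
centile weight, centile(70)
gen byte weight_group = 1 + (weight > r(c_1)) if weight < .

* within each weight group, cut mpg at that group's 30th and 70th percentiles
gen byte mpg_group = .
forvalues g = 1/2 {
    centile mpg if weight_group == `g', centile(30 70)
    replace mpg_group = cond(mpg < r(c_1), 1, cond(mpg < r(c_2), 2, 3)) ///
        if weight_group == `g' & mpg < .
}

* six groups in all
egen both_group = group(weight_group mpg_group), label
tabulate both_group
```

Because the comparisons are strict, values exactly equal to a cutpoint go into the higher group; use <= if you prefer the other convention.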
Note that in the auto dataset there are not in fact any missing values
for -weight-, but excluding them explicitly is going to be the right
thing in most problems, and at worst does nothing. In fact, with two
variables, a double restriction
... if weight < . & mpg < .
is usually the right thing, and at worst it does no harm.
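One way to apply the double restriction consistently is to define an indicator once and reuse it in every command. A sketch in do-file form; the name touse is only a convention, not anything required:

```stata
sysuse auto, clear

* 1 if both variables are non-missing, 0 otherwise
gen byte touse = weight < . & mpg < .

* percentiles and groups computed from the same observations
centile weight if touse, centile(30 70)
gen byte weight_group = cond(weight < r(c_1), 1, cond(weight < r(c_2), 2, 3)) if touse
```

That way the percentiles and the grouping are guaranteed to be based on exactly the same observations.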
Second, on why you are doing this. It may be impertinent, but I am
curious. Under what circumstances must you do precisely this?
Categorisation by quantiles throws away data. Seemingly arbitrary
quantiles or numbers of quantiles do that capriciously. When is this the
right thing to do in any data analysis?
Nick
Rajesh Tharyan
==============
I have two variables x and y, which I have to put into 6 groups.
I am using the code below (code I) to first cut the x variable into 2
groups based on its 70th percentile value. And then, for each group of
the x variable I cut the y variable into 3 equal groups, and finally put
the two together to form the final six groups.
What I would like to do is cut the y variable for each group of x based
on the 30th and 70th percentile value. The code (Code II) below is my
present solution and it seems very complicated. Any suggestions are very
much appreciated. Is it possible to cut at specified percentiles?
Code I
*************start********************
* this bit cuts the x variable into two groups based on the 70th
* percentile value
pctile xu=x, nq(10) genp(xx)
replace xu=. if xx~=70
sort xu
* (Is this step necessary? I get slightly different numbers if I sort and
* when I do not sort: for example, for one group I get 481 with and 477
* without sorting.)
xtile xc = x, cutpoints(xu)
drop xx xu
* this bit cuts the y variable into three groups for each group of x
egen yc=xtile(y), by(xc) nq(3)
* forming the final 6 groups
gen gp=10*xc+yc
****************end*******************
Code II
************start*********
pctile xu=x, nq(10) genp(xx)
replace xu=. if xx~=70
sort xu
xtile xc = x, cutpoints(xu)
drop xx xu
pctile xmmu=y if xc==1, nq(10) genp(yy)
replace xmmu=. if yy~=30 & yy~=70
pctile xmmu1=y if xc==2, nq(10) genp(yy1)
replace xmmu1=. if yy1~=30 & yy1~=70
xtile yc=y if xc==1, cutpoints(xmmu)
xtile yc1=y if xc==2, cutpoints(xmmu1)
replace yc=yc1 if yc==. & xc==2
drop xmmu xmmu1 yc1 yy yy1
gen gp=10*xc+yc
***********end************