Nick Cox
Thank you! As these datasets have millions of observations, any time-saving strategy will be important.
Best,
CM
-----Original Message-----
From: [email protected] on behalf of Nick Cox
Sent: Tue 9/30/2003 7:12 AM
To: [email protected]
Cc:
Subject: st: RE: RE: Short program to "collapse (# unique elements)": Use of nested loops and a "weights not allowed" message
Chih-Mao Hsieh
> > I have a
> > data file with three columns: citing, cited, nclass. For
> > every "citing", there are multiple "cited", and for each
> > "cited" there is a "nclass". The file is sorted by citing,
> > then nclass. I need a program to count the number of
> > unique "nclass" strings associated with each "citing".
> >
> > As a simple example, given the following data file "data.dta":
> >
> > citing cited nclass
> > 100 20 12
> > 100 22 15
> > 100 23 15
> > 101 32 14
> > 101 33 15
> > 101 34 15
> > 101 40 17
> >
> > I need the following output file:
> >
> > citing numpatclass
> > 100 2 [12 and 15 are unique, 15 is repeated]
> > 101 3 [14, 15, 17 are unique, 15 is repeated]
> Phil Ryan gave excellent advice explaining how
> this can be done, without loops, by using -by:-.
>
> In addition, note the FAQ
> How do I compute the number of distinct observations?
> http://www.stata.com/support/faqs/data/distinct.html
> which explains approaches using -by:-, similar in
> spirit to Phil's solution, and also gives manual
> references and references to user-written software
> in this area.
>
> Thus, a canned solution here is
>
> bysort citing : egen numpatclass = nvals(nclass)
> by citing : keep if _n== 1
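
Note that nvals() is not an official -egen- function; it comes from the user-written -egenmore- package on SSC (-ssc install egenmore-). For anyone who would rather stay with official Stata, here is a minimal sketch of the same idea using -egen-'s tag() and total() functions. The variable name -first- is only an illustrative choice, and the data are the small example from the original question.

```stata
* Sketch only: rebuild the example data from the original question
clear
input citing cited nclass
100 20 12
100 22 15
100 23 15
101 32 14
101 33 15
101 34 15
101 40 17
end

* tag() marks exactly one observation per distinct (citing, nclass) pair
* (-first- is a hypothetical helper variable name)
egen byte first = tag(citing nclass)

* summing the tags within each citing counts its distinct nclass values
egen numpatclass = total(first), by(citing)

* reduce to one observation per citing
bysort citing: keep if _n == 1
keep citing numpatclass
list, noobs
```

On the example data this should leave two observations, with numpatclass equal to 2 for citing 100 and 3 for citing 101. Unlike the -contract- route, the intermediate steps keep every observation until the final -keep-, which may matter for memory with millions of rows.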
Another approach is a double -contract-:

    contract citing nclass
    contract citing, freq(numpatclass)

After the first -contract-, the number
of observations for each value of -citing-
is the number of distinct values of -nclass-
observed for each;
so the second -contract- immediately yields
the desired count variable.
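
To make the two steps concrete, here is a sketch that runs the double -contract- on the example data from the original question and lists the intermediate result. The comments give the values I would expect by hand calculation; nothing here goes beyond the two commands above plus -input- and -list-.

```stata
* Sketch only: the example data from the original question
clear
input citing cited nclass
100 20 12
100 22 15
100 23 15
101 32 14
101 33 15
101 34 15
101 40 17
end

* Step 1: one observation per distinct (citing, nclass) pair;
* -contract- adds _freq, the number of original observations per pair
contract citing nclass
list, noobs sepby(citing)
* expected: (100, 12, 1) (100, 15, 2) (101, 14, 1) (101, 15, 2) (101, 17, 1)

* Step 2: count the remaining observations per citing,
* which is the number of distinct nclass values
contract citing, freq(numpatclass)
list, noobs
* expected: citing 100 -> 2, citing 101 -> 3
```

The freq(numpatclass) option in the second call also avoids a name clash with the _freq variable left behind by the first call.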
That this solution using -contract- makes
no use of -by:- or -_N- is pure illusion.
Look inside -contract- at the Stata code
-- -contract- is implemented as an .ado --
and you will see that it is based on
exactly the same machinery.
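
As a rough by-hand analogue of the double -contract-, written directly with -by:-, -_n- and -_N-: this is only a sketch of the idea, not a copy of what contract.ado actually contains, and it assumes -citing- and -nclass- are already in memory.

```stata
* Sketch only: a hand-written analogue, not the code inside contract.ado

* keep one observation per distinct (citing, nclass) pair
bysort citing nclass: keep if _n == 1

* the data are still sorted by citing nclass, so -by citing:- is allowed;
* _N is now the number of distinct nclass values for each citing
by citing: gen long numpatclass = _N

* reduce to one observation per citing
by citing: keep if _n == 1
keep citing numpatclass
```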
Nick
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/