I mentioned one simplification which eases
the problem, namely using -egen, group()- to map
the "cited" numbers to integers 1 up.
I was toying with an idea of mapping them to
successive primes and computing the product,
but Stata, not surprisingly, has no built-in
-prime()- function to generate successive primes.
Even in principle, that wouldn't be a solution
either, as the largest such product would, I
guess, be too big to hold exactly in any case.
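
To give a sense of scale, here is a minimal sketch of how quickly
such a product outgrows what a double can hold exactly, namely
integers up to 2^53. It is an illustration only: the first 15 primes
are typed in by hand, and the variable names are made up.

```stata
* Illustration only: running product of the first 15 primes.
clear
input double p
2
3
5
7
11
13
17
19
23
29
31
37
41
43
47
end
* -replace- works through the observations in order, so each
* product can build on the one just computed above it
gen double prod = p in 1
replace prod = prod[_n-1] * p in 2/l
* a double represents every integer exactly only up to 2^53
gen byte exact = prod <= 2^53
format prod %20.0f
list, noobs
```

The product stays within 2^53 for the first 13 primes only, so a
patent citing 35-40 others would be far out of range.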
Nick
[email protected]
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of
> Chih-Mao Hsieh
> Sent: 30 September 2003 16:23
> To: [email protected]
> Subject: st: RE: RE: Using -collapse- extensively to find
> historical, irregular matches: Better way?
>
>
> Nick, thanks for your response.
>
> I had been shying away from converting "cited" to strings
> because the numbers are in the millions, i.e. strings would
> be length 7. Many of the "citing" patents have more than
> 35-40 "cited" patents, and so the concatenation might
> surpass the string's length limit.
>
> Of course, the chances are not high that two patents would
> match each other over the first 35 patents, so your way
> does appear to be better.
>
> Cheers, --Chihmao
>
> -----Original Message-----
> From: [email protected] on behalf
> of Nick Cox
> Sent: Tue 9/30/2003 9:43 AM
> To: [email protected]
> Cc:
> Subject: st: RE: Using -collapse- extensively to find
> historical, irregular matches: Better way?
>
>
>
> Chih-Mao Hsieh
> >
> > I have a two-column file with variables "citing" and
> > "cited". "Citing" refers to a patent, and "cited" refers
> > to a patent that is "cited" by the "citing" patent.
> > Therefore, if a patent cites and therefore "recombines" 3
> > patents prior to it, this history shows up as 3 rows (end
> > of message has examples).
> >
> > I need a program to catch the number of times that the
> > exact same set of patents has been "recombined" in the past
> > (i.e. imagine trying to find all the papers that cite the
> > same set of references that you do in one of your papers!).
> >
> > The basic solution I have come up with is the following:
> >
> > collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)
> > bysort mean sum sd: gen byte counter = _n
> > replace counter=counter-1
> >
> > It seems to work, and as the datafile has 16 million rows,
> > with 3 million unique "citing" numbers -- therefore with a
> > fair amount of variance -- I believe it may be good enough.
> > My questions are: (1) Is there a more accurate way, if
> > less efficient, to do what I need? (2) Is there any reason
> > I should expect Stata to calculate means, sums, and sd's in
> > different ways from row to row (i.e. rounding) that would
> > render totally ineffective my specific use of -collapse-?
> > I attach an example below.
> >
> > Thanks, --Chihmao
> >
> > ------------------------------------------
> >
> > citing cited
> > 100 30
> > 100 32
> > 100 33
> > 101 34
> > 101 35
> > 105 30
> > 105 32
> > 105 33
> > 106 29
> > 106 30
> > 108 30
> > 108 32
> > 108 33
> >
> > Desired output:
> >
> > citing counter
> > 100 0
> > 101 0
> > 105 1 (since #100 cited the exact same list
> > of patents, no more, no less)
> > 106 0
> > 108 2 (since there are now 2 prior
> > occurrences of same patent list: #100 and #105)
>
> You are aware that this is a bit of a fudge.
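
A small made-up example of why it is a fudge: the sets {1, 5, 6} and
{2, 3, 7} have the same sum (12), the same mean (4) and the same sum
of squares (62), hence the same sd, so the -collapse- approach treats
them as a match. The citing and cited numbers here are invented for
illustration and are not from the patent data.

```stata
* Illustration only: two different cited lists that share
* the same mean, sum and sd.
clear
input long citing long cited
1 1
1 5
1 6
2 2
2 3
2 7
end
collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)
* (citing) fixes the order within each group of ties
bysort mean sum sd (citing): gen long counter = _n - 1
sort citing
list, noobs
```

Patent 2 gets counter 1 although it shares no cited patent with
patent 1.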
>
> I'd restructure the data like this:
>
> gen allcited = ""
> bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited)
> by citing : keep if _n == _N
> bysort allcited (citing) : gen counter = _n - 1
> sort citing
>
> Now this depends on your not overflowing the length
> limits of a string variable.
>
> You could save some space by
>
> egen cited2 = group(cited)
>
> and then using -cited2-.
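
Put together on the example data from the original question, the
steps might look like the sketch below. It assumes -cited- is numeric
(hence -string()-) and that the concatenated string stays within the
string length limit. The space separator keeps, say, "1 23" distinct
from "12 3".

```stata
* Sketch: concatenate the (recoded) cited list for each citing
* patent, then count earlier patents with an identical list.
clear
input long citing long cited
100 30
100 32
100 33
101 34
101 35
105 30
105 32
105 33
106 29
106 30
108 30
108 32
108 33
end
* map cited numbers to 1, 2, 3, ... to shorten the strings
egen long cited2 = group(cited)
gen str1 allcited = ""
* build the list in sorted order; the last observation of
* each citing patent ends up holding the complete list
bysort citing (cited2) : replace allcited = allcited[_n-1] + " " + string(cited2)
by citing : keep if _n == _N
* number of earlier citing patents with the same list
bysort allcited (citing) : gen long counter = _n - 1
sort citing
list citing counter, noobs
```

On these data the counters match the desired output in the original
question: 0 for 100, 101 and 106, 1 for 105, and 2 for 108.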
>
> Nick
> [email protected]
>
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/