Chih-Mao Hsieh
>
> I have a two-column file with variables "citing" and
> "cited". "Citing" refers to a patent, and "cited" refers
> to a patent that is "cited" by the "citing" patent.
> Therefore, if a patent cites and therefore "recombines" 3
> patents prior to it, this history shows up as 3 rows (end
> of message has examples).
>
> I need a program to catch the number of times that the
> exact same set of patents has been "recombined" in the past
> (i.e. imagine trying to find all the papers that cite the
> same set of references that you do in one of your papers!).
>
> The basic solution I have come up with is the following:
>
> collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)
> bysort mean sum sd: gen byte counter = _n
> replace counter=counter-1
>
> It seems to work, and as the datafile has 16 million rows,
> with 3 million unique "citing" numbers -- therefore with a
> fair amount of variance -- I believe it may be good enough.
> My questions are: (1) Is there a more accurate way, if
> less efficient, to do what I need? (2) Is there any reason
> I should expect Stata to calculate means, sums, and sd's in
> different ways from row to row (i.e. rounding) that would
> render totally ineffective my specific use of -collapse-?
> I attach an example below.
>
> Thanks, --Chihmao
>
> ------------------------------------------
>
> citing cited
> 100 30
> 100 32
> 100 33
> 101 34
> 101 35
> 105 30
> 105 32
> 105 33
> 106 29
> 106 30
> 108 30
> 108 32
> 108 33
>
> Desired output:
>
> citing counter
> 100 0
> 101 0
> 105 1 (since #100 cited the exact same list
> of patents, no more, no less)
> 106 0
> 108 2 (since there are now 2 prior
> occurrences of same patent list: #100 and #105)
You are aware that this is a bit of a fudge: two different
sets of cited patents can share the same mean, sum and sd.
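
To see how matching on summary statistics can collide, here is a small sketch with made-up numbers (the citing ids 1 and 2 and the cited values are invented for illustration, not taken from the patent file). The two sets {1, 5, 6} and {2, 3, 7} differ, yet they have the same sum and the same sum of squares, so -collapse- should give them the same mean, sum and sd.

```stata
* hypothetical data: two citing patents with different cited sets
clear
input citing cited
1 1
1 5
1 6
2 2
2 3
2 7
end

* same summary statistics as in the original approach
collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)

* both sets: sum = 12, mean = 4, squared deviations from 4 total 14,
* so sd = sqrt(14/2) in each case even though the sets differ
list
```

With real patent numbers such collisions may be rare, but nothing in the three statistics rules them out.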
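I'd restructure the data like this:

* -cited- is numeric, so string() converts it before concatenation
* build, within each citing patent, a running list of its cited patents
gen allcited = ""
bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited, "%12.0f")
* keep the last row per citing patent, which holds the complete list
by citing : keep if _n == _N
* count earlier citing patents with an identical list
bysort allcited (citing) : gen counter = _n - 1
sort citing
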
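
As a check, the approach can be run on the example data from the question. This is only a sketch: it assumes -cited- is numeric (which the use of -collapse- in the question implies), so the values go through string() before being concatenated.

```stata
* example data from the question
clear
input citing cited
100 30
100 32
100 33
101 34
101 35
105 30
105 32
105 33
106 29
106 30
108 30
108 32
108 33
end

* running list of cited patents within each citing patent
gen allcited = ""
bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited, "%12.0f")

* the last row per citing patent holds the complete list
by citing : keep if _n == _N

* number of earlier citing patents with the identical list
bysort allcited (citing) : gen counter = _n - 1
sort citing
list citing counter, noobs
```

On these data the counters are 0, 0, 1, 0 and 2 for patents 100, 101, 105, 106 and 108, which matches the desired output in the question.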
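Now this depends on your not overflowing the length
limit of a string variable, so a patent that cites very
many others may not fit.

You could save some space by

egen cited2 = group(cited)

and then using -cited2-, which numbers the distinct cited
patents 1, 2, 3, ... and so usually needs fewer digits
than the patent numbers themselves.
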
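
A rough sketch of that space-saving variant, again on the example data from the question, with a check on how long the concatenated strings get. The variable name -len- is introduced here for illustration only, and -cited- is assumed to be numeric.

```stata
* example data from the question
clear
input citing cited
100 30
100 32
100 33
101 34
101 35
105 30
105 32
105 33
106 29
106 30
108 30
108 32
108 33
end

* map each distinct cited patent to a consecutive integer 1, 2, 3, ...
egen long cited2 = group(cited)

* build the list from the shorter codes instead of the patent numbers
gen allcited = ""
bysort citing (cited2) : replace allcited = allcited[_n-1] + " " + string(cited2, "%12.0f")
by citing : keep if _n == _N

* inspect the string lengths; the maximum shows how close you are to the limit
gen long len = length(allcited)
summarize len
```

Because group() assigns its codes in the sort order of -cited-, sorting on -cited2- within -citing- gives the same ordering as sorting on -cited-, so identical sets still produce identical strings.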
Nick