Hi all. I have an issue with large datasets, and am hoping for some
advice on how to best handle it. To simplify the issue somewhat, I have
two data-sets, PATENTS (about 3 million records) and CITATIONS (16
million records).
PATENTS includes the variables PATENT NUMBER, APPLICATION YEAR,
ASSIGNEE. CITATIONS includes just two variables CITED PATENT, CITING
PATENT.
Each patent in PATENTS was potentially cited by one or more other
patents. CITATIONS captures that information: if patents A and B both
cited patent C, CITATIONS would include
CITED PATENT CITING PATENT
C A
C B
In the end, I want take about 300K records from PATENTS and combine them
with information about the patents that cited them. A desired record
would include:
ORIGINAL PATENT NUMBER, ORIGINAL APPLICATION YEAR, ORIGINAL ASSIGNEE,
CITING PATENT NUMBER, CITING PATENT APPLICATION YEAR, CITING PATENT
ASSIGNEE
In SQL, this is pretty easy. I join PATENTS with CITATIONS where
PATENTS=CITED PATENT and then join that back into PATENTS where CITING
PATENT=PATENT.
However, doing this in Stata is proving more challenging. The improved
merge command in version 8 helps a bit, but I'm still having to rename
variables repeatedly, save interim datasets, and sort large datasets in
different ways. Although the end result is well within my memory
limits, I keep running out of memory in the interim steps. J Weesie's
mmerge command at least automates some of the renaming/sorting, but
doesn't address the memory issues or the time involved in re-sorting the
records.
Much as I hate the thought, I'm playing with the idea of doing my data
manipulation in SAS because of the availability of PROC SQL.
Alternatively, I could translate the datasets into, say, Access and use
the new odbc command. However, if I made any interim datasets, I'd have
to translate those back and forth between Access and Stata to use them.
So, my questions for those with more experience & knowledge include.
1. Am I just missing something in using merge, or do these sound like
problems I should expect to encounter with datasets like this?
2. Is there anyway (now or soon) to use the odbc command on a Stata
dataset (i.e., any change of a odbc driver for Stata)? That would be
the dream answer.
3. Anyone see an obvious solution that I'm just missing?
Many thanks for any advice you can supply.
Glenn H.
Glenn Hoetker
Assistant Professor of Strategy
College of Business Administration
University of Illinois at Urbana-Champaign
217-265-4081
[email protected]
"Success is going from failure to failure without a loss of enthusiasm."
Sir Winston Churchill
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/