Steve Stillman wrote:
I have a recommendation that I wouldn't usually make. I have recently been
doing work with matched employer-employee data with over 30
million observations, so we have been running into the same problem as you. SAS
is much better for large dataset merges than Stata. In particular, proc
SQL is remarkably fast at doing these types of merges (likely because
SQL is written with this type of operation in mind).
--------------------------------------------------------------------------------
I agree completely with Steve on all of the points that he makes. But there
might be more economical alternatives to SAS that could get the job done in a
timely manner, too. As an indication of one possibility, I've created a small
dummy dataset of only four million observations of a three-character code
(string variable) for employer and another three-character code for employee.
I saved that, together with another small dataset of the first 50 unique
employers, to comma-separated values (CSV) files, imported them into a common
desktop relational database management system (RDBMS), indexed the employer
column in the first table, set the constraints on the tables, and used a SQL
statement issued from within Stata to perform the operation that Sascha wants
to do on his large dataset.
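Incidentally, the set-up steps can themselves be scripted from within Stata
via -odbc exec()-. The following is only a minimal sketch under the same pilot
set-up (tables A and B, a company column, a DSN named "Pilot Scale"); the
exact DDL accepted will vary with the RDBMS.

odbc exec("CREATE INDEX company_index ON A (company)"), dsn("Pilot Scale")
odbc exec("ALTER TABLE B ADD PRIMARY KEY (company)"), dsn("Pilot Scale")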
Obviously, my illustration is just a pilot-scale version--a mock-up--that is
less subject to disk-caching effects than a full-scale run would be;
nevertheless, with a 1.99-gigahertz nominal clock speed, only 512 megabytes of
RAM on a laptop, and a less-than-ideal operating system (Windows XP), the join
took less than seven seconds.
There are open-source RDBMSs available for a variety of operating systems, so
the cost at that end is probably manageable, even negligible. There is, of
course, the cost of learning to operate the RDBMS, but if multi-gigabyte
datasets are the norm, then that cost might be worth the investment, and it
would be no worse than the cost of learning SAS.
Probably because of longer-standing usage and familiarity, I do find Stata's
data-management style much more to my liking than that of an RDBMS using SQL,
so if it were I, then I'd probably do the storage (archiving) in the relational
database, use only limited SQL statements to get the pertinent subsets into
Stata at as early a step as possible in the analysis process, and do all of the
work-up in Stata.
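As a minimal sketch of that subset-first workflow (the table, column, and DSN
names follow the pilot example below, and the WHERE clause is only
illustrative), the retrieval step might look something like

odbc load, exec("SELECT company, employee FROM A WHERE company = 'ABC'") ///
    dsn("Pilot Scale") clear

with all of the subsequent work-up then done on the loaded subset in Stata.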
Joseph Coveney
-------------------------------------------------------------------------------
. clear
. set more off
. set rmsg on
r; t=0.00 21:35:24
. odbc load, ///
> exec("SELECT A.company AS A_company, A.employee FROM B INNER JOIN A ON B.company = A.company") ///
> dsn("Pilot Scale") dialog(complete)
r; t=6.62 21:35:31
. exit
end of do-file
r; t=0.00 21:35:31
--------------------------------------------------------------------------------
// Pilot-scale dataset-creating do-file
clear
set more off
set memory 80M
local seed = date("2004-08-19", "ymd")
set seed `seed'
set obs 4000000
// employer code: three random uppercase letters
generate str3 company = char(65 + floor(26 * uniform())) + ///
    char(65 + floor(26 * uniform())) + char(65 + floor(26 * uniform()))
// employee code: random digit, uppercase letter, digit
generate str3 employee = string(floor(10 * uniform())) + ///
    char(65 + floor(26 * uniform())) + string(floor(10 * uniform()))
outsheet using C:\Temp\A.csv, comma names
// keep only the first 50 companies to serve as the small lookup table (B)
keep in 1/50
keep company
sort company
// confirm that the 50 retained company codes happen to be unique
assert company != company[_n - 1]
outsheet using C:\Temp\B.csv, comma names
exit