Steve Stillman wrote:
I have a recommendation that I wouldn't usually make. I have recently been
doing work with matched employer-employee data with over 30
million observations, so we have been running into the same problem as you. SAS
is much better for large dataset merges than Stata. In particular, proc
SQL is remarkably fast at doing these types of merges (likely because
SQL is written with this type of operation in mind).
--------------------------------------------------------------------------------
I agree completely with Steve on all of the points that he makes. But there
might be more economical alternatives to SAS that could get the job done in a
timely manner, too. As an indication of one possibility, I've created a small
dummy dataset of only four million observations of a three-character code
(string variable) for employer and another three-character code for employee.
I saved that, together with another small dataset of the first 50 unique
employers, to comma-separated values (CSV) files, imported them into a common
desktop relational database management system (RDBMS), indexed the employer
column in the first table, set the constraints on the tables, and used a SQL
statement issued from within Stata to perform the operation that Sascha wants
to do on his large dataset.
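Incidentally, the set-up steps can themselves be scripted from within Stata
via -odbc exec()-. The following is only a minimal sketch under the same pilot
set-up (tables A and B, a company column, a DSN named "Pilot Scale"); the
exact DDL accepted will vary with the RDBMS.

odbc exec("CREATE INDEX company_index ON A (company)"), dsn("Pilot Scale")
odbc exec("ALTER TABLE B ADD PRIMARY KEY (company)"), dsn("Pilot Scale")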
Obviously, my illustration is just a pilot-scale version--a mock-up--that is
less subject to disk-caching effects than a full-scale run would be;
nevertheless, with a 1.99-gigahertz nominal clock speed, only 512 megabytes of
RAM on a laptop, and a less-than-ideal operating system (Windows XP), the join
took less than seven seconds.
There are open-source RDBMSs available for a variety of operating systems, so
the cost at that end is probably manageable, even negligible. There is, of
course, the cost of learning to operate the RDBMS, but if multi-gigabyte
datasets are the norm, then that cost might be worth the investment, and it
would be no worse than the cost of learning SAS.
Probably because of longer-standing usage and familiarity, I do find Stata's
data-management style much more to my liking than that of an RDBMS using SQL,
so if it were I, then I'd probably do the storage (archiving) in the relational
database, use only limited SQL statements to get the pertinent subsets into
Stata at as early a step as possible in the analysis process, and do all of the
work-up in Stata.
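As a minimal sketch of that subset-first workflow (the table, column, and DSN
names follow the pilot example below, and the WHERE clause is only
illustrative), the retrieval step might look something like

odbc load, exec("SELECT company, employee FROM A WHERE company = 'ABC'") ///
    dsn("Pilot Scale") clear

with all of the subsequent work-up then done on the loaded subset in Stata.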
Joseph Coveney
-------------------------------------------------------------------------------
. clear
. set more off
. set rmsg on
r; t=0.00 21:35:24
. odbc load, ///
> exec("SELECT A.company AS A_company, A.employee FROM B INNER JOIN A ON B.company = A.company") ///
> dsn("Pilot Scale") dialog(complete)
r; t=6.62 21:35:31
. exit
end of do-file
r; t=0.00 21:35:31
--------------------------------------------------------------------------------
// Pilot-scale dataset-creating do-file
clear
set more off
set memory 80M
local seed = date("2004-08-19", "ymd")
set seed `seed'
set obs 4000000
// employer code: three random uppercase letters
generate str3 company = char(65 + floor(26 * uniform())) + ///
    char(65 + floor(26 * uniform())) + char(65 + floor(26 * uniform()))
// employee code: random digit, uppercase letter, digit
generate str3 employee = string(floor(10 * uniform())) + ///
    char(65 + floor(26 * uniform())) + string(floor(10 * uniform()))
outsheet using C:\Temp\A.csv, comma names
// keep only the first 50 companies to serve as the small lookup table (B)
keep in 1/50
keep company
sort company
// confirm that the 50 retained company codes happen to be unique
assert company != company[_n - 1]
outsheet using C:\Temp\B.csv, comma names
exit