[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

RE: st: selecting obs while reading in huge data set

From	"Hoetker, Glenn" <[email protected]>
To	<[email protected]>
Subject	RE: st: selecting obs while reading in huge data set
Date	Thu, 19 Aug 2004 10:11:06 -0500

of some amazing things and Jeroen Weesie's mmerge command simplifies
many things that I used to use SAS's SQL command for.  However, with
really large data sets, Stata still lags, sometimes unacceptably.  Using
the odbc command with mysql or a similar SQL database program is a nice
compromise (although I must say that SAS's SQL command is much simpler
than most things SAS does.  Also, their book "SAS Guide to the SQL
Procedure" is quite good and most of it is not SAS specific). 
       
       Glenn Hoetker
       Assistant Professor of Strategy
       College of Business
       University of Illinois at Urbana-Champaign
       217-265-4081
       [email protected]
       
       
       -----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Michael Ingre
Sent: Thursday, August 19, 2004 8:11 AM
To: [email protected]
Subject: Re: st: selecting obs while reading in huge data set
       
       On 2004-08-19, at 10.15, Steve Stillman wrote:
       
       > Sascha,
       > I have a recommendation that I wouldn't usually make.  I have
been
       > recently doing work with matched employer-employee data with
over 30
       > million obs, so we have been running into the same problem as
you.  SAS
       > is much better for large dataset merges than Stata.  In
particular,  
       > proc
       > SQL is remarkably fast at doing these types of merges (likely
because
       > SQL is written with this type of operation in mind).
       >
       > Well there is was, likely the last time I will recommend SAS
over  
       > Stata.
       >
       > Cheers,
       > Steve
       
       Steve and Sascha
       
       There is a solution that does not include SAS. Stata also support
SQL  
       databases. If you set up an ODBC connection to an SQL database
then the  
       -odbc load- command will allow you to load datasets directly from
the  
       SQL server. It is even possible to execute a SQL statement
directly  
       from Stata with -odbc exec("SqlStmt")- or -odbc
sqlfile("filename")-.  
       This way you could merge and load only the observations you are  
       interested in directly from Stata in one command.
       
       Before you can do this you need to download your data to an SQL
server.  
       If you don't have access to one, you could download one for FREE
here:  
       http://dev.mysql.com/downloads/mysql/4.0.html. After that you
need to  
       set up and ODBS driver manager, and an ODBC driver and download
your  
       data.
       
       This is a bit of work but if you plan to do it a lot, it should
be  
       worth it.
       
       I tried it a couple of months ago and it worked very well with
Stata,  
       however, labels are lost and there is only one code for missing
values.  
       Also, saving and storing data is a bit slower than from disc
files.
       
       If you have a Mac I would recommend the Complete MySQL package
that has  
       an easy set up and comes with drivers and additional software:  
       http://www.apple.com/downloads/macosx/unix_open_source/ 
       completemysql.html
       
       Michael
       
       
       *
       *   For searches and help try:
       *   http://www.stata.com/support/faqs/res/findit.html
       *   http://www.stata.com/support/statalist/faq
       *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Prev by Date: Re: st: RE: generating data sets with specific parameters
Next by Date: Re: st: question on GLLAMM
Previous by thread: Re: st: selecting obs while reading in huge data set
Next by thread: Re: st: selecting obs while reading in huge data set
Index(es):
- Date
- Thread