Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: large data sets (was st: A faster way to gsort)

From	"Joseph Coveney" <[email protected]>
To	<[email protected]>
Subject	Re: large data sets (was st: A faster way to gsort)
Date	Thu, 13 Mar 2014 12:54:37 +0900

Jeph Herrin wrote:

Joe claims that Stata seems to be more concerned about performance in 
large data sets these days, and I would like to comment (for those at 
StataCorp who are paying attention, and anyone else who might concur or 
advise) that for me this is at the very top of my wish list for Stata 14.

I am currently working on a series of projects where I have started to 
use SAS almost exclusively for most of the data management, because of 
the following Stata limitations:

1. Stata doesn't do well with large datasets. My datasets are on the 
order of 10gb; I have 24gb of RAM, but if I load a 6gb data set and do 
-merge- or -bysort- it quickly hits the 24gb limit and slows to a crawl. 
When the option is several hours vs 15 minutes in SAS, I have a strong 
incentive to use SAS.

2. Stata does not submit SQL well. I can submit the same query to the 
same database via SAS, SQL client software, and Stata, and while it 
works in the first two, about one in ten times Stata will simply hang 
and do nothing. Yesterday I waited 3 hours for Stata to return a short 
table (82 records) before pasting the same query into SAS and retrieving 
it in several minutes.

3. Stata does not do native SQL at all. That is, I would like to be able 
to use Stata files as tables in combination with queries to external SQL 
servers. SAS supports this, and this allows me to build an analytic 
dataset incrementally, something which is critical when using 'big data' 
- if I had to create the dataset all at once, it would be on the 
order of 300gb.

--------------------------------------------------------------------------------

I can't suggest anything for #2 other than taking a look at characteristics of
those queries that hang versus those that don't, and sending your findings in to
technical support at StataCorp.  I suspect that it has to do more with the ODBC
driver you're using than it does with Stata, so a satisfactory solution might
not come out of it.  (Here, I'm guessing that SAS and the database client
software don't use the ODBC driver, but rather an OLE DB provider.  If that's
not the case, then maybe there is some kind of kinky interaction of Stata and
the ODBC driver that's not shared by SAS or the client software.)

As for #1, wouldn't additional RAM be cheaper than a SAS license?  And if
you're maxed out on memory slots, wouldn't even a more powerful workstation be
cheaper than a SAS license?

I don't quite follow #3.  Aren't Stata's data management operations
incremental?  I find a series of Stata's data management commands much easier to
walk through than a single SQL statement stretching for pages.

As for Stata's doing SQL natively, there is a comment to a post on the Stata
Blog similarly calling for Stata to adopt SQL standard syntax.  I know that
Jeph's comment goes beyond that, almost as if to have an ODBC driver or
OLE DB provider for Stata dataset files.

I like SQL and use it daily, but I wouldn't want StataCorp to expend its finite
development resources in that direction.  I say this for a number of reasons.
Two examples:  the three-valued logic of NULLs and other peculiarities of SQL;
and considerations of when ad hoc SQL queries should be permitted and where
upstream data management operations should be manifest for reasons of
efficiency, security and regulatory compliance.
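
To illustrate the three-valued logic of NULLs mentioned above, here is a minimal
sketch in Python using the standard-library sqlite3 module.  SQLite is only a
convenient stand-in for a SQL engine, and the helper name eval_sql is
hypothetical; other database engines may differ in how they display an unknown
result.  In SQL, a comparison involving NULL evaluates to unknown (NULL) rather
than true or false.  Stata behaves differently: missing values compare as larger
than any nonmissing number, and . == . is true.

```python
import sqlite3


def eval_sql(expr):
    """Evaluate a scalar SQL expression in a throwaway in-memory SQLite database.

    Hypothetical helper for illustration only. SQLite returns 1 for true,
    0 for false, and NULL (Python None) for unknown.
    """
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(f"SELECT {expr}").fetchone()[0]
    finally:
        conn.close()


# NULL compared with NULL is unknown, not true.
null_eq_null = eval_sql("NULL = NULL")

# IS NULL is the two-valued test for missingness.
null_is_null = eval_sql("NULL IS NULL")

# Unknown combines with true/false according to three-valued logic.
null_or_true = eval_sql("NULL OR 1")
null_and_false = eval_sql("NULL AND 0")
null_and_true = eval_sql("NULL AND 1")

# NOT IN against a list containing NULL can never be true.
not_in_with_null = eval_sql("2 NOT IN (1, NULL)")

for name, value in [
    ("NULL = NULL", null_eq_null),
    ("NULL IS NULL", null_is_null),
    ("NULL OR 1", null_or_true),
    ("NULL AND 0", null_and_false),
    ("NULL AND 1", null_and_true),
    ("2 NOT IN (1, NULL)", not_in_with_null),
]:
    print(f"{name!s:<20} -> {value!r}")
```

One practical consequence is that a WHERE clause using NOT IN silently returns
no rows when the list contains a NULL, because the condition is never true.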

So, if there's a wish-list poll somewhere for Stata 14, put me down as against
SQL in favor of, say, -strunicode-, -menl-, -mcmc- or something along those
lines.

Joseph Coveney
