Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: large data sets (was st: A faster way to gsort)
From: "Joseph Coveney" <[email protected]>
To: <[email protected]>
Subject: Re: large data sets (was st: A faster way to gsort)
Date: Thu, 13 Mar 2014 12:54:37 +0900
Jeph Herrin wrote:
Joe claims that Stata seems to be more concerned about performance in
large data sets these days, and I would like to comment (for those at
StataCorp who are paying attention, and anyone else who might concur or
advise) that for me this is at the very top of my wish list for Stata 14.
I am currently working on a series of projects where I have started to
use SAS almost exclusively for most of the data management, because of
the following Stata limitations:
1. Stata doesn't do well with large datasets. My datasets are on the
order of 10gb; I have 24gb of RAM, but if I load a 6gb data set and do
-merge- or -bysort- it quickly hits the 24gb limit and slows to a crawl.
When the option is several hours vs 15 minutes in SAS, I have a strong
incentive to use SAS.
2. Stata does not submit SQL well. I can submit the same query to the
same database via SAS, SQL client software, and Stata, and while it
works in the first two, about one in ten times Stata will simply hang
and do nothing. Yesterday I waited 3 hours for Stata to return a short
table (82 records) before pasting the same query into SAS and retrieving
it in several minutes.
3. Stata does not do native SQL at all. That is, I would like to be able
to use Stata files as tables in combination with queries to external SQL
servers. SAS supports this, and this allows me to build an analytic
dataset incrementally, something which is critical when using 'big data'
- if I had to create the dataset all at once, it would be on the
order of 300gb.
--------------------------------------------------------------------------------
I can't suggest anything for #2 other than taking a look at characteristics of
those queries that hang versus those that don't, and sending your findings in to
technical support at StataCorp. I suspect that it has to do more with the ODBC
driver you're using than it does with Stata, so a satisfactory solution might
not come out of it. (Here, I'm guessing that SAS and the database client
software don't use the ODBC driver, but rather an OLE DB provider. If that's
not the case, then maybe there is some kind of kinky interaction of Stata and
the ODBC driver that's not shared by SAS or the client software.)
As for #1, wouldn't additional RAM be cheaper than a SAS license? And if
you're maxed-out on memory slots, wouldn't even a more powerful workstation be
cheaper than a SAS license?
I don't quite follow #3. Aren't Stata's data management operations
incremental? I find a series of Stata's data management commands much easier to
walk through than a single SQL statement stretching for pages.
As for Stata's doing SQL natively, a comment on a Stata Blog post similarly
calls for Stata to adopt standard SQL syntax. I know that Jeph's request goes
beyond that, amounting almost to an ODBC driver or OLE DB provider for Stata
dataset files.
I like SQL and use it daily, but I wouldn't want StataCorp to expend its finite
development resources in that direction. I say this for a number of reasons
(for a couple of examples: the three-valued logic of NULLs and other
peculiarities of SQL; considerations of when ad hoc SQL queries should be
permitted and where upstream data management operations should be manifest for
reasons of efficiency, security and regulatory compliance).
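To make the three-valued logic point concrete, here is a minimal sketch using
Python's built-in sqlite3 module. The table name (visits) and its values are
made up for illustration. It shows that a condition and its negation need not
cover all rows when a NULL is involved, because a comparison with NULL is
neither true nor false. Stata, by contrast, treats a numeric missing value as
larger than any nonmissing number, so the same comparison is always either
true or false there.

```python
import sqlite3

# Hypothetical in-memory table: one of the three rows has a NULL score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, 5), (2, None), (3, 9)],  # None is stored as SQL NULL
)


def count_where(condition):
    """Count the rows for which the WHERE condition evaluates to true."""
    query = f"SELECT COUNT(*) FROM visits WHERE {condition}"
    return conn.execute(query).fetchone()[0]


n_total = count_where("1 = 1")
n_gt = count_where("score > 5")            # true only for id 3
n_not_gt = count_where("NOT (score > 5)")  # true only for id 1
n_eq_null = count_where("score = NULL")    # never true, even for id 2
n_null = count_where("score IS NULL")      # the correct test for NULL

print(n_total, n_gt, n_not_gt, n_eq_null, n_null)
```

The row with the NULL score is counted by neither "score > 5" nor its
negation, so the two counts add up to one fewer than the number of rows.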
So, if there's a wish-list poll somewhere for Stata 14, put me down as against
SQL and in favor of, say, -strunicode-, -menl-, -mcmc- or something along those
lines.
Joseph Coveney