Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Joseph Coveney" <stajc2@gmail.com> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | Re: large data sets (was st: A faster way to gsort) |
Date | Thu, 13 Mar 2014 12:54:37 +0900 |
Jeph Herrin wrote:

Joe claims that Stata seems to be more concerned about performance in large data sets these days, and I would like to comment (for those at StataCorp who are paying attention, and anyone else who might concur or advise) that for me this is at the very top of my wish list for Stata 14. I am currently working on a series of projects where I have started to use SAS almost exclusively for most of the data management, because of the following Stata limitations:

1. Stata doesn't do well with large datasets. My datasets are on the order of 10gb; I have 24gb of RAM, but if I load a 6gb data set and do -merge- or -bysort- it quickly hits the 24gb limit and slows to a crawl. When the option is several hours vs 15 minutes in SAS, I have a strong incentive to use SAS.

2. Stata does not submit SQL well. I can submit the same query to the same database via SAS, SQL client software, and Stata, and while it works in the first two, about one in ten times Stata will simply hang and do nothing. Yesterday I waited 3 hours for Stata to return a short table (82 records) before pasting the same query into SAS and retrieving it in several minutes.

3. Stata does not do native SQL at all. That is, I would like to be able to use Stata files as tables in combination with queries to external SQL servers. SAS supports this, and this allows me to build an analytic dataset incrementally, something which is critical when using 'big data' - if I had to create the dataset all at once, it would be on the order of 300gb.

--------------------------------------------------------------------------------

I can't suggest anything for #2 other than taking a look at characteristics of those queries that hang versus those that don't, and sending your findings in to technical support at StataCorp. I suspect that it has to do more with the ODBC driver you're using than it does with Stata, so a satisfactory solution might not come out of it.
(Here, I'm guessing that SAS and the database client software don't use the ODBC driver, but rather an OLE DB provider. If that's not the case, then maybe there is some kind of kinky interaction of Stata and the ODBC driver that's not shared by SAS or the client software.)

As for #1, wouldn't additional RAM be cheaper than a SAS license? And if you're maxed out on memory slots, wouldn't even a more powerful workstation be cheaper than a SAS license?

I don't quite follow #3. Aren't Stata's data management operations incremental? I find a series of Stata's data management commands much easier to walk through than a single SQL statement stretching for pages.

As for Stata's doing SQL natively, there is a comment to a post on the Stata Blog similarly calling for Stata to adopt SQL standard syntax. I know that Jeph's comment goes beyond that, almost as if to have an ODBC driver or OLE DB provider for Stata dataset files. I like SQL and use it daily, but I wouldn't want StataCorp to expend its finite development resources in that direction. I say this for a number of reasons (for a couple of examples: the three-valued logic of NULLs and other peculiarities of SQL; considerations of when ad hoc SQL queries should be permitted and where upstream data management operations should be manifest for reasons of efficiency, security and regulatory compliance).

So, if there's a wish-list poll somewhere for Stata 14, put me down as against SQL and in favor of, say, -strunicode-, -menl-, -mcmc- or something along those lines.

Joseph Coveney
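
For readers unfamiliar with the "three-valued logic of NULLs" mentioned above, here is a minimal sketch of what it looks like in practice. It assumes Python's built-in sqlite3 module as a stand-in SQL engine (the post names no particular database), and the table name t and column x are hypothetical.

```python
import sqlite3


def null_logic_demo():
    """Return a few results showing SQL's three-valued (true/false/unknown) logic."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
    cur = conn.cursor()

    # Comparing NULL with anything, even NULL, yields unknown (NULL), not true.
    eq = cur.execute("SELECT NULL = NULL").fetchone()[0]

    # unknown OR true is true; unknown AND true stays unknown.
    or_true = cur.execute("SELECT NULL OR 1").fetchone()[0]
    and_true = cur.execute("SELECT NULL AND 1").fetchone()[0]

    # WHERE keeps only rows whose condition is true, so the NULL row is
    # dropped by "x <> 1" even though NULL is not equal to 1.
    cur.execute("CREATE TABLE t (x INTEGER)")  # hypothetical table
    cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])
    kept = [row[0] for row in cur.execute("SELECT x FROM t WHERE x <> 1")]

    conn.close()
    return eq, or_true, and_true, kept


if __name__ == "__main__":
    print(null_logic_demo())
```

By contrast, Stata treats a numeric missing value as larger than any number, so a condition such as x != 1 is true when x is missing. That difference is one reason the same filter can return different rows in the two systems.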