Re: st: Missed opportunities for Stata I/O
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: Missed opportunities for Stata I/O
Date: Tue, 10 Sep 2013 14:07:52 -0400
FWIW, we can keep an eye out for some new competition setting new
standards, e.g. Google BigQuery doing correlations now, perhaps more
coming one day: http://www.youtube.com/watch?v=tqS4vZ2Rxlo
I know, I know, it is very different to put your data on servers and
run SQL queries, and it is impossible for confidential data. But if we
are also talking about reasonable speed benchmarks that StataCorp
should have in its crosshairs in the long run, this is what is pushing
the production possibility frontier…
Maybe there are more opportunities to distribute work across servers,
clusters, or GPU cores — as well as to adopt new data architectures,
like MonetDB (it works with R!).
On Mon, Sep 9, 2013 at 8:07 PM, Daniel Feenberg <[email protected]> wrote:
>
> On Mon, 9 Sep 2013, David Kantor wrote:
>
>> At 06:18 PM 9/8/2013, Daniel Feenberg wrote, among many other interesting
>> things:
>>
>>> I should note that the -in- qualifier isn't as good as it could be. That
>>> is:
>>>
>>> use med2009 in 1/100
>>>
>>> doesn't stop reading at record 100. Instead it seems to read all 143
>>> million records, but then drops the records past 100.
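>>> A rough way to see this is to time the load; a minimal sketch with
>>> -timer-, using the same dataset name:
>>>
>>> timer clear 1
>>> timer on 1
>>> use med2009 in 1/100, clear
>>> timer off 1
>>> timer list 1   // elapsed time tracks the full file, not the 100 records kept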
>>
>>
>> I have noticed this problem myself when loading large files, though not
>> quite that large. I understand that the reason it reads the entire file is
>> that the file format puts the value labels at the end. The file format has
>> several segments, of which the data is the second-to-last; the final segment
>> holds the value labels. (See -help dta-.) So to properly load a file, the
>> -use- routine must read through the entire file. I think that that was a
>> poor choice. (Stata Corp, please pay attention.) It would have been
>> preferable to place the data as the final segment, so that all the ancillary
>> information could be read before the data, and the command...
>> use med2009 in 1/100
>> would be able to quit reading after the 100th record; it should take
>> negligible time.
>>
>> Alternatively, without changing the file format, it may be possible to
>> calculate where the value labels are located and skip directly to that
>> location; whether this is possible may depend on the operating system. (The
>> recent trend has been to view a file as a stream. This has some advantages,
>> but has cast aside features such as the ability to read a specified location
>> directly.)
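>> If one did go that route, Stata's own -file- command can position within a
>> binary file. A minimal sketch only, assuming the byte offset of the
>> value-label segment has already been worked out from the header laid out
>> in -help dta- (the local `labelstart' below is hypothetical):
>>
>> file open fh using med2009.dta, read binary
>> file seek fh `labelstart'    // jump straight to the value-label segment
>> file close fh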
>>
>
> We like to read compressed data from a pipe - so random access to the using
> file would be a great disadvantage to us. Other users have used this feature
> for encryption, and it has many other uses. I would rather see a "nolabel"
> option that would suppress reading the labels. -Append- already has such an
> option.
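> For reference, the existing -append- form of that option is simply (file
> name hypothetical):
>
> append using med2010, nolabel
>
> and the pipe workflow is roughly the following on a Unix system (a sketch
> only, with made-up file names). It works precisely because -use- reads its
> input sequentially, which is why random access would break it:
>
> ! mkfifo med2009.pipe
> ! zcat med2009.dta.gz > med2009.pipe &
> use med2009.pipe, clear
> ! rm med2009.pipe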
>
>
>
>> Note that the assessment that -use- "then drops the records past 100" may
>> be a bit off-the-mark. I believe that it stores only the first 100; the rest
>> are read but ignored. Also, Daniel's remark is not so much about the -in-
>> qualifier in general, but about the -in- qualifier in the -use- command. In
>> all other contexts -- when addressing data already in memory -- it is very
>> effective.
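>> For example, with the data already in memory,
>>
>> summarize in 1/100
>> list in 1/100
>>
>> both restrict to the first 100 observations essentially for free.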
>
>
> Yes, that was a thinko - no memory is used by the unused records.
>
>
>>
>> As long as this problem persists, and if you frequently need that initial
>> segment (say, for testing of code), then, at the risk of telling you what
>> you already know, the thing to do is to run that command once and save the
>> results in a separate file with a distinct name (e.g., med2009_short).
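>> In Stata terms, that one-time step is just
>>
>> use med2009 in 1/100, clear
>> save med2009_short, replace
>>
>> after which test runs can load med2009_short instead of the full file.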
>>
>
> There is a workaround for every problem, of course. In our SAS implementation
> of this system we maintain 0.01%, 5%, 20%, and 100% subsets to satisfy
> different levels of user patience, but it would be nice to avoid that extra
> complication. In fact, every comment in my posting was about avoiding a
> complication.
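> In Stata, such graded subsets could be built with -sample-; a sketch for
> the 5% file, with a hypothetical output name:
>
> use med2009, clear
> sample 5                      // keep a pseudo-random 5% of the observations
> save med2009_5pct, replace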
>
> I posted a slightly revised version of my comments as the latest entry
> in my collection of pieces on working with large datasets. It is at
>
> http://www.nber.org/stata/efficient
>
> It now includes a link to David's insightful explanation, from a long-ago
> Statalist thread, of why -merge- takes so much memory.
>
> Daniel Feenberg
> NBER
>
>
>
>> HTH
>> --David
>>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/