Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Missed opportunities for Stata I/O
From: Daniel Feenberg <[email protected]>
To: [email protected]
Subject: Re: st: Missed opportunities for Stata I/O
Date: Mon, 9 Sep 2013 20:07:42 -0400 (EDT)
On Mon, 9 Sep 2013, David Kantor wrote:
At 06:18 PM 9/8/2013, Daniel Feenberg wrote, among many other interesting
things:
I should note that the -in- qualifier isn't as good as it could be. That
is:
use med2009 in 1/100
doesn't stop reading at record 100. Instead it seems to read all 143
million records, but then drops the records past 100.
I have noticed this problem myself when loading large files, though not quite
that large. I understand that the reason it reads the entire file is that the
file format puts the value labels at the end. The file format has several
segments, of which the data is the second-to-last; the final segment holds
the value labels. (See -help dta-.) So to properly load a file, the -use-
routine must read through the entire file. I think that was a poor
choice. (StataCorp, please pay attention.) It would have been preferable to
place the data as the final segment, so that all the ancillary information
could be read before the data, and the command...
use med2009 in 1/100
would be able to quit reading after the 100th record; it should take
negligible time.
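A quick sketch to see the cost for yourself, using Stata's built-in -timer-
(any large .dta file will do in place of med2009):

    timer clear 1
    timer on 1
    use med2009 in 1/100, clear    // still scans the entire file
    timer off 1
    timer list 1                   // elapsed time tracks file size, not the 100 kept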
Alternatively, without changing the file format, it may be possible to
calculate where the value labels are located and skip directly to that
location; whether this is possible may depend on the operating system. (The
recent trend has been to view a file as a stream. This has some advantages,
but has cast aside features such as the ability to read a specified location
directly.)
We like to read compressed data from a pipe - so random access to the
using file would be a great disadvantage to us. Other users have used this
feature for encryption, and it has many other uses. I would rather see a
"nolabel" option that would suppress reading the labels. -Append- already
has such an option.
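One way to set up such a pipe on a Unix-like system (a sketch; the file
names are illustrative and the exact recipe varies by platform). A named
FIFO can only be read sequentially, which is exactly why random access
would hurt us:

    ! mkfifo med2009.pipe
    ! zcat med2009.dta.gz > med2009.pipe &
    use med2009.pipe, clear

And the -append- option I would like -use- to mirror:

    clear
    append using med2009, nolabel    // data come in without value-label definitions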
Note that the assessment that -use- "then drops the records past 100" may be
a bit off the mark. I believe that it stores only the first 100; the rest are
read but ignored. Also, Daniel's remark is not so much about the -in-
qualifier in general, but about the -in- qualifier in the -use- command. In
all other contexts -- when addressing data already in memory -- it is very
effective.
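For example, with data already loaded, -in- touches only the observations
named (using a built-in demonstration dataset here):

    sysuse auto, clear
    list make price in 1/5    // only these five observations are visited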
Yes, that was a thinko - no memory is used by the unused records.
As long as this problem persists, and if you frequently need that initial
segment (say, for testing of code), then, at the risk of telling you what you
already know, the thing to do is to run that command once and save the
results in a separate file with a distinct name (e.g., med2009_short).
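In code, the one-time setup is just:

    use med2009 in 1/100, clear    // pay the full read cost once
    save med2009_short, replace    // reuse this small file when testing code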
There is a workaround for every problem, of course. In our SAS
implementation of this system we maintain .01%, 5%, 20% and 100% subsets
to satisfy different levels of user patience, but it would be nice to
avoid that extra complication. In fact every comment in my posting was
about avoiding a complication.
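A Stata version of those tiered extracts might look like this (a sketch;
the 0.01% tier is omitted because the dot would need a sanitized file name):

    use med2009, clear
    set seed 20130909                   // reproducible draws
    foreach pct in 5 20 {
        preserve
        sample `pct'                    // keep a `pct'% random sample
        save med2009_`pct'pct, replace
        restore
    }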
I posted a slightly revised version of my comments as the latest entry
in my collection of pieces on working with large datasets. It is at
http://www.nber.org/stata/efficient
It now includes a link to David's insightful explanation, from a long-ago
Statalist thread, of why -merge- takes so much memory.
Daniel Feenberg
NBER
HTH
--David
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
*