Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Missed opportunities for Stata I/O
From: David Kantor <[email protected]>
To: [email protected]
Subject: Re: st: Missed opportunities for Stata I/O
Date: Mon, 09 Sep 2013 15:55:07 -0400
At 06:18 PM 9/8/2013, Daniel Feenberg wrote, among many other
interesting things:
> I should note that the -in- qualifier isn't as good as it could be. That is:
>     use med2009 in 1/100
> doesn't stop reading at record 100. Instead it seems to read all 143
> million records, but then drops the records past 100.
I have noticed this problem myself when loading large files, though
not quite that large. I understand that the reason it reads the
entire file is that the file format puts the value labels at the end.
The file format has several segments, of which the data is the
second-to-last; the final segment holds the value labels. (See
-help dta-.) So to properly load a file, the -use- routine must read
through the entire file. I think that that was a poor choice. (Stata
Corp, please pay attention.) It would have been preferable to place
the data as the final segment, so that all the ancillary information
could be read before the data, and the command...
    use med2009 in 1/100
would be able to quit reading after the 100th record; it should take
negligible time.
Alternatively, without changing the file format, it may be possible
to calculate where the value labels are located and skip directly to
that location; whether this is possible may depend on the operating
system. (The recent trend has been to view a file as a stream. This
has some advantages, but has cast aside features such as the ability
to read a specified location directly.)
Note that the assessment that -use- "then drops the records past 100"
may be a bit off the mark. I believe that it stores only the first
100; the rest are read but ignored. Also, Daniel's remark is not so
much about the -in- qualifier in general as about the -in-
qualifier in the -use- command. In all other contexts -- when
addressing data already in memory -- it is very effective.
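To illustrate -in- applied to data already in memory, here is a minimal sketch. It uses the auto dataset that ships with Stata, which is an illustrative choice and not part of the thread. Each command touches only the listed observation range, and nothing is read from disk after the initial load.

```stata
* auto.dta ships with Stata; it stands in for any dataset already in memory
sysuse auto, clear

* -in- restricts each command to the given observation range
summarize price in 1/10
list make price in 1/5
```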
As long as this problem persists, and if you frequently need that
initial segment (say, for testing of code), then, at the risk of
telling you what you already know, the thing to do is to run that
command once and save the results in a separate file with a distinct
name (e.g., med2009_short).
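A minimal sketch of that workaround, assuming med2009.dta is in the current working directory. The name med2009_short is just the example suggested above, and the -clear- and -replace- options are additions so the lines can be rerun.

```stata
* One-time cost: per the discussion above, this still reads through the whole file.
* (The documented subset syntax is:  use in 1/100 using med2009, clear)
use med2009 in 1/100, clear

* Keep the 100-record extract under a distinct name
save med2009_short, replace

* From then on, test code against the small file
use med2009_short, clear
```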
HTH
--David