
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Missed opportunities for Stata I/O


From   David Kantor <kantor.d@att.net>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Missed opportunities for Stata I/O
Date   Mon, 09 Sep 2013 15:55:07 -0400

At 06:18 PM 9/8/2013, Daniel Feenberg wrote, among many other interesting things:
> I should note that the -in- qualifier isn't as good as it could be. That is:
>
>   use med2009 in 1/100
>
> doesn't stop reading at record 100. Instead it seems to read all 143 million records, but then drops the records past 100.
I have noticed this problem myself when loading large files, though 
not quite that large. I understand that the reason it reads the 
entire file is that the file format puts the value labels at the end. 
The file format has several segments, of which the data is the 
second-to-last; the final segment holds the value labels. (See -help 
dta-.) So to properly load a file, the -use- routine must read 
through the entire file. I think that that was a poor choice. (Stata 
Corp, please pay attention.) It would have been preferable to place 
the data as the final segment, so that all the ancillary information 
could be read before the data, and the command...
        use med2009 in 1/100
would be able to quit reading after the 100th record; it should take negligible time.
Alternatively, without changing the file format, it may be possible 
to calculate where the value labels are located and skip directly to 
that location; whether this is possible may depend on the operating 
system. (The recent trend has been to view a file as a stream. This 
has some advantages, but has cast aside features such as the ability 
to read a specified location directly.)
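
To make the "calculate and skip" idea concrete, here is a rough sketch in Python. It does not use the real .dta layout (see -help dta- for that); it assumes a hypothetical toy file consisting of a small header (record count and record width), fixed-width records, and a trailer standing in for the value labels. The names write_toy_file and read_first are invented for illustration. The point is only that, when records are fixed-width and the count is in the header, the trailer's offset can be computed and reached with a seek rather than by reading every record.

```python
import io
import struct

# Toy header (an assumption, not the .dta header): record count, record width in bytes.
HEADER = struct.Struct("<II")

def write_toy_file(buf, records, labels):
    """Write the header, then fixed-width records, then a labels trailer."""
    width = len(records[0]) if records else 0
    buf.write(HEADER.pack(len(records), width))
    for rec in records:
        buf.write(rec)
    buf.write(labels)

def read_first(buf, k):
    """Return the first k records plus the trailer, without reading the other records."""
    buf.seek(0)
    nobs, width = HEADER.unpack(buf.read(HEADER.size))
    k = min(k, nobs)
    wanted = [buf.read(width) for _ in range(k)]
    # The trailer's offset follows from the header alone, so jump straight to it.
    buf.seek(HEADER.size + nobs * width)
    labels = buf.read()
    return wanted, labels

# Build a 1000-record toy file in memory and fetch only the first 100 records.
records = [struct.pack("<id", i, i * 1.5) for i in range(1000)]
buf = io.BytesIO()
write_toy_file(buf, records, b"1=yes;0=no")
first, labels = read_first(buf, 100)
```

Whether the same trick works on an actual .dta file depends on the format details and on whether the file can be opened for random access rather than as a stream.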
Note that the assessment that -use- "then drops the records past 100" 
may be a bit off-the-mark. I believe that it stores only the first 
100; the rest are read but ignored. Also, Daniel's remark is not so 
much about the -in- qualifier in general, but about the -in- 
qualifier in the -use- command. In all other contexts -- when 
addressing data already in memory -- it is very effective.
As long as this problem persists, and if you frequently need that 
initial segment (say, for testing of code), then, at the risk of 
telling you what you already know, the thing to do is to run that 
command once and save the results in a separate file with a distinct 
name (e.g., med2009_short).
HTH
--David


