Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Missed opportunities for Stata I/O

From	David Kantor <[email protected]>
To	[email protected]
Subject	Re: st: Missed opportunities for Stata I/O
Date	Mon, 09 Sep 2013 15:55:07 -0400

At 06:18 PM 9/8/2013, Daniel Feenberg wrote, among many other interesting things:

I should note that the -in- qualifier isn't as good as it could be. That is:

  use med2009 in 1/100
doesn't stop reading at record 100. Instead it seems to read all 143 million records, but then drops the records past 100.

I have noticed this problem myself when loading large files, though not quite that large. I understand that the reason it reads the entire file is that the file format puts the value labels at the end. The file format has several segments, of which the data is the second-to-last; the final segment holds the value labels. (See -help dta-.) So to properly load a file, the -use- routine must read through the entire file. I think that that was a poor choice. (StataCorp, please pay attention.) It would have been preferable to place the data as the final segment, so that all the ancillary information could be read before the data, and the command...

        use med2009 in 1/100

would be able to quit reading after the 100th record; it should take negligible time.

Alternatively, without changing the file format, it may be possible to calculate where the value labels are located and skip directly to that location; whether this is possible may depend on the operating system. (The recent trend has been to view a file as a stream. This has some advantages, but has cast aside features such as the ability to read a specified location directly.)

Note that the assessment that -use- "then drops the records past 100" may be a bit off the mark. I believe that it stores only the first 100; the rest are read but ignored. Also, Daniel's remark is not so much about the -in- qualifier in general, but about the -in- qualifier in the -use- command. In all other contexts -- when addressing data already in memory -- it is very effective.
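
To illustrate the in-memory case, here is a minimal sketch. It uses the auto dataset that ships with Stata, which is my choice of example and not something from Daniel's post. Once the data are loaded, -in- simply restricts the command to the named observations:

```stata
* Load a small example dataset that ships with Stata (an assumed stand-in)
sysuse auto, clear

* -in- restricts each command to a range of observations already in memory
list make price in 1/5
summarize price in 1/20
```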

As long as this problem persists, and if you frequently need that initial segment (say, for testing of code), then, at the risk of telling you what you already know, the thing to do is to run that command once and save the results in a separate file with a distinct name (e.g., med2009_short).
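
A minimal sketch of that one-time step, assuming med2009.dta is in the current directory; med2009_short is only the example name suggested above. I have written -use- in the -using- form shown in -help use-:

```stata
* One-time cost: per the discussion above, this pass still reads the whole file
use in 1/100 using med2009, clear

* Keep the first 100 records under a distinct name
save med2009_short, replace

* In later test runs, load the small file instead of the full one
use med2009_short, clear
```

After that, only the first command ever touches the large file, and it needs to be rerun only if med2009 itself changes.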


HTH
--David

Follow-Ups:
- Re: st: Missed opportunities for Stata I/O
  - From: Daniel Feenberg <[email protected]>

References:
- st: Missed opportunities for Stata I/O
  - From: Daniel Feenberg <[email protected]>
