Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Missed opportunities for Stata I/O
From: Daniel Feenberg <[email protected]>
To: [email protected]
Subject: Re: st: Missed opportunities for Stata I/O
Date: Mon, 9 Sep 2013 20:07:42 -0400 (EDT)
On Mon, 9 Sep 2013, David Kantor wrote:
At 06:18 PM 9/8/2013, Daniel Feenberg wrote, among many other interesting
things:
I should note that the -in- qualifier isn't as good as it could be. That
is:
use med2009 in 1/100
doesn't stop reading at record 100. Instead it seems to read all 143
million records, but then drops the records past 100.
I have noticed this problem myself when loading large files, though not quite
that large. I understand that the reason it reads the entire file is that the
file format puts the value labels at the end. The file format has several
segments, of which the data is the second-to-last; the final segment holds
the value labels. (See -help dta-.) So to properly load a file, the -use-
routine must read through the entire file. I think that was a poor
choice. (StataCorp, please pay attention.) It would have been preferable to
place the data as the final segment, so that all the ancillary information
could be read before the data, and the command...
use med2009 in 1/100
would be able to quit reading after the 100th record; it should take
negligible time.
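A quick sketch to see the cost for yourself, using Stata's built-in -timer-
(any large .dta file will do in place of med2009):

    timer clear 1
    timer on 1
    use med2009 in 1/100, clear    // still scans the entire file
    timer off 1
    timer list 1                   // elapsed time tracks file size, not the 100 kept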
Alternatively, without changing the file format, it may be possible to
calculate where the value labels are located and skip directly to that
location; whether this is possible may depend on the operating system. (The
recent trend has been to view a file as a stream. This has some advantages,
but has cast aside features such as the ability to read a specified location
directly.)
We like to read compressed data from a pipe - so random access to the
using file would be a great disadvantage to us. Other users have used this
feature for encryption, and it has many other uses. I would rather see a
"nolabel" option that would suppress reading the labels. -Append- already
has such an option.
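One way to set up such a pipe on a Unix-like system (a sketch; the file
names are illustrative and the exact recipe varies by platform). A named
FIFO can only be read sequentially, which is exactly why random access
would hurt us:

    ! mkfifo med2009.pipe
    ! zcat med2009.dta.gz > med2009.pipe &
    use med2009.pipe, clear

And the -append- option I would like -use- to mirror:

    clear
    append using med2009, nolabel    // data come in without value-label definitions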
Note that the assessment that -use- "then drops the records past 100" may be
a bit off the mark. I believe that it stores only the first 100; the rest are
read but ignored. Also, Daniel's remark is not so much about the -in-
qualifier in general, but about the -in- qualifier in the -use- command. In
all other contexts -- when addressing data already in memory -- it is very
effective.
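For example, with data already loaded, -in- touches only the observations
named (using a built-in demonstration dataset here):

    sysuse auto, clear
    list make price in 1/5    // only these five observations are visited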
Yes, that was a thinko - no memory is used by the unused records.
As long as this problem persists, and if you frequently need that initial
segment (say, for testing of code), then, at the risk of telling you what you
already know, the thing to do is to run that command once and save the
results in a separate file with a distinct name (e.g., med2009_short).
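In code, the one-time setup is just:

    use med2009 in 1/100, clear    // pay the full read cost once
    save med2009_short, replace    // reuse this small file when testing code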
There is a workaround for every problem, of course. In our SAS
implementation of this system we maintain .01%, 5%, 20% and 100% subsets
to satisfy different levels of user patience, but it would be nice to
avoid that extra complication. In fact every comment in my posting was
about avoiding a complication.
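A Stata version of those tiered extracts might look like this (a sketch;
the 0.01% tier is omitted because the dot would need a sanitized file name):

    use med2009, clear
    set seed 20130909                   // reproducible draws
    foreach pct in 5 20 {
        preserve
        sample `pct'                    // keep a `pct'% random sample
        save med2009_`pct'pct, replace
        restore
    }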
I posted a slightly revised version of my comments as the latest entry
in my collection of pieces on working with large datasets. It is at
http://www.nber.org/stata/efficient
It now includes a link to David's insightful explanation, from a long-ago
Statalist thread, of why -merge- takes so much memory.
Daniel Feenberg
NBER
HTH
--David
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
*