st: Missed opportunities for Stata I/O
From: Daniel Feenberg <[email protected]>
To: [email protected]
Date: Sun, 8 Sep 2013 18:18:59 -0400 (EDT)
While Stata/MP can make estimation very fast, I feel there are some
important missed opportunities in Stata I/O that may not be sexy,
but which Amdahl's law makes increasingly important.
In our work, datasets are often tens of gigabytes, and sometimes hundreds
of gigabytes when multiple years of Medicare data are combined.
First, the good news. The -use- statement takes variable lists and -if-
qualifiers, which can dramatically speed up input if only a fraction of
the data is needed. They also reduce core usage. The varlists
and -if- qualifiers provide an order of magnitude improvement in speed in
typical applications here.
Now the disappointments. The -append- statement doesn't take the -if-
qualifier, though it does take a varlist via -keep()-. The upshot of this
is that what could be a simple (hypothetical syntax):
forvalues year = 2001/2010 {
    append using med`year' if diagnosis=="ami", keep(varlist)
}
becomes instead
forvalues year = 2001/2010 {
    clear
    use id diagnosis if diagnosis=="ami" using med`year'
    save ami`year'
}
* start from the first year, then append the remaining years
use ami2001, clear
forvalues year = 2002/2010 {
    append using ami`year'
}
unless you have enough memory to hold the entire dataset, and the
patience to wait for it to load.
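One way to avoid leaving a set of intermediate files on disk is to accumulate the matching rows in a single temporary file. The following is only a minimal sketch: the toy data, the three-year range and the tempfile names are assumptions for illustration, not the real Medicare files. With real data the -using- names would be the yearly med files.

```stata
* Toy stand-ins for the yearly files (hypothetical data, three years only).
forvalues year = 2001/2003 {
    clear
    set obs 4
    generate long id = 10 * `year' + _n
    generate str3 diagnosis = cond(mod(_n, 2) == 1, "ami", "chf")
    tempfile med`year'
    save "`med`year''"
}

* Accumulate the AMI rows in one temporary file.
tempfile ami_all
clear
save "`ami_all'", emptyok
forvalues year = 2001/2003 {
    * read only the needed variables and rows from this year
    use id diagnosis if diagnosis == "ami" using "`med`year''", clear
    * add what has been collected so far, then store the running total
    append using "`ami_all'"
    save "`ami_all'", replace
}
```

The running file is rewritten on every pass, so this only pays off when the selected subset is small relative to the yearly files.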
-merge- statements are quite slow compared to -use-. Our fairly ordinary
Linux boxes can read 3.4 million rows per second of 10 floats. Merging
that with a single variable in the workspace runs at only a tenth that
speed. If only one variable is kept (-keepusing(varlist)-), or only a tiny
percentage of the using rows are kept (-keep(match)-), the speed can be partially
restored to about 1.2 million rows/second. It is possible that something
about the way data is stored internally makes this impossible to improve,
but it is unfortunate.
More than the time element, the limitations of the -merge- statement make
for complicated programming. Suppose there is in core a list of patients
with an AMI, and you wish to merge in the doctor visits of those patients
from the annual op (out-patient) files. You might hope to do this:
forvalues year = 2002/2010 {
    merge 1:m id using op`year', keep(match)
}
But that doesn't work because after the first merge, there are duplicate
ids in core (for multiple doctor visits in the first year). The best
workaround I can come up with is:
forvalues year = 2002/2010 {
    clear
    use ami
    merge 1:m id using op`year', keep(match)
    save ami`year', replace
}
* empty the workspace so the last year is not appended twice
clear
forvalues year = 2002/2010 {
    append using ami`year'
}
-append- allows multiple files to be concatenated, but as far as I can
tell -merge- doesn't allow them to be joined.
The -save- command is much more restricted than other I/O commands - no
varlist, -if- or -in- support. So dividing a file into subsets requires
rereading the file for each subset. For example instead of:
forvalues i = 1/50 {
    save state`i' if state==`i'
}
we have:
forvalues i = 1/50 {
    clear
    use if state==`i' using file
    save state`i'
}
which is needlessly slower and more complex.
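An alternative that reads the source file only once is to hold it in memory and write each subset between -preserve- and -restore-. This is a minimal sketch on toy data; the variable x, the three states and the toy_state file names are assumptions for illustration. It is not guaranteed to be faster: -preserve- makes its own copy of the data, which may itself go through disk, so it is worth timing both forms.

```stata
* Toy data standing in for the large file (hypothetical; three states, not fifty).
clear
set obs 9
generate byte state = mod(_n, 3) + 1
generate double x = _n

* Write each subset from memory, restoring the full data after each save.
forvalues i = 1/3 {
    preserve
    keep if state == `i'
    save toy_state`i', replace
    restore
}
```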
Surprisingly, the commands -infix-, -fdause- and -fdasave- allow -if-, -in-
and a varlist, while -insheet-, -outsheet- and -save- don't allow any of
those.
I should note that the -in- qualifier isn't as good as it could be. That
is:
use in 1/100 using med2009
doesn't stop reading at record 100. Instead it seems to read all 143
million records, but then drops the records past 100.
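This behaviour can be checked with -timer-. A minimal sketch, assuming a toy one-million-row temporary file stands in for med2009: if -in- stopped reading early, timer 1 would be far smaller than timer 2. On a file this small the difference may be lost in noise, so substitute a genuinely large file for a meaningful comparison.

```stata
* Toy file standing in for a large dataset (hypothetical data).
clear
set obs 1000000
generate double x = runiform()
tempfile big
save "`big'"

timer clear
* timer 1: read only the first 100 observations
timer on 1
use in 1/100 using "`big'", clear
timer off 1
* timer 2: read the whole file
timer on 2
use "`big'", clear
timer off 2
timer list
```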
I am aware that most users of Stata have only a few thousand observations,
and will not notice the wall-clock time differences I cite. However, I
believe they are worth addressing, both for the benefit of users with very
large datasets, and for all users who are tripped up when they fail to
remember which of the usually supported options are not supported by the
I/O command they need.
Daniel Feenberg
NBER