I'm writing a routine to infile about 20K scans of questionnaires
which are saved 1 questionnaire per file by the scanning service.
Obviously a bit of appending will be needed.
My question concerns how -preserve- works. When a dataset in
memory is -preserve-ed, is it just written to disk in a temporary
file, or is it tucked into a corner of Stata's memory?
Consider the following:
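/* Edit the following directory to the root of the scans */
set more off
capture file close scan
cd "c:\data\scans"
local firsttime 1
* Create a text file of all scan filenames. The list gets a non-.txt
* extension so that it does not show up in its own listing.
!dir *.txt /s/b >scans.lst
tempname scan
tempfile next    // one temporary file, reused on every pass
file open `scan' using scans.lst, read text
file read `scan' line
while !r(eof) {
    if `firsttime' {
        * the first scan simply becomes the dataset in memory
        infile using survey_dict, using(`"`macval(line)'"') clear
        save survey_2007, replace
        local firsttime 0
    }
    else {
        * set the accumulated data aside, read the new scan, then append it
        preserve
        infile using survey_dict, using(`"`macval(line)'"') clear
        save `next', replace
        restore
        append using `next'
    }
    n di `"read `macval(line)'"'
    file read `scan' line
}
file close `scan'
save survey_2007, replace    // write the accumulated dataset once at the end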
versus
/* Edit the following directory to the root of the scans */
set more off
capture file close scan
cd "c:\data\scans"
local firsttime 1
* Create a text file of all scan filenames. The list gets a non-.txt
* extension so that it does not show up in its own listing.
!dir *.txt /s/b >scans.lst
tempname scan
file open `scan' using scans.lst, read text
file read `scan' line
while !r(eof) {
    if `firsttime' {
        infile using survey_dict, using(`"`macval(line)'"') clear
        local firsttime 0
    }
    else {
        * the new scan is in memory; the accumulated file is appended to it
        infile using survey_dict, using(`"`macval(line)'"') clear
        append using survey_2007
    }
    * the whole accumulated dataset is rewritten on every pass
    save survey_2007, replace
    n di `"read `macval(line)'"'
    file read `scan' line
}
file close `scan'
In the first case, if -preserve- keeps the data in memory, then only
the file to be appended needs to be written to disk, and it is appended
following a -restore-. In the second case, the latest infile becomes
the dataset in memory, the steadily accumulating dataset is appended
to it, and the result is saved again, so the file I/O overhead grows
with every pass. I have used -preserve- and -restore- with large
(>200MB) files before, and my impression is that the first -preserve-
takes longer than subsequent ones. That suggests to me that either
Stata does something in memory with a -preserve-, or what I am seeing
is an effect of caching by the OS and/or the HDD controller.
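One way to put numbers on that impression would be to time repeated -preserve-/-restore- cycles. The following is only a minimal sketch: the synthetic dataset, its size, and the five repetitions are arbitrary assumptions, not my actual data. Timing alone cannot separate Stata's own behaviour from OS or disk caching.

```stata
* Sketch: time five -preserve-/-restore- cycles on a synthetic dataset.
* The number of observations and repetitions are arbitrary assumptions.
clear
set obs 1000000
generate double x = runiform()

timer clear
forvalues i = 1/5 {
    timer on `i'
    preserve
    restore
    timer off `i'
}
* timer 1 holds the first cycle, timers 2-5 the later ones
timer list
```

If the first timer is consistently larger than the rest, that matches what I have seen, but it would still be consistent with either explanation.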
People aren't usually infiling tens of thousands of files, but when
faced with this situation, taking some time to achieve efficiency can
pay off.
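For a rough sense of what is at stake in the second version, the sketch below counts how many file-sized units get saved. It rests on a simplifying assumption made purely for illustration: all 20,000 scans yield datasets of equal size, each counted as one unit.

```stata
* Assumption: n equal-sized scans, each counted as one unit of data
local n = 20000
* Second version: pass k saves k units, so the total is 1 + 2 + ... + n
local v2 = `n' * (`n' + 1) / 2
* First version: each pass saves only the new scan
* (what -preserve- itself costs is the open question)
local v1 = `n'
display "second version: " %15.0fc `v2' " units written"
display "first version:  " %15.0fc `v1' " units written"
```

The count ignores reads and whatever -preserve- does, so it is only an order-of-magnitude comparison.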
--
David Elliott