Re: st: Cannot -file write- a line that was just successfully -file read-
From: "James Beard" <[email protected]>
To: [email protected]
Date: Thu, 01 Jul 2010 12:43:17 -0000
One possible cause of your problem is that
file write `out' `"`macval(line)'"' _n
will fail if the macro line contains the backquote character -`-
(despite what the documentation says about quoting strings).
Moving to Mata should get round this (and make your code faster).
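As a rough illustration of the Mata route, here is a minimal sketch of a line-by-line copy. The function name copy_lines() is hypothetical and is not part of -chunky-; it only assumes the standard Mata I/O functions fopen(), fget(), fput() and fclose(). Because each line is held in a Mata string rather than a local macro, it is never passed through macro expansion on its way to the output file.

```stata
mata:
// Hypothetical helper (not part of -chunky-): copy infile to outfile
// one line at a time and return the number of lines copied.
real scalar copy_lines(string scalar infile, string scalar outfile)
{
    real scalar   fin, fout, n
    string matrix line

    fin  = fopen(infile, "r")
    // Mode "w" aborts if outfile already exists; delete it first
    // (for example with unlink()) if replace behaviour is wanted.
    fout = fopen(outfile, "w")

    n = 0
    // fget() returns J(0,0,"") at end of file
    while ((line = fget(fin)) != J(0, 0, "")) {
        fput(fout, line)        // fput() appends the line terminator
        n++
    }

    fclose(fin)
    fclose(fout)
    return(n)
}
end
```

From Stata this could be called as, for example, -mata: copy_lines("in.txt", "out.txt")-, where both file names are placeholders. Chunking would need an extra line counter and a new output file every so many lines, which is left out here.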
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I have two questions to the list with the supporting documentation
following them:
(1) What are the possible reasons for being able to -file read- a
line that one cannot -file write-, and what additional processing of
the line macro might I do to avoid the error?
(2) I'd also like advice about whether it would be worthwhile writing
the file IO part of the routine below in mata for speed reasons, or
possibly to avoid problems with un-read/writable lines arising from
the presence of characters that cause problems in normal macro
processing.
Context:
I am undertaking a major rewrite of my large file chunking utility
-chunky- (-ssc describe chunky- for info) after a user encountered
problems with the routine halting at an unreadable/unwritable line in
a raw datafile he was chunking.
Approach:
The core of the chunking routine is a loop that -file read-s a
source file line by line and -file write-s each line to a destination
file. The current -chunky- routine has no error trapping in this
loop and would abort for no apparent reason. I have introduced
-capture-s into the read and write steps as follows:
=======code excerpt begin=======
forvalues r = 1/`lines' { // Move pointer to index line
    capture file read `in' line
    local rc = _rc // save the return code of -file read-
    if `r(eof)' == 1 { // end of file
        n di _n "{err:Terminating at end of file}"
        local eof 1
        continue, break
    }
    local ++index // increment infile line counter
    if `rc' != 0 { // the read itself failed
        n di _n "{err:chunky encountered unreadable data at {txt:file index: }{res: `index'}}" _n ///
            "{err:debug info: {txt:r(eof) = }{res:`r(eof)'} {txt:r(status) = }{res:`r(status)'}}"
    }
    capture file write `out' `"`macval(line)'"' _n
    if _rc != 0 { // the write failed
        n di _n "{err:chunky encountered unwritable line at file index }{res: `index'}"
    }
}
=======code excerpt end=======
Here are some lines from the user's output log, generated while
chunking a 10 GB raw data dump. (Note: this is from a rewritten version
of -chunky- called -chunky_new- for testing purposes, and the syntax
is new as well. The dots .. indicate successful completion of a
chunk.)
. chunky_new using "X:\directory obscured\really really big file.TXT", header(include) lines(500000) stub(chunk) replace
chunking ..
chunky encountered unwritable line at file index 688704
chunky encountered unwritable line at file index 770579
chunky encountered unwritable line at file index 998863
.
chunky encountered unwritable line at file index 1321586
...
What I find curious is that the offending line -file read-s OK, but
an error is captured at the -file write- step. I have thoroughly
read the -file- reference in both the online help and the manual. The
use of compound double quotes, `"`macval(line)'"', is the accepted
way of outputting exactly what was read into the local macro line.
-file read- saves two results, r(eof) and r(status), after every
operation (plus, of course, the ubiquitous return code _rc), and
I have debugging code to indicate whether one of the trappable errors
occurs, such as a too-long or unterminated line. Encountering an end
of file (r(eof)==1) simply breaks out of the loop.
Thank you.
(Incidentally, when this problem is solved, I will be replacing the
current version of chunky.ado that is on SSC with this one. Anyone
with large file chunking/splitting needs who wishes to beta test this
new routine should contact me off-list.
Improvements in the beta version include:
* Significant speed increase
* Much simpler setup through reconceptualizing the actions of the
routine
* Better syntax and error checking
* A peek(n) option to look at the first n lines of a file
* An analyze option to estimate the number of lines to read for
various chunk and Stata filesizes as well as check for potential
problems arising from extended ASCII characters)
--
David Elliott
Everything is theoretically impossible, until it is done.
Progress is made by lazy men looking for easier ways to do things.
-- Robert A. Heinlein (American science-fiction Writer, 1907-1988)