Statalist



Re: st: how to open very large txt files (>6G)


From   "David Elliott" <[email protected]>
To   [email protected]
Subject   Re: st: how to open very large txt files (>6G)
Date   Wed, 1 Oct 2008 17:30:56 -0300

I have used -file- to read extremely large (~2 GB) files, 2 GB being
the file-size limit for many 32-bit programs on Windows.

Often what one wants is to read the header and determine the variable
names and types.  Once you have this information, it is possible, for
example, to import a huge file in chunks, keeping only the variables
that you need.  One can process arbitrarily large files in this
fashion.
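
A minimal sketch of the header-reading step, in case it is useful.  It
assumes a comma-delimited file whose first line holds the variable
names, with no commas embedded inside quoted names.  The demo file and
its contents are made up so that the snippet runs as is from a do-file.

```stata
* Build a tiny demo file (it stands in for the huge file)
tempfile demo
tempname fh
file open `fh' using `"`demo'"', write text replace
file write `fh' "id,age,income" _n "1,34,52000" _n "2,51,61000" _n
file close `fh'

* Read only the first line; the rest of the file is never loaded
file open `fh' using `"`demo'"', read text
file read `fh' header
file close `fh'

* Split the header on commas and list each candidate variable name
local k 0
tokenize `"`macval(header)'"', parse(",")
while `"`1'"' != "" {
    if `"`1'"' != "," {
        local ++k
        display as text "field `k': " as result `"`1'"'
    }
    macro shift
}
display as text "`k' fields found"
```

With the names in hand, one can decide which columns to keep before
importing each chunk.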

I've tried to work with a number of programs that purport to be able
to handle extremely large files by buffering only a part of it in
memory and have never been satisfied with their performance.

I created the following -chunky- to chunk a huge csv file I received
into digestible bits.  It might be useful to you.  Run without
options, it will return the first 5 lines of a file.  The further into
a file you need to start, the longer the routine takes to run, because
the lines have to be counted from the first line every time (this
wouldn't be the case with fixed line lengths, but that's another
story).  However, on a ~2 GB file with 173 variables, it took my
5-year-old computer about 90 seconds to get to the last line.  The
program returns r(index), which allows one to put the chunks in a loop
and sequentially chunk a file, starting each call at the index of the
next chunk.
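
A sketch of such a loop, assuming chunky.ado is installed on the
adopath.  The file names demo_big.txt and part1.txt, part2.txt, ... are
hypothetical and are written to the current directory.  The stopping
rule (a chunk shorter than requested, or no r(index) returned) is my
assumption about how to detect the end of the file.  Note that -chunky-
opens every saved chunk in the Viewer, so one may prefer to comment out
its -view- line before looping.

```stata
* Demo input: a 12-line text file standing in for the huge csv
tempname fh
file open `fh' using demo_big.txt, write text replace
forvalues r = 1/12 {
    file write `fh' "row `r',value `r'" _n
}
file close `fh'

local size  5        // lines per chunk
local start 1        // first line of the next chunk
local part  0        // number of chunk files written so far
while 1 {
    local ++part
    chunky using demo_big.txt, index(`start') chunk(`size') saving(part`part'.txt)
    local next = r(index)
    * A short chunk, or a missing r(index), means the file is exhausted
    if `next' >= . | `next' - `start' < `size' {
        continue, break
    }
    local start `next'
}
display as text "wrote `part' chunk files"
```

Only the first chunk contains the header line, so the variable names
have to be carried over to the later chunks by hand.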

I've been wondering if I would get a big speed increase with Mata,
since the loop through the lines would be compiled.  Also, Mata allows
direct recording of the byte position in the file with -ftell()- and
returning to it with -fseek()-, which is useful in sequentially
chunking a file.
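
An untimed sketch of that idea, to show the mechanics only.  The
function name chunk_at() and its arguments are hypothetical.  It uses
Mata's fopen(), fget(), fput(), ftell(), and fseek(), and returns the
byte offset at which the next chunk starts, so a later call can jump
straight there instead of counting lines from the top.  Run it from a
do-file.

```stata
mata:
// Copy up to chunk lines from infile to outfile, starting at byte
// offset pos (0 = start of file).  Returns the byte offset where the
// next chunk begins.
real scalar chunk_at(string scalar infile, string scalar outfile, real scalar pos, real scalar chunk)
{
    real scalar   fin, fout, n, next
    string matrix line

    unlink(outfile)                  // fopen(..., "w") requires a new file
    fin  = fopen(infile, "r")
    fout = fopen(outfile, "w")
    fseek(fin, pos, -1)              // -1: offset counted from start of file
    for (n = 0; n < chunk; n++) {
        line = fget(fin)
        if (rows(line) == 0) break   // fget() returns J(0,0,"") at end of file
        fput(fout, line)
    }
    next = ftell(fin)                // remember where we stopped
    fclose(fin)
    fclose(fout)
    return(next)
}
end

* Demo: a 5-line file, read as two consecutive chunks of 2 lines each
tempfile big p1 p2
tempname fh
file open `fh' using `"`big'"', write text replace
forvalues r = 1/5 {
    file write `fh' "line `r'" _n
}
file close `fh'
mata: pos = chunk_at(st_local("big"), st_local("p1"), 0, 2)
mata: pos = chunk_at(st_local("big"), st_local("p2"), pos, 2)
```

After the second call, pos holds the offset at which a third chunk
would start.  If a call returns the same offset it was given, nothing
was left to read.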

DCE

=======listing of chunky.ado========

program define chunky, rclass
version 8.0

*! version 1.0.0  2008.04.26
*!
*! by David C. Elliott
*! Text file chunking algorithm
*!
*! syntax:
*! using filename
*! index() is starting line in file to be read
*! chunk() is the number of lines to be read
*! saving() is file name of chunk to be saved, defaults to chunk.txt
*! list displays line by line listing of file to screen
*!
*! returns r(index) as the index of the last line read+1
*!
*! note - this works on text files only

syntax using [, Index(numlist max=1 >0 integer) ///
    Chunk(numlist max=1 >0 integer) Saving(string) List]

if `"`saving'"'=="" {
    local saving chunk.txt
    }
tempname in out
file open `in' `using', read
file open `out' using `"`saving'"', write replace

if "`index'"=="" {
    local index 1
    }
if "`chunk'"==""  {
    local chunk 5
    }
if "`list'" == "list" {
    local list
    }
    else {
        local list *
        }
local end = `index' + `chunk'
local i 0

while `i++'<`index' {  // Move pointer to index line
        file read `in' line
        if r(eof) != 0 {
            di _n "{err:Index `index' is past end of file}" ///
            _n "{err:Last line attempted was `i'}"
            file close `in'   // release both handles before leaving
            file close `out'
            exit
            }
}

while r(eof) == 0 & `index' < `end' {
    file write `out' `"`macval(line)'"' _n
    `list'    di in ye `index' `" `macval(line)'"'
    local ++index
    file read `in' line
    }

file close `in'
file close `out'

return scalar index = `index'
view `"`saving'"'

end

============end listing=============


