Here is a revised version of my file chunking program, chunky.ado:
*----------------- Begin listing ----------------*
program define chunky, rclass
version 8.0
*! version 1.0.0 2008.04.26
*! version 1.0.1 2009.01.20
*!
*! by David C. Elliott
*! Text file chunking algorithm
*!
*! syntax:
*! using filename
*! index() is starting line in file to be read
*! chunk() is the number of lines to be read
*! saving(filename [, replace]) is the file name the chunk is
*! saved to; defaults to chunk.txt
*! list displays a line-by-line listing of the file on screen;
*! useful for showing the first line or for debugging
*! returns r(index) as the index of the last line read+1
*!
*! note - this works on text files only
syntax using [, Index(numlist max=1 >0 integer) ///
    Chunk(numlist max=1 >0 integer) Saving(string) List]
local infile `using', read
if `"`saving'"' == "" {
    local savefile using chunk.txt, write replace
}
else {
    local 0 `saving'
    syntax [anything(name=savefile id="file to save")] [, REPLACE]
    local savefile using `savefile', write `replace'
}
tempname in out
file open `in' `infile'
file open `out' `savefile'
if "`index'" == "" {
    local index 1
}
if "`chunk'" == "" {
    local chunk 5
}
if "`list'" == "list" {   // empty prefix: the di lines below execute
    local list
}
else {                    // "*" prefix comments out the di lines
    local list *
}
local end = `index' + `chunk'
local i 0
while `i++' < `index' {   // Move pointer to index line
    file read `in' line
    if r(eof) != 0 {
        di _n "{err:Index `index' is past end of file}" ///
            _n "{err:Last line attempted was `i'}" _n
        return scalar eof = 1
        exit
    }
}
while r(eof) == 0 & `index' < `end' {
    file write `out' `"`macval(line)'"' _n
    `list' di in ye `index' `" `macval(line)'"'
    local ++index
    file read `in' line
}
file close `in'
file close `out'
return scalar index = `index'
return scalar eof = 0
end
*----------------- End listing ----------------*
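To illustrate the syntax, a minimal call (the file names here are only placeholders) would be:

. chunky using mybigfile.txt, index(1) chunk(1000) saving("first1000.txt", replace)

This writes lines 1-1000 of mybigfile.txt to first1000.txt and returns r(eof)=0 and r(index)=1001, i.e. the line number to pass to index() on the next call.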
And here is a revised version of a do-file that uses it to chunk and
reassemble a very large file, keeping only certain variables:
*----------------- Begin listing ----------------*
// Do file using chunky.ado to piece together
// parts of a very large file
// Pay particular attention to the edit points marked with ****
// (the infile, chunksize, and keep settings)
**** edit VeryLargeFile.csv on the following line to your filename
local infile VeryLargeFile.csv
// edit to size of chunk you want
local chunksize 100000
// Get just the first line if it has variable names
chunky using `"`infile'"', index(1) chunk(1) ///
saving("varnames.csv",replace) list
local chunk 1
local nextrow 2
tempfile chunkfile chunkappend
while !`r(eof)' { // stop when end of file is reached
chunky using `"`infile'"', ///
index(`r(index)') chunk(`chunksize') saving("`chunkfile'", replace)
if `r(eof)' {
continue, break
}
else {
local nextrow `=`r(index)'+1'
}
// shell command to prepend the varnames to the chunk
// (Windows "copy" syntax; on Unix/Mac something like
// !cat varnames.csv "`chunkfile'" > "`chunkappend'" should work)
!copy varnames.csv+"`chunkfile'" "`chunkappend'"
**** edit the following to conform to your csv delimiter
insheet using "`chunkappend'", clear comma names
**** edit the following to keep specific variables
keep *
// save part of file and increment chunk count
save part`chunk++', replace
}
// Append parts together
local nparts `--chunk'
use part1, clear
forvalues i=2/`nparts' {
append using part`i'
**** uncomment the following line to erase part2.dta...part##.dta
// erase part`i'.dta
}
describe
// You will probably want to save part1.dta to a different name
// once all the parts are appended to it.
*----------------- End listing ----------------*
Typical output:
(88 vars, 100000 obs)
file part1.dta saved
(88 vars, 100000 obs)
file part2.dta saved
(88 vars, 100000 obs)
file part3.dta saved
(88 vars, 100000 obs)
file part4.dta saved
(88 vars, 100000 obs)
...
Contains data from part1.dta
obs: 1,244,282
vars: 28 21 Jan 2009 10:50
size: 562,519,688 (27.8% of memory free)
As you can see, truly large datasets can be processed in this manner,
entirely from within Stata.
As an aside - is this ado possibly useful enough to be worth writing
a help file for and submitting to SSC?
DC Elliott
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/