
Re: st: CSV read with limits


From   Steven Samuels <[email protected]>
To   [email protected]
Subject   Re: st: CSV read with limits
Date   Sat, 12 Mar 2011 19:43:20 -0500

--

If this has already been mentioned, I apologize.  If you have a *nix operating system, including Mac OS X, or if you install Cygwin on Windows, you can use -awk-.  I've tested it on files of about 10^6 lines.

***************************CODE BEGINS******************

/*
* Data: seven comma-separated fields in the two-line file source.txt:
* "1,2,3,4,55,6,77"
* "11,22,33,44,5,66,7"
*
* We want fields 3 and 6 from all lines after the first (here, line 2)
*/

// extract fields 3 and 6 from every line after the header
shell awk 'BEGIN {FS=","; OFS=","} NR>1 {print $3, $6}' source.txt > in.txt
type in.txt                           // display the extracted lines
insheet x3 x6 using in.txt, comma     // read them into Stata
list
***************************CODE ENDS***************************
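
If you also need to cap the number of observations, awk's NR (the current line number) handles that in the same pass.  A minimal, untested variation that keeps fields 3 and 6 from only the first 500 data lines after the header:

shell awk 'BEGIN {FS=","; OFS=","} NR>1 && NR<=501 {print $3, $6}' source.txt > in.txt
insheet x3 x6 using in.txt, comma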


Steve

Steven J. Samuels
Consultant in Statistics
18 Cantine's Island
Saugerties, NY 12477 USA
Voice: 845-246-0774
Fax:   206-202-4783 
[email protected]




On Mar 12, 2011, at 6:24 PM, Mike Lacy wrote:


Argyn Kuketayev <[email protected]> wrote:

>I have CSV file (comma separated). I need Stata to read the 1st line
>with variable names, then import only selected variables. Also I want
>to limit the number of observations to read.
>
>I can't figure out how to do it in Stata. In SAS it would be easy with
>a DATA step, a var list, and the OBS option.
>
>thanks
>
>--
>Argyn Kuketayev

I agree with Argyn that this is more difficult than it should be, as I've faced similar problems myself.
Considering that many large data sets are distributed as CSV, an option on -insheet- to read a limited number of lines and/or variables would be natural.  That being said, here's a way to solve Argyn's problem with available tools.

The user-written program -chunky- can break a CSV file into chunks while retaining the header on each one.  It can be used easily (if a bit inefficiently) to address the current problem, as follows:


// Make a large-ish CSV file to work with
clear
set obs 10000
forval i = 1/200 {
  gen x`i' = 100 * runiform()
}
outsheet using c:\temp\big.csv, comma names nolabel replace
//
//
// Get the user-written program -chunky- and use it to break up the file into chunks
ssc install chunky
// ********** Real work starts here
cd c:\temp  // need somewhere to put the chunks
// choose # of bytes in each chunk; larger is faster
local size = 10000000
chunky using c:\temp\big.csv, chunksize(`size') header(include) stub(piece) replace
insheet using "piece0001.txt", clear comma names  // chunky names files consecutively
keep x1 x100 x150 x200
keep in 1/500 // retain the desired lines
//
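// clean up the chunk files; chunky leaves their names in s(filelist)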
foreach f in `s(filelist)' {
  erase `f'
}

The preceding would be less clumsy if -chunky- had options to
1) allow reading of just one chunk, with size specified by line length;
2) use tempfiles to store the pieces.

These are not complaints, just some thoughts about useful options that I suspect are consistent with the way -chunky- works.
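
For what it's worth, the first of these can be approximated with Stata's built-in -file- commands.  A rough, untested sketch (the file name and the 500-line cutoff are only for illustration) that copies the header plus the first 500 data lines to a tempfile and reads that:

tempname fin fout
tempfile firstchunk
file open `fin' using "c:\temp\big.csv", read text
file open `fout' using "`firstchunk'", write text replace
local n = 0
file read `fin' line                  // first read grabs the header
while (r(eof)==0 & `n' <= 500) {
    file write `fout' `"`macval(line)'"' _n
    local ++n
    file read `fin' line
}
file close `fin'
file close `fout'
insheet using "`firstchunk'", clear comma names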

Regards,




=-=-=-=-=-=-=-=-=-=-=-=-=
Mike Lacy, Assoc. Prof.
Soc. Dept., Colo. State. Univ.
Fort Collins CO 80523 USA
(970)-491-6721   

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

