Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: insheet multi threading
From: Mike Lacy <[email protected]>
To: [email protected]
Subject: Re: st: insheet multi threading
Date: Tue, 03 May 2011 12:17:07 -0600
>Date: Mon, 2 May 2011 09:30:48 -0400
Argyn Kuketayev <[email protected]> wrote:
>I'm not talking about some obscure command either. it's a very basic
>task, and I'm sure everyone does it daily: read CSV files. it takes
>over 1 hour on 8-core machine to read 13GB file, because CPU load is
>12% all the time, one core is working.
>it's a junior programmer level assignment to parallelize the parsing
>part, that's why i'm surprised Stata didn't do it. it's frustrating
>because sometime i get CSVs during the day, and have to wait long long
>time before i can upload them into Stata. once in .dta format, all is
>fast: reading and writing. so, it's clearly parsing part that is slow
Here's an inelegant approach that might nevertheless work to
parallelize your job. Whether it works depends on something that
seems to be true on my dual-processor machine running Windows and a
single-processor version of Stata: a new instance of Stata, running
concurrently, appears to be allocated (by Windows XP, in my case) to a
different processor than the one running the first instance. This
claim is based on simultaneously running the same long job in two
different instances of Stata, and having the pair take much less than
twice as long as a single run.
If this is true, and perhaps generalizes to other operating systems
and machines with more processors, you could:
1) use -chunky- (-findit chunky-) to break up your CSV file into
multiple CSV files with the same original header with variable names.
2) Take the list of file names that -chunky- returns, and break it
into (say) 4 lists.
3) In the current instance of Stata, start -insheeting- the files on
the first list and saving them as *.dta files.
4) For each of the 3 remaining lists, start a new instance of Stata
to -insheet- each list.
5) Append all the *.dta files together.
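
For anyone who wants to prototype steps 1 and 2 outside Stata, here is a minimal Python sketch of the same idea. It is not -chunky- itself; the function names, the chunk file naming, and the round-robin split are my own assumptions for illustration. It also assumes that no quoted CSV field contains an embedded line break, since it splits on lines.

```python
import os


def split_csv_with_header(path, lines_per_chunk, out_dir):
    """Split a CSV into chunk files, each starting with the original header.

    Assumes one record per line (no line breaks inside quoted fields).
    Returns the list of chunk file names, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_names = []
    out = None
    # newline="" keeps the original line endings untouched on read and write.
    with open(path, "r", newline="") as src:
        header = src.readline()
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk and repeat the header at its top.
                if out is not None:
                    out.close()
                name = os.path.join(
                    out_dir, "chunk%04d.csv" % (len(chunk_names) + 1)
                )
                out = open(name, "w", newline="")
                out.write(header)
                chunk_names.append(name)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_names


def partition(names, n_lists):
    """Deal the file names round-robin into n_lists lists (step 2)."""
    return [names[i::n_lists] for i in range(n_lists)]
```

Each of the resulting lists would then be handed to one instance of Stata for steps 3 and 4, and step 5 would still be done inside Stata with -append-.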
This could be automated, I presume, by starting one or more other
instances of Stata as batch jobs on your machine from its command
line; you presumably could even call these other instances of Stata
from within Stata. I freely admit that this approach is clumsy, and
would involve a fair amount of extra I/O, but it might be quite a bit
faster, if you're right that parsing is the rate-determining aspect of the job.
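
As a rough illustration of the automation, here is a small Python sketch that writes one do-file per list of CSV files and then starts several batch jobs at once, waiting for all of them. The helper names are hypothetical, and the command used to start Stata in batch mode is left to the caller because it depends on the installation; only -insheet- and -save- are used inside the generated do-file.

```python
import subprocess


def write_worker_do(do_path, csv_files):
    """Write a do-file that -insheet-s each CSV and saves it as a .dta file."""
    lines = []
    for csv_file in csv_files:
        # chunk0001.csv -> chunk0001.dta (otherwise just append .dta)
        if csv_file.lower().endswith(".csv"):
            dta_file = csv_file[:-4] + ".dta"
        else:
            dta_file = csv_file + ".dta"
        lines.append('insheet using "%s", comma clear' % csv_file)
        lines.append('save "%s", replace' % dta_file)
    with open(do_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return do_path


def run_concurrently(commands):
    """Start every command without waiting, then collect the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

A call might look like run_concurrently([[stata_exe, "/e", "do", do_file] for do_file in do_files]), where stata_exe is a placeholder for the path to the Stata executable and "/e" is the Windows batch switch as I understand it (on Unix-like systems the equivalent is "-b"); check the switches against your own installation. The operating system then decides which processor each instance runs on.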
I think, but I'm not sure, that this does not violate the terms of
Stata's one-user-at-a-time licensing.
Regards,
=-=-=-=-=-=-=-=-=-=-=-=-=
Mike Lacy, Assoc. Prof.
Soc. Dept., Colo. State. Univ.
Fort Collins CO 80523 USA
(970)-491-6721