Hi,
we have a reshape-intensive application at work that we run every now
and then; since it takes about 15 hours to complete, I thought
optimizing it for speed would be a good idea. It turned out that the
real bottleneck is reshaping a series of datasets with more than 900
variables and between 100 and 10,000 rows each. I used a 500-row
dataset as an example, and this code...
qui reshape long v, i(id) j(WEEK)
...took about 3 minutes 2 seconds, while this code...
* Read the dataset value by value and write a CSV file
* that already has the long shape (this assumes the wide
* variables are id followed by v1, v2, ..., so WEEK runs
* from 1 to the number of v* variables)
file open out using temp.csv, write replace
qui describe
local nweeks = r(k) - 1          // number of v* variables
qui count
local nids = r(N)                // number of rows (one per id)
file write out "id,WEEK,v" _newline
forvalues n = 1/`nids' {
    local id = id[`n']
    forvalues week = 1/`nweeks' {
        local value = v`week'[`n']
        file write out "`id',`week',`value'" _newline
    }
}
file close out
insheet using temp.csv, comma clear
...took about 12 seconds. Both tests were run multiple times on the
following machine:
Stata/MP 10.0 for Windows 64-bit x86-64
Born 27 May 2008
Total physical memory: 67105388 KB
Available physical memory: 62904032 KB
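In case anyone wants to reproduce the comparison, a -timer- harness
along the following lines should work (just a sketch; the timer
number is arbitrary):

* Time the built-in reshape on the example dataset (sketch only)
timer clear 1
timer on 1
qui reshape long v, i(id) j(WEEK)
timer off 1
timer list 1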
Does anyone have any experience with this? Looking in the archive, I
saw that this problem has been around since at least 2004... I found
the suggestion of splitting the dataset into chunks (a sketch is
below), but I doubt it would be significantly faster than my naive
code above; still, given that my code is not optimized and uses the
hard disk as temporary storage, I would think there is room for
optimizing the original code of "reshape" drastically. Any comments
on this are welcome, as are optimization tricks of any kind, whether
or not they involve "reshape".
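For concreteness, here is roughly what I understand the chunking
suggestion to mean (just a sketch, untested on the real data; the
1,000-row block size is arbitrary, and I assume the wide data has one
row per id):

* Split the wide dataset into blocks of rows, reshape each
* block separately, and append the results (sketch only)
tempfile wide stacked
save "`wide'"
qui count
local N = r(N)
local step = 1000                 // illustrative block size
local first = 1
while `first' <= `N' {
    local last = min(`first' + `step' - 1, `N')
    use in `first'/`last' using "`wide'", clear
    qui reshape long v, i(id) j(WEEK)
    if `first' == 1 {
        save "`stacked'"
    }
    else {
        append using "`stacked'"
        save "`stacked'", replace
    }
    local first = `last' + 1
}
use "`stacked'", clear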
Thanks for your attention,
Mattia