Dear StataFolks,
The IT team at the Duke Econ department is setting up a new computation
server, and they are asking me for advice on how to optimize Stata
performance at the system level. So, if there are any Stata-on-UNIX gurus
out there -- what can be tweaked on the server side so that Stata
runs faster?
My first idea was to do something about the temporary files. I don't know
much about UNIX, but my guess would be that there is an environment
variable (TMPDIR, perhaps?) that Stata looks at to decide where to put its
temporary files. (It would be nice if this were -set-able, but as far as I
remember the previous discussions on the list, Stata Corp. was not
particularly enthusiastic about it.) The best option, of course, would be
to point it at some sort of RAM drive, so that -preserve- would mean
copying a segment of RAM to RAM rather than to the hard drive. My guess is
that about half of Stata commands have a -preserve- somewhere inside, and
with a data set of, say, 400 MB (and that is the kind of data set a user
will bring to the big server, since most 20 MB jobs can be run on their
desktops or laptops) this may get a bit slow.
Is this how Stata works?
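In case it helps the discussion, here is the kind of do-file check I have
in mind for the new box: see where the temporary files actually end up,
and time one -preserve-/-restore- round trip on roughly 400 MB of data.
The c(tmpdir) lookup and the throwaway variables x1-x50 are just my own
sketch of a diagnostic, nothing official:

* where does Stata put its temporary files on this machine?
display "`c(tmpdir)'"

* build a throwaway data set of roughly 400 MB (1M obs x 50 doubles)
clear
set obs 1000000
forvalues j = 1/50 {
    generate double x`j' = runiform()
}

* time one -preserve-/-restore- round trip
timer clear 1
timer on 1
preserve
restore
timer off 1
timer list 1        // seconds spent copying the data out and back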
What else can we look at for computational speed-ups? Anything on job
allocation?
It would probably be quite beneficial if Stata users wrote their programs
efficiently, but that is something one can only hope for. I am not sure
every full professor using Stata will be willing to attend a class on
writing do-files :). Still, there must be some tricks for performance
tune-ups, but I cannot think of very many:
1. -set memory- to the size you really need. It is hard to imagine that any
command on a data set of, say, 1 MB would ever need 40 MB to do its work;
but if a user runs 20 Stata sessions asking for 40 MB each, that adds up to
quite a bit of memory. Of course one does need some overhead for, say, a
dozen -tempvar-s... but with some commands such as -outreg-, the memory
requirements may get disastrous: for a data set of 1M observations,
-outreg- (as far as I understand) creates a bunch of string variables, and
if you want 20 characters for a column title in your output, you need an
extra 20 MB for that luxury! (I dug into the -outreg- code to reduce the
size of the data set to something like 2*c(matsize), but there is still an
issue with the -preserve-/-restore- that -outreg- does: each cycle takes
some 15 seconds on my 300 MB data set.)
2. The standard -bootstrap- advice is to keep only the relevant variables
and observations; this probably applies in more general contexts, too. (A
small sketch combining points 1 and 2 follows after this list.)
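Here is roughly what I mean by points 1 and 2, with made-up file and
variable names (mybigdata, mytrimmed, y, x1, x2, clustid, sample) standing
in for whatever the real analysis uses. Newer Statas manage memory on their
own and simply ignore -set memory-, so those lines only matter on the older
versions this advice is aimed at:

* first pass: trim the data down to what the estimation actually needs
set memory 400m                 // enough for the full file
use mybigdata, clear            // hypothetical big data set
keep if sample == 1             // drop observations the estimation will not touch
keep y x1 x2 clustid            // drop variables it will not touch
compress                        // store each variable in its smallest type
save mytrimmed, replace
describe, short                 // see how small the trimmed file is

* second pass (the long job): ask only for what the trimmed file needs
clear
set memory 50m
use mytrimmed, clear
bootstrap _b, reps(200) cluster(clustid): regress y x1 x2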
There must be other pieces of advice along these lines, although many of
them may be problem specific: say, using the results of -probit- to
initialize -xtprobit- or -gllamm-. Using the results of -regress- to
initialize -probit- might also be beneficial in theory, but my guess is
that in most problems little time is saved by stuffing the -regress-
results into a matrix and pushing it into -probit-, compared to simply
maximizing the nicely concave -probit- likelihood from the initial vector
of zeros. Does anybody have any suggestions or references?
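To be concrete about the initialization idea, something along these lines
is what I have in mind, again with made-up variable names (y, x1, x2) and
panel identifier pid. Whether -xtprobit- actually gains much from the
-probit- starting values is exactly the kind of thing I am asking about:

xtset pid                       // declare the panel structure
probit y x1 x2                  // cheap pooled model
matrix b0 = e(b)                // keep its coefficient vector
* hand those coefficients to the expensive model as starting values;
* parameters probit knows nothing about (lnsig2u) start at their defaults
xtprobit y x1 x2, re from(b0, skip)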
--- Stas Kolenikov
-- Ph.D. student in Statistics at UNC-Chapel Hill
- http://www.komkon.org/~tacik/ -- [email protected]
* This e-mail and all attachments to it are not intended to provide any
* reasonable point of view and was transmitted to you in error. It
* should be immediately deleted by all recipients unless they really
* enjoy communicating with the author :). Other restrictions apply.