st: puzzling benchmark results for MV probit
Stata folk:
I recently put together a quad-core Intel Core 2 Quad Q6600 machine for
the express purpose of running multivariate probit Stata problems,
and similar types of Stata code.
My thought was that I would modify and run the multivariate probit
code that uses the MVNP plugin described by Cappellari and Jenkins
in 'Calculation of Multivariate Normal Probabilities by Simulation,
with Applications to Maximum Simulated Likelihood Estimation', The
Stata Journal, 6(2), pp. 156-189. In addition to being faster than the
Cappellari and Jenkins MVPROBIT routine, the ML code using MVNP
looked relatively easy to modify in order to specify starting values
for parameter estimates, rather than being required to use the
results of initial single-equation probits as starting values, as
currently seems to be the case with MVPROBIT. (I have basically zero
experience programming Stata subroutines, and am reluctant to try to
learn enough to modify the ado code. The code in Cappellari and
Jenkins, 2006, looks much easier to make small modifications to, with
the Stata programming manual in hand.)
To get some idea of what I could expect, I ran the code for
illustration 2 in C&J (2006), the downloadable example
test_mc_mvp3.do, on the following hardware configurations:

- An older dual-core Athlon 64 X2 4200 running at 2.53 GHz (modestly
  overclocked), running the 2-processor versions of Stata 9 MP and
  Stata 10 MP. SiSoft Sandra memory benchmarks show this machine
  having bandwidth of about 4.7-4.8 GB/sec and latency of 93 ns.

- A quad-core Intel Q6600 running at 2.4 GHz (stock speed), running
  Stata 10 MP 2 (using only 2 of the 4 cores) and Stata 10 MP 4.
  SiSoft Sandra memory benchmarks show this machine having bandwidth
  of about 5.8 GB/sec and latency of 83 ns.

- A homebrew Intel Core 2 Duo E4300 overclocked to 2.52 GHz, which has
  only single-channel DDR memory. SiSoft Sandra memory benchmarks show
  this machine having bandwidth of about 3.5 GB/sec and latency of
  114 ns.

All of the above machines have 2 GB of total memory.

The timer built into the example code gives the following elapsed times:

250 antithetic draws by ML code

Machine                    Stata version    Cores
Athlon 64 X2               Stata 9 MP 2     2        2.89    1434.39
Athlon 64 X2               Stata 10 MP 2    2        2.48    1401.97
Intel Core 2 Quad Q6600    Stata 10 MP 4    4         .92    1806.88
Intel Core 2 Quad Q6600    Stata 10 MP 2    2        1.03    1806.02
Intel Core 2 Duo E4300     Stata 10 MP 2    2        1.14    1726.55

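
To make the comparisons easier to read, each timing column can be expressed relative to its fastest run. The following is a minimal Python sketch that uses only the numbers reported above. The dictionary keys and the function name `relative_to_fastest` are labels made up for illustration. The columns are referred to by position only, because the table as archived does not label them individually.

```python
# Elapsed times copied from the table above, in the units reported by the
# timer in the example code. The keys are illustrative labels only.
timings = {
    "Athlon 64 X2, Stata 9 MP 2":  (2.89, 1434.39),
    "Athlon 64 X2, Stata 10 MP 2": (2.48, 1401.97),
    "Q6600, Stata 10 MP 4":        (0.92, 1806.88),
    "Q6600, Stata 10 MP 2":        (1.03, 1806.02),
    "E4300, Stata 10 MP 2":        (1.14, 1726.55),
}


def relative_to_fastest(column):
    """Return each run's time in `column` divided by the fastest time in that column."""
    values = {name: row[column] for name, row in timings.items()}
    fastest = min(values.values())
    return {name: value / fastest for name, value in values.items()}


first = relative_to_fastest(0)   # first timing column
second = relative_to_fastest(1)  # second timing column

for name in timings:
    print(f"{name:30s} {first[name]:6.2f} {second[name]:6.2f}")
```

On these figures, both Q6600 runs in the second column take about 1.29 times as long as the fastest Athlon run, while in the first column the Athlon under Stata 10 takes about 2.7 times as long as the 4-core Q6600 run.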
My conclusions:

The Athlon 64 X2 seems to run the ML w/MVNP plugin significantly
faster than either of the Intel Core 2 machines.
Disappointingly, there seems to be no speedup at all going from 2 to
4 cores. The slightly faster run on the E4300 is probably related to
the slightly faster (overclocked) clock rate on the E4300.
The Quad Core, with one or 2 cores, is faster than the Athlon running
at a slightly higher clock rate on MVPROBIT. Going from 2 to 4 cores
drops run times by about 40%. Memory bandwidth probably plays an
important role in explaining performance on MVPROBIT; the E4300 takes
more than double the time of 2 Q6600 processors running at a slightly
slower clock. MVPROBIT actually runs faster than ML with the MVNP
plugin on my quad core with either 2 or 4 cores. Not so on the E4300,
which leads me to believe that a large cache size must be needed to
enable MVPROBIT to run faster with more cores. Each of the 2 pairs of
cores on the Q6600 has 4MB cache, the E4300 has 2MB cache, and the
Athlon has 1MB (2x512K) total cache.
Mdraws has a very slight speedup with more cores. The Athlon takes more
than twice as long on this as either of the Intel machines, so this is
not being driven by fast or slow memory. My suspicion is that the
cache size is driving Mdraws performance.
The puzzles:

Why is there no speedup with more cores with MVNP and ML? Why does my
old Athlon run MVNP faster than my new Q6600? Is the MVNP plugin
written or compiled in a manner that precludes it from making use of
multiple threads on more than a single core? (If so, this is a
significant drawback to the current version of the program.) Why does
the Athlon do significantly better on the ML with MVNP version of the
benchmark, but significantly worse with everything else? (Cache
size?) (Is the Athlon floating point math better, when MVNP
constrains the problem to run on only one processor?)
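
One rough way to frame the first puzzle is Amdahl's law: if only a fraction p of a job's run time is spread across n cores, the run time scales as (1 - p) + p/n. Given two timings of the same job at two core counts, that relation can be inverted to estimate p. The sketch below does this in Python for the two Q6600 rows of the table. It assumes that Amdahl's law is an adequate model here, which is an assumption rather than something measured, and the function names are illustrative.

```python
def amdahl_speedup(p, n):
    """Speedup over one core when a fraction p of the work is spread across n cores."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(t_a, t_b, cores_a, cores_b):
    """Estimate the parallel fraction p from run times t_a and t_b measured on
    cores_a and cores_b cores, assuming time(n) is proportional to (1 - p) + p / n."""
    return (t_a - t_b) / (t_a * (1 - 1 / cores_b) - t_b * (1 - 1 / cores_a))


# Q6600 timings from the table: Stata 10 MP 2 (2 cores) vs Stata 10 MP 4 (4 cores).
p_first = parallel_fraction(1.03, 0.92, 2, 4)         # first timing column
p_second = parallel_fraction(1806.02, 1806.88, 2, 4)  # second timing column

print(f"implied parallel fraction, first column:  {p_first:.3f}")
print(f"implied parallel fraction, second column: {p_second:.3f}")
```

Under that assumption, the first column implies that roughly a third of the work is parallelized, while the second column implies a parallel fraction of essentially zero, which is what one would expect to see if that part of the job ran on a single core.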
With a multicore machine with large cache and fast memory, it would
appear that the older MVPROBIT is actually a faster method than
ML/MVNP with plugin!
In any event, my plan of building a machine optimized to run these
things faster clearly needs tuning. Any insights as to what is going
on and what would be an optimal configuration for running this type
of problem would be greatly appreciated.
Ken Flamm
University of Texas at Austin