David Airey mentions that his factorial repeated-measures analysis of variance
is taking more than eight hours to finish. David describes his analysis as
"384 observations, 2 between subject factors, 4 within subject factors," which
if it's balanced would be a 2 × 2 × 2 × 2 × 2 × 2 repeated-measures ANOVA, with
the last four factors as repeated measurements. I know that -anova- can take a
while to complete when there are numerous factors and their interactions to
estimate, but eight hours seems long to me for a problem of this size.

To see how rapidly this analysis would run on my laptop (2 GHz nominal, 512
megabytes RAM, Windows XP), I created an artificial dataset that mimics David's
in what I understand as his experimental design. The do-file is attached
below. For reference, the between-subject factors are named prt (pretreatment)
and trt (treatment), and the within-subject factors are named alphabetically
(A through D).

The statistical model of the data was fully saturated, that is, with all
interaction terms, and I believe, although I am not certain, that I specified it
correctly. -anova- took 25 minutes including floppy disc access time to
log the output. This is longer, of course, than the 30 seconds claimed for
SAS's PROC MIXED, but not hours longer. I did not need a matrix size of 6000
(the do-file sets matsize to 2400), and I doubt that setting the limit that
large would have substantially increased the computation time.

Joseph Coveney
-------------------------------------------------------------------------------
clear
set more off
set matsize 2400
set obs 384
set seed 20030928
* First between-subject factor (pretreatment)
generate byte prt = _n > _N / 2
* Second between-subject factor (treatment)
sort prt
generate byte trt = mod(_n, 2)
* Subject identifier
sort trt prt
generate byte pid = mod(_n, 16) == 1
replace pid = sum(pid)
tabulate prt trt
* Balanced completely randomized factorial design
* First within-subjects factor
sort pid // Not really necessary
generate byte A = mod(_n, 2)
* Second within-subjects factor
sort pid A
generate byte B = mod(_n, 2)
* Third within-subjects factor
sort pid A B
generate byte C = mod(_n, 2)
* Fourth within-subjects factor
sort pid A B C
generate byte D = mod(_n, 2)
sort pid A B C D
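* Added sanity checks (a sketch, not part of the original do-file). They assume
* the intended layout: each subject sits in one prt-by-trt cell and contributes
* exactly one observation to each of the 2 x 2 x 2 x 2 = 16 within-subject cells.
isid pid A B C D
by pid: assert _N == 16
by pid: assert prt == prt[1] & trt == trt[1]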
by pid: generate float latent_variable = invnorm(uniform()) if _n == 1
by pid: replace latent_variable = latent_variable[1]
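* Weights 0.7 and sqrt(1 - 0.7^2) give unit variance and a within-subject
* correlation of 0.7^2 = 0.49 before the factor effects are added
generate float dep = 0.7 * latent_variable + sqrt(1 - 0.7^2) * invnorm(uniform())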
drop latent_variable
* Strictly additive (no interactions of any factors)
replace dep = dep - prt / 6 + trt / 6 - A / 6 + B / 6 - C / 6 + D / 6
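* Added arithmetic check (a sketch, not part of the original do-file): a
* saturated model in six two-level factors has 2^6 - 1 = 63 main effects and
* interactions. The -anova- command below lists 3 between-subject terms plus
* 15 within-subject strata of 4 terms each, not counting the error terms.
assert 2^6 - 1 == 3 + 15 * 4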
capture log close
log using complicated_anova.smcl, replace
set rmsg on
anova dep prt trt prt*trt / prt*trt|pid ///
A prt*A trt*A prt*trt*A / prt*trt*A|pid ///
B prt*B trt*B prt*trt*B / prt*trt*B|pid ///
A*B prt*A*B trt*A*B prt*trt*A*B / prt*trt*A*B|pid ///
C prt*C trt*C prt*trt*C / prt*trt*C|pid ///
A*C prt*A*C trt*A*C prt*trt*A*C / prt*trt*A*C|pid ///
B*C prt*B*C trt*B*C prt*trt*B*C / prt*trt*B*C|pid ///
A*B*C prt*A*B*C trt*A*B*C prt*trt*A*B*C / prt*trt*A*B*C|pid ///
D prt*D trt*D prt*trt*D / prt*trt*D|pid ///
A*D prt*A*D trt*A*D prt*trt*A*D / prt*trt*A*D|pid ///
B*D prt*B*D trt*B*D prt*trt*B*D / prt*trt*B*D|pid ///
C*D prt*C*D trt*C*D prt*trt*C*D / prt*trt*C*D|pid ///
A*B*D prt*A*B*D trt*A*B*D prt*trt*A*B*D / prt*trt*A*B*D|pid ///
A*C*D prt*A*C*D trt*A*C*D prt*trt*A*C*D / prt*trt*A*C*D|pid ///
B*C*D prt*B*C*D trt*B*C*D prt*trt*B*C*D / prt*trt*B*C*D|pid ///
A*B*C*D prt*A*B*C*D trt*A*B*C*D prt*trt*A*B*C*D
log close
set rmsg off
exit
--------------------------------------------------------------------------------