John Wallace <[email protected]> asks:
> I've recently grasped the concept of nested factors, which should allow
> me to delve further into the subtleties of the data I'm studying.
> Previously I had recognized that there would be a problem with the
> assumptions of independence for comparing measurements of things nested
> within common factors, and I dealt with it by averaging everything
> (probe intensities within probe sets within chip replicates in my case)
> that couldn't be shown to be independent. Anyway, I've figured out (I
> think) how Stata deals with nesting factors, and simultaneously
> discovered that it __really__ increases the processing demand on the
> system! I've had to blow up my matsize from the default 400 to 4800,
> and it's taking a lot longer to get results from a 4000 observation, 7
> factor ANOVA. Two questions:
>
> 1. Is this to be expected, or am I still not describing something
> properly in my ANOVA statement to Stata?
>
> 2. What's happening behind the scene that causes the matsize inflation
> and bogging of processing speed?
>
> For your amusement, here's the ANOVA statement:
>
> anova logcpint benchdwellmin hyb nMES ///
> die / die|unit / die|unit|atom ///
> benchdwellmin*nMES ///
> ,continuous(benchdwellmin) partial regress anova
>
> I split it up across lines like that to help account for the factors,
> nesting levels, and interactions I'm trying to observe.

The large matrix size is probably to be expected: it depends on how
many "atom" there are, how many "unit" are nested in each "atom", and
how many "die" are nested in each "unit".

In "[R] anova", the skin rash example (starting on page 61 of the
Version 8 manual) has patient nested in doctor, which is nested in
clinic, which is nested in treatment. To look at the data, type

    . webuse rash
    . describe

In these data there are 2 treatments, 4 clinics per treatment, 3
doctors per clinic, and 4 patients per doctor.

When you run

    . anova response t / c|t / d|c|t / p|d|c|t /

the design matrix has 131 columns, so the resulting covariance
matrix e(V) is 131 by 131. Where did 131 come from?

               # of
    term      columns   df   notes
    ------------------------------------------------------------------
    _cons         1      1
    t             2      1   2 treatments; (2-1)=1 df
    c|t           8      6   2*4=8 unique clinics; 2*(4-1)=6 df
    d|c|t        24     16   2*4*3=24 unique docs; 2*4*(3-1)=16 df
    p|d|c|t      96     72   2*4*3*4=96 patients; 2*4*3*(4-1)=72 df
                ---     --
                131     96   (96 = 1 for constant + 95 for model)
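
The two totals can be checked with a few lines of arithmetic. This is only a sketch: the local macro names (t, c, d, p) are illustrative and are not variables in the rash data.

```stata
* levels at each nesting step in the rash data (macro names are illustrative)
* t = treatments, c = clinics per treatment,
* d = doctors per clinic, p = patients per doctor
local t 2
local c 4
local d 3
local p 4

* columns: constant + one column per unique level of each term
display 1 + `t' + `t'*`c' + `t'*`c'*`d' + `t'*`c'*`d'*`p'

* df: constant + (levels - 1) within each parent cell
display 1 + (`t'-1) + `t'*(`c'-1) + `t'*`c'*(`d'-1) + `t'*`c'*`d'*(`p'-1)
```

The first -display- gives 131 and the second gives 96, matching the bottom row of the table.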

So, for your data, depending on how many die are nested in each unit
and how many units are nested in each atom, you could indeed end up
with a very large matrix. Computation time goes up correspondingly
as the required matrix size increases.
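
One way to get a rough idea of the size in advance is to count the unique levels of each nested term. The sketch below assumes the nesting is as read above (die within unit within atom) and uses the variable names from the posted -anova- statement. The first_* variables are hypothetical helpers, not part of the original data.

```stata
* Assumption: die is nested in unit, which is nested in atom.
* Note that -bysort- re-sorts the data.
* first_* are hypothetical helper variables marking the first
* observation of each unique combination.
bysort atom: generate byte first_atom = (_n == 1)
bysort atom unit: generate byte first_unit = (_n == 1)
bysort atom unit die: generate byte first_die = (_n == 1)

* each count is the number of design-matrix columns that term contributes
count if first_atom
count if first_unit
count if first_die
```

Under that assumption, the three counts together with the columns for the constant, the remaining factors, and the interaction give an estimate of the matsize the model needs.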

The difference between the degrees of freedom and the number of
columns in the design matrix (96 versus 131 for the rash data) is due
to the standard "overparameterized ANOVA model" used by -anova-.
When you look at the underlying regression (using the -regress-
option of -anova-, or executing -regress- after your ANOVA to replay
it as a regression table), you will see that 35 = 131-96
coefficients are "dropped" for the rash data.

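
To see both numbers for the rash data yourself, something along these lines should work. It is a sketch based on the Version 8 behavior described above; the matrix name V is arbitrary, and the dimensions of e(V) and the way dropped terms are displayed may differ in other Stata versions.

```stata
webuse rash
anova response t / c|t / d|c|t / p|d|c|t /

* copy the covariance matrix and report its dimensions
matrix V = e(V)
display rowsof(V) " by " colsof(V)

* replay the fit as a regression table to see the dropped coefficients
regress
```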
Ken Higbee [email protected]
StataCorp 1-800-STATAPC