Wallace, John
> I've just finished writing my first .do file for a truly enormous data
> processing task. It's now running, and I'm underwhelmed at the pace it's
> going at. I'll describe the dataset, the task, and the .do file; please
> comment on my approach and whether there is a more efficient way to run it.
>
> I have a set of ~1100000 records, consisting of 3 supergroups of 6
> replicates. Each replicate has ~61000 analytes. Each analyte is tested
> across a pair of supergroups in an unpaired t-test, with 6 replicates.
>
> Incidentally, if I'd had my way, we'd be using a oneway anova with a
> bonferroni correction for significance, but the person requesting the
> analysis wanted t-tests. I'm not sure that this would improve the speed of
> the processing though (I imagine I'll find out later, since I'll eventually
> get my way with the analysis approach).
>
> I'm using the following variables:
> analyte = member of ~61000 records (string)
> numanalyte = -encode-d analyte
> q = counter for the set of supergroups in the t-test
> i = counter for the t-test within the set of supergroups
> p`q' = title of variable in dataset for recording the calculated p-value of the test
> numsgroup = -encode-d supergroup (1, 2, or 3)
> det = float number being tested
>
> .do-file:
>
> set more off
>
> encode(analyte), gen(numanalyte)
> sum numanalyte
> local min = r(min)
> local max = r(max)
>
> forvalues q = 1(1)3 {
>     display "ttest "`q'
>     g p`q' = .
>     forvalues i = `min'(1)`max' {
>         display `i'
>
>         if `q' == 1 {
>             quietly ttest det if numanalyte == `i' & numsgroup != 3, by(numsgroup) unpaired
>         }
>         else if `q' == 2 {
>             quietly ttest det if numanalyte == `i' & numsgroup != 2, by(numsgroup) unpaired
>         }
>         else {
>             quietly ttest det if numanalyte == `i' & numsgroup != 1, by(numsgroup) unpaired
>         }
>
>         capture replace p`q' = r(p) if numanalyte == `i'
>     }
> }
> set more on
> exit
> end
>
> I'm monitoring the progress of the analysis by -display-ing `q' and `i'.
> I'm getting a new `i' displayed about once every 3.6 seconds. This leads me
> to think the entire analysis is going to take a few days! I've got a Dell
> Xeon workstation with dual 1.4GHz processors and 0.5GB memory, and more than
> sufficient hard drive space. I've allocated 200M to Stata, and I'm running
> Stata 8, fully updated (9/30).
>
> Incidentally, I pre-sorted the dataset by analyte and supergroup in the hope
> that "making them close together" would speed processing.
>
> 60 mins in, 600 tests done...it seems to be slowing down (uhoh)
David Airey has given several important pointers.
The main issue, I guess, is that you are looping over groups
when this can be vectorised. Also, a wide data structure
may be preferable.
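To make the idea concrete, here is a minimal sketch of one way to vectorise. It is not necessarily what David proposed; it assumes the variables det, numanalyte and numsgroup from the original post, roughly balanced data, and the equal-variance t test that -ttest- uses by default. The names m, s, n and se12, se13, se23 are made up for illustration. Note that -collapse- replaces the data in memory, so save the original dataset first.

```stata
* one row per analyte and supergroup: mean, sd and count of det
collapse (mean) m=det (sd) s=det (count) n=det, by(numanalyte numsgroup)

* wide structure, one row per analyte: m1 m2 m3 s1 s2 s3 n1 n2 n3
reshape wide m s n, i(numanalyte) j(numsgroup)

* pooled standard error of the difference in means for each pair
gen double se12 = sqrt((((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2)) * (1/n1 + 1/n2))
gen double se13 = sqrt((((n1-1)*s1^2 + (n3-1)*s3^2)/(n1+n3-2)) * (1/n1 + 1/n3))
gen double se23 = sqrt((((n2-1)*s2^2 + (n3-1)*s3^2)/(n2+n3-2)) * (1/n2 + 1/n3))

* two-sided p-values: p1 is groups 1 vs 2 (numsgroup != 3),
* p2 is groups 1 vs 3 (numsgroup != 2), p3 is groups 2 vs 3 (numsgroup != 1)
gen double p1 = 2*ttail(n1 + n2 - 2, abs(m1 - m2)/se12)
gen double p2 = 2*ttail(n1 + n3 - 2, abs(m1 - m3)/se13)
gen double p3 = 2*ttail(n2 + n3 - 2, abs(m2 - m3)/se23)
drop se12 se13 se23
```

Each statement here makes a single pass over the data, so there is no loop over analytes at all.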
It is worth underlining that -if- can be very slow, as Michael
Blasnik has emphasised many times. There is no special logic
whereby Stata goes straight to the observations required
and works with them. Rather it blindly goes through
every observation and tests whether the -if- condition
is satisfied. With a million observations looped over
repeatedly, this is not trivial, as you have observed. One
remedy is to recast the problem using -in-, but the solutions
pointed out by David are better in this case.
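For completeness, a sketch of what recasting with -in- might look like. It assumes the same variables as in the original post and shows only the first of the three tests; blocksize is a hypothetical helper variable. After sorting, each analyte occupies a contiguous block of observations, so -in- can point Stata at that block and the remaining -if- condition is checked only within it.

```stata
sort numanalyte numsgroup
* number of observations in each analyte's block
by numanalyte: gen long blocksize = _N
gen double p1 = .

local start = 1
while `start' <= _N {
    local end = `start' + blocksize[`start'] - 1

    * -in- limits the work to this block of observations
    quietly ttest det if numsgroup != 3 in `start'/`end', by(numsgroup) unpaired
    quietly replace p1 = r(p) in `start'/`end'

    local start = `end' + 1
}
drop blocksize
```

This still calls -ttest- once per analyte, so it should be faster than the -if- version but slower than a fully vectorised calculation.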
I add a few extra comments on what makes this slow.
First, the -display- to see how fast it's going itself
slows things down.
Second, -summarize-:
sum numanalyte
local min = r(min)
local max = r(max)
If you only want min and max, use the -meanonly-
option. However, this is done only once and is
not the main issue.
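With that option the same three lines would read as below. -summarize, meanonly- skips the variance calculation and suppresses the output, but still leaves r(min) and r(max) behind.

```stata
summarize numanalyte, meanonly
local min = r(min)
local max = r(max)
```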
Third, there is no gain in setting up an outer loop over `q'
as you throw away the saving by repeatedly
testing within it for the value of `q'.
Explicit code should be faster.
g p1 = .
g p2 = .
g p3 = .
forvalues i = `min'(1)`max' {
    quietly ttest det if numanalyte == `i' & numsgroup != 3, by(numsgroup) unpaired
    capture replace p1 = r(p) if numanalyte == `i'
    quietly ttest det if numanalyte == `i' & numsgroup != 2, by(numsgroup) unpaired
    capture replace p2 = r(p) if numanalyte == `i'
    quietly ttest det if numanalyte == `i' & numsgroup != 1, by(numsgroup) unpaired
    capture replace p3 = r(p) if numanalyte == `i'
}
I don't see that you need the -capture- there at all.
These savings will probably be much less than the other
savings from avoiding a loop over groups and -ttest-
with -if-.
Nick