Dear _all,
There was a thread a few months ago about the relative speed of
-egen- and Mata. See
http://www.stata.com/statalist/archive/2008-07/msg00550.html and Bill
Gould's reply: http://www.stata.com/statalist/archive/2008-07/msg00582.html
I ran similar tests before discovering that thread, and I would like to
add a few comments in case anyone is interested.
The apparent superiority of -egen- appears to be due to the efficiency
of -bysort-. Indeed, the Mata equivalent of -egen total- runs faster
than -egen-, but the equivalent of -by id : egen total- does not.
More importantly (I think), even though Mata is slower than -by id :
egen total- for a single calculation, it is faster when one wants the
totals of several variables, because Mata can compute them all in one
call, whereas -egen- has to be run once per variable.
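A minimal sketch of the -egen- side of such a test, to show why the cost grows with the number of variables. The variable names (x1-x5, tot1-tot5), the random data and the use of -timer- are assumptions made for illustration, not the exact setup behind the timings; only the panel size (1500 ids, 300 periods) is taken from this post.

```stata
* Build a balanced panel: 1500 ids, 300 periods each
clear
set obs 1500
gen id = _n
expand 300
bysort id: gen t = _n

* Five variables to aggregate (random data, for illustration only)
forvalues j = 1/5 {
    gen x`j' = runiform()
}

* -egen- needs one command per variable, so the work is repeated 5 times
timer clear
timer on 1
forvalues j = 1/5 {
    bysort id: egen tot`j' = total(x`j')
}
timer off 1
timer list 1
```

Timer 1 then reports the elapsed seconds for the five -egen- calls together.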
Here is a summary of the timings (in seconds) on my machine:
Panel dataset, 1500 ids * 300 periods

Total sum                    mata      egen
-------------------------------------------
1 variable,  no -by-       0.1090    0.7660
1 variable,  with -by-     0.7340    0.5780
5 variables, no -by-       0.4850    3.8900
5 variables, with -by-     1.0310    2.8900
-------------------------------------------
Somewhat surprisingly, the advantage of Mata is even larger for the
equivalent of -egen min-: it is always faster than -egen-, even with
-by- and a single variable:
Min                          mata      egen
-------------------------------------------
1 variable,  no -by-       0.1090    2.8750
1 variable,  with -by-     0.7500    2.6560
5 variables, no -by-       0.4850   14.6250
5 variables, with -by-     1.0620   13.6410
-------------------------------------------
Here is the Mata code I used. I do not claim it is the most efficient
one could think of...
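/*-----total sum, no by------*/
mata:
// in:  space-separated names of existing variables
// out: space-separated names of the new variables (one per input)
void somme(string scalar in, string scalar out)
{
    real matrix    x, sx
    real rowvector idx

    st_view(x, ., tokens(in))              // view onto the input variables
    sx  = J(rows(x), 1, colsum(x))         // column totals repeated on every row
    idx = st_addvar("float", tokens(out))  // create the output variables
    st_store(., idx, sx)
}
end
/*-----------------------------*/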
/*------total sum, with by------*/
mata:
// p: name of the panel id variable; the data must be sorted by it
void sommeby(string scalar p, string scalar in, string scalar out)
{
    real matrix    id, V, x, X, sx
    real rowvector idx
    real scalar    i

    st_view(id, ., p)
    V = panelsetup(id, 1)                  // first and last row of each panel
    st_view(x, ., tokens(in))
    sx = J(rows(x), cols(x), .)
    for (i=1; i<=rows(V); i++) {
        panelsubview(X, x, i, V)           // rows belonging to panel i
        sx[V[i,1]::V[i,2], .] = J(rows(X), 1, colsum(X))
    }
    idx = st_addvar("float", tokens(out))
    st_store(., idx, sx)
}
end
/*-----------------------------*/
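/*---------minimum, no by---------*/
mata:
// in:  space-separated names of existing variables
// out: space-separated names of the new variables (one per input)
void mmin(string scalar in, string scalar out)
{
    real matrix    x, sx
    real rowvector idx

    st_view(x, ., tokens(in))              // view onto the input variables
    sx  = J(rows(x), 1, colmin(x))         // column minima repeated on every row
    idx = st_addvar("float", tokens(out))  // create the output variables
    st_store(., idx, sx)
}
end
/*---------------------------*/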
/*----------minimum, with by----------*/
mata:
// p: name of the panel id variable; the data must be sorted by it
void mminby(string scalar p, string scalar in, string scalar out)
{
    real matrix    id, V, x, X, sx
    real rowvector idx
    real scalar    i

    st_view(id, ., p)
    V = panelsetup(id, 1)                  // first and last row of each panel
    st_view(x, ., tokens(in))
    sx = J(rows(x), cols(x), .)
    for (i=1; i<=rows(V); i++) {
        panelsubview(X, x, i, V)           // rows belonging to panel i
        sx[V[i,1]::V[i,2], .] = J(rows(X), 1, colmin(X))
    }
    idx = st_addvar("float", tokens(out))
    st_store(., idx, sx)
}
end
/*----------------------------------*/
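For anyone who wants to try the functions, here is a small usage sketch that calls the -by- versions and checks them against -egen-. It assumes sommeby() and mminby() have already been compiled in the current session; the toy dataset and the variable names (x1, x2, s1, s2, m1, m2, t1, t2, e1, e2) are made up for illustration.

```stata
* Assumes sommeby() and mminby() from this post are already defined.
* Toy data: 4 ids with 5 observations each (illustrative only)
clear
set seed 12345
set obs 20
gen id = ceil(_n/5)
gen x1 = runiform()
gen x2 = runiform()
sort id                       // panelsetup() expects the data sorted by id

* Mata: both variables are handled in a single call
mata: sommeby("id", "x1 x2", "s1 s2")
mata: mminby("id", "x1 x2", "m1 m2")

* egen: one command per variable, then compare with the Mata results
forvalues j = 1/2 {
    by id: egen t`j' = total(x`j')
    by id: egen e`j' = min(x`j')
    assert reldif(s`j', t`j') < 1e-6    // totals agree up to float rounding
    assert m`j' == e`j'                 // minima should match exactly
}
```

The new variables are stored as float, so the totals are compared with a tolerance rather than for exact equality.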
Best,
Antoine