st: Re: speed question: collapse vs egen
First of all let me say that I think the notion of machine- and
operating-system-specific plugins for Stata is largely obsolete.
StataCorp itself has moved heavily from development in C to
development in Mata, and the avowed aim is to have virtually all of
Stata "written in Stata": that is, in ado-code or Mata. Yes, it is
slower than pure C, but it is also much easier to code, maintain and
support. We have over the years had a couple of routines with plugins
in the SSC Archive and those developers who tried to make them
available for multiple platforms were going nuts. StataCorp has one
of every machine that they support in-house, so they can afford to
develop and distribute working C "DLLs" for every combination. Most
of us do not, and code that only works on a particular platform is
not IMHO very useful.
I'm sure that Bill Gould could spot the places where this code could
be made more efficient. But, as should be evident, Mata can do this
job quite respectably, improving on pure Stata code. If I cheat and
take advantage of the fact that the by-var rep78 takes on the values
1,2,3,4,5, it runs about 20% faster than the timing shown below. Here
are my timings for Sergiy's program, where for the fourth method I
have replaced his routine (which would not run on my machine anyway)
with my Mata call:
. timer list
1: 18.08 / 1 = 18.0780
2: 16.96 / 1 = 16.9580
3: 14.17 / 1 = 14.1700
4: 7.70 / 1 = 7.6980
with results
                 1             2
    +-----------------------------+
  1 |            1        4564.5  |
  2 |            2      5967.625  |
  3 |            3   6429.233333  |
  4 |            4        6071.5  |
  5 |            5          5913  |
    +-----------------------------+
The Mata code and the Stata code calling it are:
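mata:
// Mean of vv within each level of bv, over observations flagged by touse.
// Expects the caller's local macro rr to hold the levels of bv.
void mucalc2(string scalar bv,
             string scalar vv,
             string scalar touse)
{
    mu = J(0, 2, .)
    // view onto the two variables, restricted to nonzero touse
    st_view(X=., ., (bv, vv), touse)
    // levels of bv as a numeric row vector
    a = strtoreal(tokens(st_local("rr")))
    for (i=1; i<=cols(a); i++) {
        // append one row: (level, mean of vv at that level)
        mu = mu \ mean(select(X, X[.,1] :== a[i]))
    }
    // display the result matrix
    mu
}
end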
timer on 4
mark touse
// handles missings in both price and rep78
// also not limited by Stata's matrix limits
markout touse price rep78
qui levelsof rep78, local(rr)
mata: mucalc2("rep78", "price", "touse")
timer off 4
This code may not be as fast as Sergiy's plugin (and both his code
and this Mata code can doubtless be improved) but it is a hell of a
lot more portable, as it will run on any machine with Stata 9.x or
better. I think that development along these lines is much more in
keeping with the spirit of the Stata user community.
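The shortcut mentioned above, which relies on rep78 taking only the
values 1 through 5, is not shown in this post. One way it might look,
purely as a sketch, is to drop the levelsof call and loop over the
known values directly. The function name mucalc5 is made up for
illustration, and the hard-coded 5 is exactly the assumption that
makes this a cheat:

```stata
mata:
// Hypothetical variant: assumes the by-variable takes exactly the values 1..5
void mucalc5(string scalar bv,
             string scalar vv,
             string scalar touse)
{
    real matrix X, mu
    real scalar i

    st_view(X=., ., (bv, vv), touse)
    mu = J(5, 2, .)                      // one row per known level
    for (i=1; i<=5; i++) {
        // row i holds (level, mean of vv at that level)
        mu[i, .] = mean(select(X, X[.,1] :== i))
    }
    mu
}
end

capture drop touse                       // in case touse already exists
mark touse
markout touse price rep78
mata: mucalc5("rep78", "price", "touse")
```

Whether this particular form reproduces the 20% gain is not something
the post establishes.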
For Mata mavens, note that my first draft made use of panelsetup()
using rep78 as the panel variable. It worked, but turned in timings
almost identical to those of the Stata-based methods 1,2,3.
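That first draft is not reproduced here. A panelsetup()-based version
might look roughly like the sketch below. The function name
mucalc_panel is hypothetical, and the data are assumed to be sorted by
the panel variable before the view is created, since panelsetup()
expects each panel's rows to be contiguous:

```stata
mata:
// Hypothetical panelsetup() variant: data must be sorted by bv beforehand
void mucalc_panel(string scalar bv,
                  string scalar vv,
                  string scalar touse)
{
    real matrix X, info, mu
    real scalar i

    st_view(X=., ., (bv, vv), touse)
    info = panelsetup(X, 1)              // first and last row of each panel
    mu = J(rows(info), 2, .)
    for (i=1; i<=rows(info); i++) {
        // column means of the i-th panel: (level, mean of vv)
        mu[i, .] = mean(panelsubmatrix(X, i, info))
    }
    mu
}
end

sort rep78
capture drop touse                       // in case touse already exists
mark touse
markout touse price rep78
mata: mucalc_panel("rep78", "price", "touse")
```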
Kit
Kit Baum, Boston College Economics and DIW Berlin
http://ideas.repec.org/e/pba1.html
An Introduction to Modern Econometrics Using Stata:
http://www.stata-press.com/books/imeus.html
On Apr 26, 2008, at 02:33, Sergiy wrote:
Jeph has asked about an efficient way of creating a dataset with means
of one variable over the categories of another variable. He suggested
two possible solutions and Stas added a third one.
Below I report performance of each of these methods and compare it
with the fourth: a plugin.
I use an expanded version of auto.dta and tabulate mean {price} by
different levels of {rep78}.
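The Stata-based methods themselves are not reproduced in this message.
As a rough illustration of the kind of comparison being timed, and not
the code that was actually benchmarked, a collapse-based and an
egen-based version might look like the following. The expansion factor
of 1000 is an arbitrary assumption, and the timer numbers here do not
necessarily correspond to the method numbers below:

```stata
sysuse auto, clear
drop if missing(rep78)          // keep only the five observed levels
expand 1000                     // assumed factor, for illustration only
timer clear

* collapse-based version
preserve
timer on 1
collapse (mean) meanprice = price, by(rep78)
timer off 1
list
restore

* egen-based version
preserve
timer on 2
egen meanprice = mean(price), by(rep78)
bysort rep78: keep if _n == 1
keep meanprice rep78
timer off 2
list
restore

timer list
```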
1. All methods resulted in the following table of results*
   meanprice   rep78
      4564.5       1
    5967.625       2
    6429.233       3
      6071.5       4
        5913       5
2. The timing is as follows (Stata SE, Windows Server 2003, 32-bit)
1: 33.80 / 1 = 33.7960
2: 31.22 / 1 = 31.2190
3: 21.33 / 1 = 21.3280
4: 5.58 / 1 = 5.5780
3. Since the plugin was written for a similar but not identical
purpose, it does some extra work (simultaneously computing
frequencies, etc.), so this timing is not the best it could achieve.