|
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
st: egen and Mata
Dear all,
Because I am working with large datasets I tried to code in Mata the
equivalent to the following Stata command:
egen newvar=mean(var), by(id)
believing that the results would be much faster. I coded two very
similar Mata routines, which are listed below.
To my surprise the egen command was faster. One of the Mata routines is
amazingly slow.
My questions are:
1) Why is mean_by2() so slow?
2) Is it possible to make the Mata code more efficient/faster than
Stata's egen?
Any comments/suggestions are appreciated
Paulo Guimaraes
************************************************************************
************************************************************************
set mem 500m
Current memory allocation
current memory usage
settable value description (1M = 1024k)
--------------------------------------------------------------------
set maxvar 5000 max. variables allowed 1.909M
set memory 500M max. data space 500.000M
set matsize 400 max. RHS vars in models 1.254M
-----------
503.163M
. set obs 100000
obs was 0, now 100000
. set more off
. egen id=seq(), from(1) to(10000) block(1)
. gen y=uniform()
. timer on 1
. egen double x1=mean(y), by(id)
. timer off 1
. timer on 2
. mata: mean_by1("y","id")
. timer off 2
. timer on 3
. mata: mean_by2("y","id")
. timer off 3
. timer list
1: 0.59 / 1 = 0.5940
2: 65.81 / 1 = 65.8130
3: 0.86 / 1 = 0.8590
*******************************************************************************************
*******************************************************************************************
mata:
void function mean_by1(string scalar var, string scalar groupid)
{
real colvector order, means, ym
numeric matrix data, info, X
st_view(id=.,.,groupid)
st_view(y=.,.,var)
order=J(rows(y),1,1)
order=runningsum(order)
data=(order,y,id)
_sort(data,3)
info=panelsetup(data[.,3],1)
means=J(rows(info),1,0)
ym=J(rows(data),1,0)
for (i=1; i<=rows(info); i++) {
X=panelsubmatrix(data[.,2],i,info)
means[i,1]=mean(X)
}
for (i=1; i<=rows(info); i++) {
means[i,1]=mean(panelsubmatrix(data[.,2],i,info))
}
data=(data[.,1],ym,data[.,3])
_sort(data,1)
index1=st_addvar("double","_y1")
st_store((1,rows(y)),index1,data[.,2])
}
end
mata:
void function mean_by2(string scalar var, string scalar groupid)
{
real colvector order, means, ym
numeric matrix data, info, X
st_view(id=.,.,groupid)
st_view(y=.,.,var)
order=J(rows(y),1,1)
order=runningsum(order)
data=(order,y,id)
_sort(data,3)
info=panelsetup(data[.,3],1)
means=J(rows(info),1,0)
ym=J(rows(data),1,0)
for (i=1; i<=rows(info); i++) {
X=data[| info[i,1],2\info[i,2],2|]
means[i,1]=mean(X)
}
for (i=1; i<=rows(info); i++) {
for (j=info[i,1]; j<=info[i,2]; j++) {
ym[j,1]=means[i,1]
}
}
data=(data[.,1],ym,data[.,3])
_sort(data,1)
index1=st_addvar("double","_y2")
st_store((1,rows(y)),index1,data[.,2])
}
end
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/