Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: Efficiently using associative arrays in mata
From
Andrew Maurer <[email protected]>
To
Statalist Statalist <[email protected]>
Subject
st: Efficiently using associative arrays in mata
Date
Wed, 5 Mar 2014 00:12:16 +0000
Hi Statalist,
I am trying to write a function in mata that returns sums by groups (like collapse), but without sorting. I am interested in calculating sums over groups for very large datasets (eg, 1billion observations). I have run into issues with an individual "sort" command taking around 60min to complete.
I suspect that since collapse internally uses a sort, computation time is O(nlogn), or something else worse than linear. However, using associative arrays, the process could be done in linear time (though I'd imagine with higher memory usage). Ie - loop once through the whole dataset, storing the running sum of each groupid separately. I have a working example below, but the "loop through observations" section is slower than expected (using collapse as a benchmark, it's 2-3x slower for up to 10million obs).
I suspect that the issue is in the line:
asarray(mydict, thisid, asarray(mydict, thisid) + thisx)
The problem is that asarray() is called twice, so the memory address of thisid in mydict has to be looked up twice. Does anyone have any suggestions on a way to complete this task, only looking up the address of thisid once, without having to rewrite the asarray() function?
************** begin code **********************
clear all
mata
matrix array_runningsum(idname, xname) {
// load data into mata
st_view(id=0, ., idname)
st_view(data=0, ., xname)
// initialize array
mydict = asarray_create("real", cols(id))
asarray_notfound(mydict, 0)
// loop through observations and calculate running sum
for (i=1; i<=rows(data); i++) {
thisid = id[i,.]
thisx = data[i,1]
asarray(mydict, thisid, asarray(mydict, thisid) + thisx)
}
// convert array to matrix
N = asarray_elements(mydict)
A = J(N,(1+cols(id)),.)
i = 1
for (loc=asarray_first(mydict); loc!=NULL; loc=asarray_next(mydict, loc)) {
A[i++,.] = asarray_key(mydict, loc), asarray_contents(mydict, loc);
}
return(sort(A,1..cols(id)))
}
end
// working example
sysuse auto, clear
mata array_runningsum(("foreign", "mpg"), "price")
// compare to:
sysuse auto, clear
collapse (sum) price, by(foreign mpg)
list
************** end code ************************
Thank you,
Andrew Maurer
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/