Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Efficiently using associative arrays in mata
From
Aljar Meesters <[email protected]>
To
[email protected]
Subject
Re: st: Efficiently using associative arrays in mata
Date
Wed, 5 Mar 2014 17:21:30 +0100
Dear Andrew,
I don't think that it possible to obtain the memory address of thisid
in mydict with the standard implementation of asarray. Although it
requires only a minor change in asarray to get this functionality, I
don't think it is of great help for you. I have changed the line
-asarray(mydict, thisid, asarray(mydict, thisid) + thisx)- into
-asarray(mydict, thisid, 1)- to mimic the case you are looking for and
although you get a speedup, it is not the speedup of a factor two you
are looking for, but maybe on your system it does, so you may try
this.
However, do you have some information about the number of observations
after the collapse? If the number of groups is small I may have some
code that is useful for you although it requires some changes in order
to suit your usecase.
Best,
Aljar
2014-03-05 1:12 GMT+01:00 Andrew Maurer <[email protected]>:
> Hi Statalist,
>
> I am trying to write a function in mata that returns sums by groups (like collapse), but without sorting. I am interested in calculating sums over groups for very large datasets (eg, 1billion observations). I have run into issues with an individual "sort" command taking around 60min to complete.
>
> I suspect that since collapse internally uses a sort, computation time is O(nlogn), or something else worse than linear. However, using associative arrays, the process could be done in linear time (though I'd imagine with higher memory usage). Ie - loop once through the whole dataset, storing the running sum of each groupid separately. I have a working example below, but the "loop through observations" section is slower than expected (using collapse as a benchmark, it's 2-3x slower for up to 10million obs).
>
> I suspect that the issue is in the line:
> asarray(mydict, thisid, asarray(mydict, thisid) + thisx)
>
> The problem is that asarray() is called twice, so the memory address of thisid in mydict has to be looked up twice. Does anyone have any suggestions on a way to complete this task, only looking up the address of thisid once, without having to rewrite the asarray() function?
>
>
>
> ************** begin code **********************
> clear all
>
> mata
> matrix array_runningsum(idname, xname) {
>
> // load data into mata
> st_view(id=0, ., idname)
> st_view(data=0, ., xname)
>
> // initialize array
> mydict = asarray_create("real", cols(id))
> asarray_notfound(mydict, 0)
>
> // loop through observations and calculate running sum
> for (i=1; i<=rows(data); i++) {
> thisid = id[i,.]
> thisx = data[i,1]
> asarray(mydict, thisid, asarray(mydict, thisid) + thisx)
> }
>
> // convert array to matrix
> N = asarray_elements(mydict)
> A = J(N,(1+cols(id)),.)
> i = 1
>
> for (loc=asarray_first(mydict); loc!=NULL; loc=asarray_next(mydict, loc)) {
> A[i++,.] = asarray_key(mydict, loc), asarray_contents(mydict, loc);
> }
>
> return(sort(A,1..cols(id)))
> }
> end
>
> // working example
> sysuse auto, clear
> mata array_runningsum(("foreign", "mpg"), "price")
>
> // compare to:
> sysuse auto, clear
> collapse (sum) price, by(foreign mpg)
> list
> ************** end code ************************
>
>
> Thank you,
>
> Andrew Maurer
>
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/