Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Efficiently using associative arrays in mata
From
Aljar Meesters <[email protected]>
To
[email protected]
Subject
Re: st: Efficiently using associative arrays in mata
Date
Thu, 6 Mar 2014 18:07:36 +0100
Dear Andrew,
I had some spare moments this day and I tried to write a simple but
fast implementation of a hash table myself, just for fun. The result
is a very rudimentary hash table that out-competes the asarray
implementation (but it also has way less functionality). For you this
may be interesting though. I have implemented your array_runningsum,
which you can find below.
Best,
Aljar
mata
*** Begin ***
/* This function finds a unique position for the key in vector keys
* and returns the position of the key if it is already added to keys
*/
real scalar findPos(keys, key){
N = rows(keys)
pos = hash1(key, N)
while(keys[pos, 1] :< . & keys[pos, .] != key)
pos = mod(pos, N) + 1
return(pos)
}
matrix array_runningsum2(idname, xname){
// load data into mata
st_view(id=., ., idname)
st_view(data=., ., xname)
/* tableSize is the size of the hash table
* and determines the tradeoff between speed
* and memory consumption.
* If you set it large, nothing will hapen but
* the function will consume a lot of memroy.
* If you set it too small (smaller than the number
* of groups, the function will hang in an infinite
* loop
* If you set it equal to the amount of groups,
* you will get minimal memory usage, yet, the
* performance is really low.
* I think that you get good performance if you
* set it 3 to 4 times the number of groups.
* If you set it equal to the number of rows in
* your dataset, what I did over here, ensures
* that the function always returns (well, I hope).
*/
tableSize = ceil(rows(id))
// Create a Keys matrix and a vector for the outcomes
keys = J(tableSize, 1, J(1, cols(id), .))
values = J(tableSize, 1, 0) // Initialized at zero
for(idx = 1; idx <= rows(id); ++idx){
pos = findPos(keys, id[idx, .])
keys[pos, .] = id[idx, .]
values[pos] = values[pos] + data[idx]
}
return(sort(select((keys, values), keys[.,1] :!= .), 1..cols(id)))
}
end
*** End ***
2014-03-06 11:07 GMT+01:00 Aljar Meesters <[email protected]>:
> The method that I have in mind uses the mata function
> -minindex(v,k,i,w)-, where v is a real vector that holds your unique
> group ids,k is in your case the number of groups that you have (but
> you can set this to missing as well), i, and w, are vectors in which
> the function puts information about group sizes and elements. I have
> added a proof of concept below this e-mail. There are however two
> caveats with this approach. Although the algorithm is, as far as I
> know, not disclosed, I think that it is an O(n * k) algorithm (where k
> the number of groups). If k is large relative to n, you get worse
> performance than the O(nlogn) of the sorting algorithm. The other
> caveat is that you have not one variable that identifies your groups
> but multiple. So, you have to combine these variables into one holds
> unique identifiers. If this is done incorrectly you will end up with
> non-unique identifiers. Also, the probability that things go wrong
> increases with the number of groups.
> As a side note, if you have a reasonable idea about how many groups
> you have, you may include this information when creating your hash
> table (third argument in asarray_create). This ensures that you do not
> get any rebuilds of the hash table while adding elements, which may
> increase your performance especially if you have a lot of groups.
> Best,
>
> Aljar
>
>
> *** begin example ***
> clear all
>
> set obs 100000
>
> gen x = ceil(runiform() * 1000)
> gen y = rnormal()
>
> timer on 1
> mata
> st_view(x=., ., "x y", .)
> minindex(x[., 1], ., i=., v=.)
> N = rows(v)
> v[.,2] = runningsum(v[., 2])
> t = J(N, 1, .)
> for(idx = 1; idx <= N; ++idx){
> idx_x = i[|v[idx, 1], 1 \ v[idx, 2], 1 |]
> t[idx] = mean(x[idx_x, 2])
> }
> end
> *** end example ***
>
> 2014-03-05 19:40 GMT+01:00 Joe Canner <[email protected]>:
>> Andrew,
>>
>> By default, MS Outlook removes "extra" line breaks in plain-text messages. You can restore them for individual messages by clicking where it says "Extra lines breaks in this message were removed", above the "From:" line.
>>
>> If you want to change the default, go to File->Options->Mail->Message format. See http://support.microsoft.com/kb/287816 for more details.
>>
>> Regards,
>> Joe Canner
>> Johns Hopkins University School of Medicine
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Andrew Maurer
>> Sent: Wednesday, March 05, 2014 12:53 PM
>> To: [email protected]
>> Subject: RE: st: Efficiently using associative arrays in mata
>>
>> Thanks very much for the response, Aljar. I tried your suggestion and found that the two-lookup method took about 35% longer in the below test (around 8.5 seconds vs 11.5 seconds), which isn't negligible..
>>
>> I'd be interested in hearing your suggestion on alternative code if the number of groups is small. I use this for a few different definitions of the id variable, but on a set of 1billion observations, the final number of observations may be 2000 at the low end or 10million at the high end.
>>
>> As an aside - I've noticed that lines of pasted code sometimes merge together when I email to statalist. Does anyone know why this happens? (messages sent using "plain text" format in Outlook).
>>
>> ************ Create sample data ************* // create sample data // variables: species, breed, animal_id, and date clear all local n 1000000 qui set obs `n'
>> gen byte species = int(4/`n' * (_n-1)) + 1 gen byte breed = int(5*runiform()) + 1 gen long animal_id = int((_n-1)/100) gen int date = mod(_n,100) + 500 gen x = runiform() sort species breed date qui save temp, replace
>> ************ End create sample data *********
>>
>> *********** Begin timed benchmark *********** clear all mata
>>
>> idname = ("species","breed","animal_id") xname = "x"
>>
>> // load data into mata
>> st_view(id=0, ., idname)
>> st_view(data=0, ., xname)
>>
>> // initialize array
>> mydic = asarray_create("real", cols(id)) asarray_notfound(mydic, 0)
>>
>> timer_clear()
>> for (j=1; j<=5; j++) {
>> timer_on(j)
>> // loop through observations
>> for (i=1; i<=rows(data); i++) {
>> thisid = id[i,.]
>> thisx = data[i,1]
>> asarray(mydic, thisid, 1)
>> }
>> timer_off(j)
>> }
>>
>> for (j=6; j<=10; j++) {
>> timer_on(j)
>> // loop through observations
>> for (i=1; i<=rows(data); i++) {
>> thisid = id[i,.]
>> thisx = data[i,1]
>> asarray(mydic, thisid, asarray(mydic, thisid) + thisx) }
>> timer_off(j)
>> }
>> timer()
>>
>> end
>> exit
>> *********** End timed benchmark *************
>>
>> *********** Output ***********
>> 1. 8.63 / 1 = 8.627
>> 2. 8.77 / 1 = 8.768
>> 3. 9.03 / 1 = 9.032
>> 4. 7.5 / 1 = 7.504
>> 5. 7.6 / 1 = 7.597
>> 6. 11.6 / 1 = 11.56
>> 7. 11.2 / 1 = 11.201
>> 8. 11.5 / 1 = 11.482
>> 9. 12 / 1 = 12.044
>> 10. 12 / 1 = 11.996
>> *********** End output *******
>>
>> Thank you,
>>
>> Andrew Maurer
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Aljar Meesters
>> Sent: Wednesday, March 05, 2014 10:22 AM
>> To: [email protected]
>> Subject: Re: st: Efficiently using associative arrays in mata
>>
>> Dear Andrew,
>>
>> I don't think that it possible to obtain the memory address of thisid in mydict with the standard implementation of asarray. Although it requires only a minor change in asarray to get this functionality, I don't think it is of great help for you. I have changed the line -asarray(mydict, thisid, asarray(mydict, thisid) + thisx)- into -asarray(mydict, thisid, 1)- to mimic the case you are looking for and although you get a speedup, it is not the speedup of a factor two you are looking for, but maybe on your system it does, so you may try this.
>> However, do you have some information about the number of observations after the collapse? If the number of groups is small I may have some code that is useful for you although it requires some changes in order to suit your usecase.
>> Best,
>>
>> Aljar
>>
>>
>> 2014-03-05 1:12 GMT+01:00 Andrew Maurer <[email protected]>:
>>> Hi Statalist,
>>>
>>> I am trying to write a function in mata that returns sums by groups (like collapse), but without sorting. I am interested in calculating sums over groups for very large datasets (eg, 1billion observations). I have run into issues with an individual "sort" command taking around 60min to complete.
>>>
>>> I suspect that since collapse internally uses a sort, computation time is O(nlogn), or something else worse than linear. However, using associative arrays, the process could be done in linear time (though I'd imagine with higher memory usage). Ie - loop once through the whole dataset, storing the running sum of each groupid separately. I have a working example below, but the "loop through observations" section is slower than expected (using collapse as a benchmark, it's 2-3x slower for up to 10million obs).
>>>
>>> I suspect that the issue is in the line:
>>> asarray(mydict, thisid, asarray(mydict, thisid) + thisx)
>>>
>>> The problem is that asarray() is called twice, so the memory address of thisid in mydict has to be looked up twice. Does anyone have any suggestions on a way to complete this task, only looking up the address of thisid once, without having to rewrite the asarray() function?
>>>
>>>
>>>
>>> ************** begin code ********************** clear all
>>>
>>> mata
>>> matrix array_runningsum(idname, xname) {
>>>
>>> // load data into mata
>>> st_view(id=0, ., idname)
>>> st_view(data=0, ., xname)
>>>
>>> // initialize array
>>> mydict = asarray_create("real", cols(id)) asarray_notfound(mydict, 0)
>>>
>>> // loop through observations and calculate running sum for (i=1;
>>> i<=rows(data); i++) {
>>> thisid = id[i,.]
>>> thisx = data[i,1]
>>> asarray(mydict, thisid, asarray(mydict, thisid) +
>>> thisx) }
>>>
>>> // convert array to matrix
>>> N = asarray_elements(mydict)
>>> A = J(N,(1+cols(id)),.)
>>> i = 1
>>>
>>> for (loc=asarray_first(mydict); loc!=NULL; loc=asarray_next(mydict, loc)) {
>>> A[i++,.] = asarray_key(mydict, loc),
>>> asarray_contents(mydict, loc); }
>>>
>>> return(sort(A,1..cols(id)))
>>> }
>>> end
>>>
>>> // working example
>>> sysuse auto, clear
>>> mata array_runningsum(("foreign", "mpg"), "price")
>>>
>>> // compare to:
>>> sysuse auto, clear
>>> collapse (sum) price, by(foreign mpg)
>>> list
>>> ************** end code ************************
>>>
>>>
>>> Thank you,
>>>
>>> Andrew Maurer
>>>
>>>
>>>
>>> *
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>> * http://www.ats.ucla.edu/stat/stata/
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
>>
>>
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/