Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: st: inlist() equivalent for mata?
From
Andrew Maurer <[email protected]>
To
"[email protected]" <[email protected]>
Subject
RE: st: inlist() equivalent for mata?
Date
Fri, 21 Mar 2014 16:42:16 +0000
Phil,
Thanks for correction. You're right, I meant N >> T. And thanks for the suggestion of anyof() - that's exactly what I was looking for.
However, I actually benchmarked _arraymethod, _anyof, and _vec methods. It turns out that while anyof is faster for small Lsize up to 1,000, it's actually slower than the other methods for L above 3000 or so and it's relative performance diminishes significantly as Lsize increases. (See benchmark times below)
Aljar,
Thank you for this solution. While it's more restrictive if max(B) or max(L) are large, it performs MUCH better than the other methods as Bsize or Lsize grow. I notice little change in computation time as B or L grow.
The only frustrating part is the inefficiency of storing the vectors. Every cell of the vector B and every cell of L is taking up 8 bytes, but will only ever hold a 1 or 0. So 1 billion rows of B will take up 10^9*8 bytes ~= 8GB in memory. Theoretically, shouldn't we be able to store 1 billion 1/0s as 1 billion bits, or 10^9/8 bytes ~= 125MB? I know little about memory allocation at such a low level, so I don't know if this would be possible.
It's too bad mata doesn't have the conception of datatypes that stata has. Ie, I haven't found a way to "byte matrix B; B = J(100,1,0)". I suppose one alternative would be to st_addvar B as a byte and set up a view to it from mata. This would at least lessen the memory requirement from 8GB to 1GB, but add the overhead of st_view(). Also, is there a reason Stata doesn't have a bit datatype? This would be useful for a huge amount of Stata commands in mark/mark sample'ing.
Benchmark results:
Doing 100 repetitions per method, changing Bsize and Lsize, the cumulative times for 100 repetitions are shown below. (Note - the distributions for B and L are ceil(100,000*runiform()). Increasing this by 10x increases _vec's time by 10x, but has little effect on the other two methods.)
Below: Bsize 10,000(5,000)20,000 and Lsize 1,000(2,000)10,000:
Bsize: 10000, Lsize: 1000
_arraymethod: 0.79
_anyof: 0.44
_vec: 1.28
Bsize: 10000, Lsize: 3000
_arraymethod: 1.23
_anyof: 1.09
_vec: 1.31
Bsize: 10000, Lsize: 5000
_arraymethod: 1.43
_anyof: 1.80
_vec: 1.17
Bsize: 10000, Lsize: 7000
_arraymethod: 1.73
_anyof: 2.55
_vec: 1.12
Bsize: 10000, Lsize: 1000
_arraymethod: 0.72
_anyof: 0.39
_vec: 1.12
Bsize: 10000, Lsize: 3000
_arraymethod: 1.06
_anyof: 1.08
_vec: 1.08
Bsize: 10000, Lsize: 5000
_arraymethod: 1.33
_anyof: 1.69
_vec: 1.16
Bsize: 10000, Lsize: 7000
_arraymethod: 1.69
_anyof: 2.36
_vec: 1.09
Bsize: 10000, Lsize: 9000
_arraymethod: 2.08
_anyof: 3.09
_vec: 1.25
Bsize: 15000, Lsize: 1000
_arraymethod: 1.00
_anyof: 0.62
_vec: 1.13
Bsize: 15000, Lsize: 3000
_arraymethod: 1.26
_anyof: 1.58
_vec: 1.19
Bsize: 15000, Lsize: 5000
_arraymethod: 1.79
_anyof: 2.56
_vec: 1.11
Bsize: 15000, Lsize: 7000
_arraymethod: 1.92
_anyof: 3.51
_vec: 1.14
Bsize: 15000, Lsize: 9000
_arraymethod: 2.31
_anyof: 4.56
_vec: 1.13
Bsize: 20000, Lsize: 1000
_arraymethod: 1.26
_anyof: 0.84
_vec: 1.15
Bsize: 20000, Lsize: 3000
_arraymethod: 1.72
_anyof: 2.05
_vec: 1.13
Bsize: 20000, Lsize: 5000
_arraymethod: 2.06
_anyof: 3.36
_vec: 1.14
Bsize: 20000, Lsize: 7000
_arraymethod: 2.37
_anyof: 4.70
_vec: 1.16
Bsize: 20000, Lsize: 9000
_arraymethod: 2.70
_anyof: 6.16
_vec: 1.19
Thank you very much to everyone for the comments and thoughtful responses.
Andrew Maurer
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Aljar Meesters
Sent: Friday, March 21, 2014 8:16 AM
To: [email protected]
Subject: Re: st: inlist() equivalent for mata?
For this implementation I assume that you only have positive integers.
It is pretty fast and scales nicely in N and T. A caveat is that it
consumes memory based on the largest integers in your list and vector.
Best,
Aljar
*** Begin ***
real colvector vec_inlist(real colvector B, real colvector L){
real colvector b, l
real scalar minrows
// Determine the elements in B
b = J(max(B), 1, 0)
b[B] = J(rows(B), 1, 1)
// Determine the elements in L
l = J(max(L), 1, 0)
l[L] = J(rows(L), 1, 1)
// Combine the combination
minrows = min((rows(b), rows(l)))
answer = J(rows(b), 1, 0)
answer[|1, 1 \ minrows, 1 |] = b[|1, 1 \ minrows, 1 |] :* l[|1, 1
\ minrows, 1 |]
// Reflect back to B
answer= answer[B]
return(answer)
}
*** End ***
2014-03-21 2:09 GMT+01:00 Phil Schumm <[email protected]>:
> On Mar 20, 2014, at 4:42 PM, Andrew Maurer <[email protected]> wrote:
>> So with Bsize of 1 billion and Lsize of 100,000, it would effectively do a loop of 1 billion iterations over 100,000 elements.
>
>
> In your original post, you said that "T >> N", where T is the length of list L, and N is the length of list B. This would appear to be the opposite of what you wrote above.
>
>
>> I still think that there should be a more efficient way to do this than the array method posted.
>
>
> The fastest (and simplest) option is to use -anyof()-, e.g.,
>
>
> real colvector inlist_anyof(real colvector B, real colvector L)
> {
> real scalar i
> real colvector R
>
> R = J(rows(B),1,0)
> for (i=1; i<=rows(B); i++) {
> if (anyof(L, B[i])) R[i] = 1
> }
>
> return(R)
> }
>
>
> -- Phil
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/