Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Andrew Maurer <Andrew.Maurer@qrm.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | RE: st: inlist() equivalent for mata? |
Date | Thu, 20 Mar 2014 21:42:26 +0000 |
Dear Bill, Thank you for the idea. I tried benchmarking the method you described against the original method of using an array that I posted. It seems like for small list size (rows(L)) the sum method is faster, but as soon as L length is greater than a few hundred, it is substantially slower (eg 26s vs 3.4s for L length of 2,000). I suspect that sum(b1 :==L) is internally looping through the elements of L. So with Bsize of 1 billion and Lsize of 100,000, it would effectively do a loop of 1 billion iterations over 100,000 elements. See code comparing methods below. I also tried a method where I create a mata object f3, f9, f23, etc for every element in L. I then loop through the elements of B and stata capture if the mata object exists. This method is the slowest of them all (although I'm not quite sure why. I would have thought that checking to see if scalar f1 exists, for example, would be fast.) Some benchmark results: Doing 100 repetitions per method, the cumulative times for 100 repetitions for: B 10,000 long and L 100 long: 2.25s array method, 1.79s sum method B 100,000 long and L 100 long: 21.76s array method, 17.78s sum method B 10,000 long and L 1,000 long: 2.86s array method, 12.62s sum method B 10,000 long and L 2,000 long: 3.41s array method, 26.05s sum method I still think that there should be a more efficient way to do this than the array method posted. The inefficiency comes from the collision resolution feature of asarray(). If I'm doing 1 billion iterations over anything, it's really important to have the "anything" be as quick as possible. I suspect that hash1'ing is fast, but the collision checking is slow. The initial fixed cost of creating a perfect hash table would likely be far less than the cost of collision-resolving 1 billion times, although I don't know enough about hash algorithms to know if it's possible to create a perfect hash from a given set of integers. If this were possible, some common commands in Stata, like collapse and egen, could be made much faster for large datasets, by avoiding sorting. Andrew Maurer ********** Code ***************** /* 1) Define inlist_arraymethod() 2) Define inlist_fmethod() 3) Define inlist_summethod() 4) Benchmark methods with timers and simulated data */ clear all ************ define inlist_arraymethod() ********************* cap mata mata drop inlist_arraymethod() mata real colvector inlist_arraymethod(real colvector B, real colvector L) { // initialize matrices real matrix M real colvector R // create the map, M M = asarray_create("real", 1, rows(L)) asarray_notfound(M, 0) for (i=1; i<=rows(L); i++) { asarray(M, L[i], 1) } // create vector to be returned R = J(rows(B),1,.) for (i=1; i<=rows(B); i++) { R[i] = asarray(M, B[i]) } return(R) } end ************ end inlist_arraymethod() definition ************* ************ define inlist_fmethod() ********************* mata real colvector inlist_fmethod(real colvector B, real colvector L) { // initialize matrices real colvector R // create a mata object for every element of L for (i=1;i<=rows(L);i++) { stata("mata f" + strofreal(L[i]) + "=1") } // create vector to be returned R = J(rows(B),1,.) for (i=1; i<=rows(B); i++) { stata("cap mata assert(f" + strofreal(B[i]) + ")") stata("local rc = _rc") R[i] = (st_local("rc") == "0") } return(R) } end ************ end inlist_fmethod() definition ************* ************ define inlist_summethod() ********************* mata real colvector inlist_summethod(real colvector B, real colvector L) { // initialize matrices real colvector R // create vector to be returned R = J(rows(B),1,.) for (i=1; i<=rows(B); i++) { R[i] = sum(B[i] :==(L)) } return(R) } end ************ end inlist_summethod() definition ************* ********** Benchmark inlist() methods ***************** /* For vector B, we want to generate a new vector, R, with the ith row equal to 1 if bi is in L, otherwise 0. */ local Bsize 10000 local Lsize 2000 local reps 100 // increase this to lower variance of timers set obs `Bsize' gen B = ceil(100000*runiform()) gen L = ceil(100000*runiform()) in 1/`Lsize' // only keep unique values in L sort L replace L = . if L[_n] == L[_n-1] local t = 0 foreach m in _arraymethod _summethod { // leave _fmethod out if L >= 100,000. It starts to go very slow local ++t forval i = 1/`reps' { mata B = st_data(.,"B") mata L = st_data(.,"L",0) timer on `t' mata R = inlist`m'(B,L) timer off `t' } } timer list exit ********** End benchmark ****************************** ********** End code ************* -----Original Message----- From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of William Gould, StataCorp LP Sent: Thursday, March 20, 2014 10:07 AM To: statalist@hsphsun2.harvard.edu Subject: Re: st: inlist() equivalent for mata? Andrew Maurer <Andrew.Maurer@qrm.com> writes, > I am trying to find a mata equivalent of Stata's inlist() function, but > have not found anything similar when looking through the directory of > built-in mata functions. How about sum(x :== (x1, x2, ..., xN)) This works with numbers, sum(x := (1, 5, 7, 9, 11)) and it works with strings, sum(s := ("this", "that", "whatever")) sum(x := x1, x2, ..., Xn)) returns the number of matches found. If x1, x2, ..., Xn are distinct, it returns 1 if a match if found, and returns 0 otherwise. -- Bill wgould@stata.com * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/