Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: inlist() equivalent for mata?

From	Andrew Maurer <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st: inlist() equivalent for mata?
Date	Thu, 20 Mar 2014 21:42:26 +0000

Dear Bill,

Thank you for the idea. I tried benchmarking the method you described against the original method of using an array that I posted. It seems like for small list size (rows(L)) the sum method is faster, but as soon as L length is greater than a few hundred, it is substantially slower (eg 26s vs 3.4s for L length of 2,000). I suspect that sum(b1 :==L) is internally looping through the elements of L. So with Bsize of 1 billion and Lsize of 100,000, it would effectively do a loop of 1 billion iterations over 100,000 elements. See code comparing methods below.

I also tried a method where I create a mata object f3, f9, f23, etc for every element in L. I then loop through the elements of B and stata capture if the mata object exists. This method is the slowest of them all (although I'm not quite sure why. I would have thought that checking to see if scalar f1 exists, for example, would be fast.)

Some benchmark results:
Doing 100 repetitions per method, the cumulative times for 100 repetitions for:
B 10,000 long and L 100 long: 2.25s array method, 1.79s sum method
B 100,000 long and L 100 long: 21.76s array method, 17.78s sum method
B 10,000 long and L 1,000 long: 2.86s array method, 12.62s sum method
B 10,000 long and L 2,000 long: 3.41s array method, 26.05s sum method

I still think that there should be a more efficient way to do this than the array method posted. The inefficiency comes from the collision resolution feature of asarray(). If I'm doing 1 billion iterations over anything, it's really important to have the "anything" be as quick as possible. I suspect that hash1'ing is fast, but the collision checking is slow. The initial fixed cost of creating a perfect hash table would likely be far less than the cost of collision-resolving 1 billion times, although I don't know enough about hash algorithms to know if it's possible to create a perfect hash from a given set of integers. If this were possible, some common commands in Stata, like collapse and egen, could be made much faster for large datasets, by avoiding sorting.

Andrew Maurer 



********** Code *****************
/*
1) Define inlist_arraymethod()
2) Define inlist_fmethod()
3) Define inlist_summethod()
4) Benchmark methods with timers and simulated data
*/

clear all

************ define inlist_arraymethod() *********************
cap mata mata drop inlist_arraymethod()
mata

real colvector inlist_arraymethod(real colvector B, real colvector L)
{

// initialize matrices
real matrix M
real colvector R

// create the map, M
M = asarray_create("real", 1, rows(L))
asarray_notfound(M, 0)

for (i=1; i<=rows(L); i++) {
	asarray(M, L[i], 1)
}

// create vector to be returned
R = J(rows(B),1,.)
for (i=1; i<=rows(B); i++) {
	R[i] = asarray(M, B[i])
}

return(R)

}

end
************ end inlist_arraymethod() definition *************

************ define inlist_fmethod() *********************
mata
real colvector inlist_fmethod(real colvector B, real colvector L)
{

// initialize matrices
real colvector R

// create a mata object for every element of L
for (i=1;i<=rows(L);i++) {
	stata("mata f" + strofreal(L[i]) + "=1")
}

// create vector to be returned
R = J(rows(B),1,.)
for (i=1; i<=rows(B); i++) {
	stata("cap mata assert(f" + strofreal(B[i]) + ")")
	stata("local rc = _rc")
	R[i] = (st_local("rc") == "0")
}

return(R)

}
end
************ end inlist_fmethod() definition *************

************ define inlist_summethod() *********************
mata
real colvector inlist_summethod(real colvector B, real colvector L)
{

// initialize matrices
real colvector R

// create vector to be returned
R = J(rows(B),1,.)
for (i=1; i<=rows(B); i++) {
	R[i] = sum(B[i] :==(L))
}

return(R)

}
end
************ end inlist_summethod() definition *************


********** Benchmark inlist() methods *****************
/*
For vector B, we want to generate a new vector, R, with the ith row
equal to 1 if bi is in L, otherwise 0.
*/
local Bsize 10000
local Lsize 2000
local reps 100 // increase this to lower variance of timers

set obs `Bsize'

gen B = ceil(100000*runiform())
gen L = ceil(100000*runiform()) in 1/`Lsize'

// only keep unique values in L
sort L
replace L = . if L[_n] == L[_n-1]

local t = 0
foreach m in _arraymethod  _summethod { // leave _fmethod out if L >= 100,000. It starts to go very slow

	local ++t

	forval i = 1/`reps' {
		mata B = st_data(.,"B")
		mata L = st_data(.,"L",0)

		timer on `t'
		mata R = inlist`m'(B,L)
		timer off `t'
	}
	
}

timer list

exit
********** End benchmark ******************************
********** End code *************






-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of William Gould, StataCorp LP
Sent: Thursday, March 20, 2014 10:07 AM
To: [email protected]
Subject: Re: st: inlist() equivalent for mata?

Andrew Maurer <[email protected]> writes, 

> I am trying to find a mata equivalent of Stata's inlist() function, but
> have not found anything similar when looking through the directory of
> built-in mata functions. 

How about 

        sum(x :== (x1, x2, ..., xN))

This works with numbers, 

	sum(x := (1, 5, 7, 9, 11))

and it works with strings, 

        sum(s := ("this", "that", "whatever"))

sum(x := x1, x2, ..., Xn)) returns the number of matches found. 
If x1, x2, ..., Xn are distinct, it returns 1 if a match if found, 
and returns 0 otherwise. 

-- Bill
[email protected]
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: inlist() equivalent for mata?
  - From: Phil Schumm <[email protected]>

References:
- Re: st: inlist() equivalent for mata?
  - From: "William Gould, StataCorp LP" <[email protected]>

Prev by Date: Re: st: How to graph
Next by Date: Re: st: gllamm - multilevel models and complex samples
Previous by thread: Re: st: inlist() equivalent for mata?
Next by thread: Re: st: inlist() equivalent for mata?
Index(es):
- Date
- Thread