Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: st: inlist() equivalent for mata?
From
Andrew Maurer <[email protected]>
To
"[email protected]" <[email protected]>
Subject
RE: st: inlist() equivalent for mata?
Date
Thu, 20 Mar 2014 21:42:26 +0000
Dear Bill,
Thank you for the idea. I tried benchmarking the method you described against the original method of using an array that I posted. It seems like for small list size (rows(L)) the sum method is faster, but as soon as L length is greater than a few hundred, it is substantially slower (eg 26s vs 3.4s for L length of 2,000). I suspect that sum(b1 :==L) is internally looping through the elements of L. So with Bsize of 1 billion and Lsize of 100,000, it would effectively do a loop of 1 billion iterations over 100,000 elements. See code comparing methods below.
I also tried a method where I create a mata object f3, f9, f23, etc for every element in L. I then loop through the elements of B and stata capture if the mata object exists. This method is the slowest of them all (although I'm not quite sure why. I would have thought that checking to see if scalar f1 exists, for example, would be fast.)
Some benchmark results:
Doing 100 repetitions per method, the cumulative times for 100 repetitions for:
B 10,000 long and L 100 long: 2.25s array method, 1.79s sum method
B 100,000 long and L 100 long: 21.76s array method, 17.78s sum method
B 10,000 long and L 1,000 long: 2.86s array method, 12.62s sum method
B 10,000 long and L 2,000 long: 3.41s array method, 26.05s sum method
I still think that there should be a more efficient way to do this than the array method posted. The inefficiency comes from the collision resolution feature of asarray(). If I'm doing 1 billion iterations over anything, it's really important to have the "anything" be as quick as possible. I suspect that hash1'ing is fast, but the collision checking is slow. The initial fixed cost of creating a perfect hash table would likely be far less than the cost of collision-resolving 1 billion times, although I don't know enough about hash algorithms to know if it's possible to create a perfect hash from a given set of integers. If this were possible, some common commands in Stata, like collapse and egen, could be made much faster for large datasets, by avoiding sorting.
Andrew Maurer
********** Code *****************
/*
1) Define inlist_arraymethod()
2) Define inlist_fmethod()
3) Define inlist_summethod()
4) Benchmark methods with timers and simulated data
*/
clear all
************ define inlist_arraymethod() *********************
cap mata mata drop inlist_arraymethod()
mata
real colvector inlist_arraymethod(real colvector B, real colvector L)
{
// initialize matrices
real matrix M
real colvector R
// create the map, M
M = asarray_create("real", 1, rows(L))
asarray_notfound(M, 0)
for (i=1; i<=rows(L); i++) {
asarray(M, L[i], 1)
}
// create vector to be returned
R = J(rows(B),1,.)
for (i=1; i<=rows(B); i++) {
R[i] = asarray(M, B[i])
}
return(R)
}
end
************ end inlist_arraymethod() definition *************
************ define inlist_fmethod() *********************
mata
real colvector inlist_fmethod(real colvector B, real colvector L)
{
// initialize matrices
real colvector R
// create a mata object for every element of L
for (i=1;i<=rows(L);i++) {
stata("mata f" + strofreal(L[i]) + "=1")
}
// create vector to be returned
R = J(rows(B),1,.)
for (i=1; i<=rows(B); i++) {
stata("cap mata assert(f" + strofreal(B[i]) + ")")
stata("local rc = _rc")
R[i] = (st_local("rc") == "0")
}
return(R)
}
end
************ end inlist_fmethod() definition *************
************ define inlist_summethod() *********************
mata
real colvector inlist_summethod(real colvector B, real colvector L)
{
// initialize matrices
real colvector R
// create vector to be returned
R = J(rows(B),1,.)
for (i=1; i<=rows(B); i++) {
R[i] = sum(B[i] :==(L))
}
return(R)
}
end
************ end inlist_summethod() definition *************
********** Benchmark inlist() methods *****************
/*
For vector B, we want to generate a new vector, R, with the ith row
equal to 1 if bi is in L, otherwise 0.
*/
local Bsize 10000
local Lsize 2000
local reps 100 // increase this to lower variance of timers
set obs `Bsize'
gen B = ceil(100000*runiform())
gen L = ceil(100000*runiform()) in 1/`Lsize'
// only keep unique values in L
sort L
replace L = . if L[_n] == L[_n-1]
local t = 0
foreach m in _arraymethod _summethod { // leave _fmethod out if L >= 100,000. It starts to go very slow
local ++t
forval i = 1/`reps' {
mata B = st_data(.,"B")
mata L = st_data(.,"L",0)
timer on `t'
mata R = inlist`m'(B,L)
timer off `t'
}
}
timer list
exit
********** End benchmark ******************************
********** End code *************
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of William Gould, StataCorp LP
Sent: Thursday, March 20, 2014 10:07 AM
To: [email protected]
Subject: Re: st: inlist() equivalent for mata?
Andrew Maurer <[email protected]> writes,
> I am trying to find a mata equivalent of Stata's inlist() function, but
> have not found anything similar when looking through the directory of
> built-in mata functions.
How about
sum(x :== (x1, x2, ..., xN))
This works with numbers,
sum(x := (1, 5, 7, 9, 11))
and it works with strings,
sum(s := ("this", "that", "whatever"))
sum(x := x1, x2, ..., Xn)) returns the number of matches found.
If x1, x2, ..., Xn are distinct, it returns 1 if a match if found,
and returns 0 otherwise.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/