Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
RE: st: inlist() equivalent for mata?
From
Andrew Maurer <[email protected]>
To
"'[email protected]'" <[email protected]>
Subject
RE: st: inlist() equivalent for mata?
Date
Fri, 21 Mar 2014 17:01:19 +0000
Bill - thank you for this thorough explanation.
Just to clarify: the "array approach" refers to the code I included earlier in the thread, at the bottom of the message sent 3/20/2014 4:42 PM CST (not that this matters, though).
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of William Gould, StataCorp LP
Sent: Friday, March 21, 2014 11:42 AM
To: [email protected]
Subject: Re: st: inlist() equivalent for mata?
Andrew Maurer <[email protected]> wrote a response to my last
posting on his problem.
What follows is long, perhaps entertaining, but not really useful
for helping Andrew with this problem. It is perhaps informative,
however, in learning how to think about these kinds of problems.
In addition, I might be misinterpreting what Andrew means
by the "array approach", but that doesn't matter. I'm
going to assume that "array approach" means "asarray() approach".
Perhaps Andrew means a real array approach. If so, that would change
a word here and there in what follows, but not in how I'm thinking
about the problem.
Andrew writes,
> I tried benchmarking the method you described against the original
> method of using an array that I posted. It seems like for small list
> size (rows(L)) the sum method is faster, but as soon as L length is
> greater than a few hundred, it is substantially slower (e.g. 26s vs
> 3.4s for L length of 2,000). I suspect that sum(b1 :== L) is
> internally looping through the elements of L. So with Bsize of 1
> billion and Lsize of 100,000, it would effectively do a loop of 1
> billion iterations over 100,000 elements. See code comparing methods
> below.
Your speculation as to cause is spot on. sum() loops over all the
elements rather than stopping immediately when the element is found.
Thus, on any given SUCCESSFUL search, one expects the sum() approach to
execute roughly twice as many computer operations as a stop-on-success
method would, because on average a stop-on-success search finds the
element halfway through the list. Here's the reasoning:
1. Let N_a denote the number of values being searched over (the
values I put in the vector).
2. The number of computer operations required by the sum()
approach is thus proportional to N_a. Let's call the
proportionality factor m1, so the number of computer operations
required is m1*N_a + f1, where f1 is the fixed time-cost of
starting the search.
3. Andrew implemented a stop-on-success approach using Mata.
Assuming Andrew used asarray(), the execution time
will be m2*g(N_a)+f2. Describing g() is difficult, except
that g(N_a)<N_a for all values of N_a and, for asarray(),
dg()/dN_a < 1.
In addition, assuming Andrew is using asarray(), f2 > f1.
4. I implemented my approach using Mata's built-in function.
Andrew wrote Mata code. Thus we would expect m1 < m2.
Moreover, my approach is operation-wise simpler than the
asarray() approach used by Andrew, which is yet another reason
why m1 should be less than m2.
5. Thus, m1*N_a + f1 < m2*g(N_a) + f2 for small N_a.
As N_a increases, these conditions lead to the certainty that
the asarray() approach will eventually run faster than the
sum() approach.
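The two cost models above can be sketched in a few lines. Python stands in for Mata here, purely for illustration (the function names sum_scan and stop_on_success are mine, not Mata's): sum_scan always touches every element, like sum(b :== L), while stop_on_success uses a hash-backed set, like an asarray() lookup.

```python
def sum_scan(b, L):
    # Mimics sum(b :== L): touches every element on every call,
    # so the cost is always ~ m1*N_a + f1.
    return sum(1 for x in L if x == b) > 0

def stop_on_success(b, table):
    # Mimics a hash-table lookup: cost ~ m2*g(N_a) + f2,
    # with g(N_a) << N_a.
    return b in table

L = list(range(0, 2000, 3))   # the list being searched over
table = set(L)                # one-time build cost (part of f2)

assert sum_scan(1998, L) == stop_on_success(1998, table) == True
assert sum_scan(1999, L) == stop_on_success(1999, table) == False
```

For small L the plain scan wins on its smaller fixed cost f1; as L grows, the set lookup's near-constant g(N_a) dominates, which is exactly the crossover Andrew measured.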
> I also tried a method where I create a mata object f3, f9, f23, etc
> for every element in L. I then loop through the elements of B and
> stata capture if the mata object exists. This method is the slowest
> of them all (although I'm not quite sure why. I would have thought
> that checking to see if scalar f1 exists, for example, would be
> fast.)
No, checking whether an object exists is slow. It requires searching
tables of objects that do exist. Stata does that kind of thing
constantly. Mata, however, is a compiler, and existing things have an
address, addresses are resolved when the routine is compiled, and at
execution time, Mata goes right to the desired object. When you ask
whether an object exists, you are forcing Mata to behave like Stata's
ado language. Andrew has just discovered an important reason why Mata
is so much faster than Stata's ado.
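To make that point concrete, here is a hypothetical sketch in Python (SYMBOLS and interpreted_access are invented names for illustration, not Mata internals) of the difference between a name resolved once at compile time and an existence check that must consult a symbol table on every call.

```python
# A symbol table mapping object names to values, standing in for the
# tables of existing objects that Stata must search.
SYMBOLS = {"f%d" % i: float(i) for i in range(100_000)}

def interpreted_access(name):
    # What "capture if the object exists" forces the system to do:
    # search the table by name on every single access.
    if name in SYMBOLS:
        return SYMBOLS[name]
    return None

# "Compiled" access, as in Mata: the name was resolved to its value
# once, up front; afterwards no search is needed at all.
f42 = SYMBOLS["f42"]

assert interpreted_access("f42") == f42 == 42.0
assert interpreted_access("no_such_object") is None
```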
> I suspect that hash1'ing is fast, but the collision checking is slow.
> The initial fixed cost of creating a perfect hash table would likely
> be far less than the cost of collision-resolving 1 billion times,
> although I don't know enough about hash algorithms to know if it's
> possible to create a perfect hash from a given set of integers.
hash1()'ing is fast.
There is no such thing as a perfect hash function in the sense of
the probability of collision ever being 0. A hash function is
defined as follows:
Let U be a set of POSSIBLE values. Let u represent a member of
the set.
Hash function h = H(u, n) is defined over U such that
1 <= h <= n, h an integer, where n is a number you may
choose. The larger n is, the lower will be p, the probability
that two different u's will have the same H() value.
It is not required, however, that p ever be zero.
It is not required that p ever go to 0 because hash functions are
supposed to be general, and U is really the universe of possible values
over all possible problems. There are other solutions for
small-universe problems.
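As a concrete sketch of the definition above (Python, with H built on the language's own hash() purely for illustration):

```python
def H(u, n):
    # Maps any hashable value u to an integer h with 1 <= h <= n.
    return (hash(u) % n) + 1

n = 64
assert all(1 <= H(u, n) <= n for u in ["abc", 3.14, 12345, (1, 2)])

# Why p can never be forced to zero: by the pigeonhole principle,
# any n+1 distinct inputs must produce at least one collision.
values = list(range(n + 1))
hashes = [H(u, n) for u in values]
assert len(set(hashes)) < len(values)   # at least one collision
```

Raising n lowers the collision probability p, but as long as the universe U is larger than n, some pair of distinct u's must share a hash value.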
While there is no perfect hash function, some functions are better for
some problems than others. hash1() is known to be pretty good over a
wide range of problems.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
*