Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: the fastest way to check if unique values of a variable > 100
From: daniel klein <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: the fastest way to check if unique values of a variable > 100
Date: Tue, 27 Aug 2013 11:45:41 +0200
As I mentioned, I did not understand what you were trying to do. To me
your original post sounded like you wanted to check whether any value
of a variable is larger than 100. Re-reading your post, it is now
(hopefully) clear that you want to check whether a given variable has
more than 100 unique values. Sorry for the misunderstanding.
How about this simple Mata approach?
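mata:
// Count the unique values of varn, stopping once brk have been found.
// Assumes the data are sorted on varn.
real scalar muniq(string scalar varn, real scalar brk)
{
    real colvector x    // st_view() returns one row per observation
    real scalar    i, u

    st_view(x, ., varn)
    u = 1
    for (i = 2; i <= rows(x); ++i) {
        // in sorted data, a change from the previous row is a new value
        if (x[i, 1] != x[i - 1, 1]) ++u
        if (u >= brk) break
    }
    return(u)
}
end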
Here is a timed example compared to -tabulate-
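// example
sysuse auto, clear
expand 10000
sort price
timer clear
// timer 1: -tabulate- over all values
timer on 1
quietly tabulate price
display r(r)
timer off 1
// timer 2: muniq() with the limit set to the actual number of unique values
timer on 2
mata: muniq("price", 74)
timer off 2
// timer 3: muniq() stopping early, after 37 unique values
timer on 3
mata: muniq("price", 37)
timer off 3
timer list
// end example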
As you can see, the code should be almost as fast as -tabulate- when it
has to go through all unique values (74 in this case), and should be
faster when you cap the number of unique values to be found (at 37 in
the example).
Note that the Mata function currently requires the data to be -sort-ed
on the respective variable to work. This needs some extra time (and
one would want to integrate it into the code if one planned to make it
a serious function or program), but I guess if you constrain the
number of unique values to 100 you should still be faster than with
-tabulate-.
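A minimal sketch of what integrating the sort could look like. The name
muniq_unsorted() is made up for illustration, and the function assumes a
numeric variable. It copies the variable with st_data() and sorts the
copy, so the dataset itself stays in its original order, at the cost of
extra memory and the time needed for the sort. Unlike -tabulate- without
the -missing- option, it counts missing values as values too.

```stata
mata:
// Hypothetical variant of muniq() that does not need sorted data.
// Counts unique values of varn, stopping once brk have been found.
real scalar muniq_unsorted(string scalar varn, real scalar brk)
{
    real colvector x
    real scalar    i, u

    // copy the variable and sort the copy, not the dataset
    x = sort(st_data(., varn), 1)
    if (rows(x) == 0) return(0)

    u = 1
    for (i = 2; i <= rows(x); ++i) {
        if (x[i] != x[i - 1]) ++u
        if (u >= brk) break
    }
    return(u)
}
end

// usage: the data need not be sorted on price
sysuse auto, clear
mata: muniq_unsorted("price", 101)
```

Because the loop stops as soon as the count reaches the limit, the result
never exceeds the limit. To check for more than 100 unique values one
would therefore pass 101: a result of 101 means "more than 100", and any
smaller result is the exact count.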
Best
Daniel
--
I think your original one-liner was -cap as foo > 100 ,f-. This would
check for 100 unique values only if the values can only be positive
integers. Otherwise it leads to false positives (e.g. if I have only
two values, but one is 2321) or false negatives (if I have 2000 values
of 0(0.01)19.99 ). That's what I meant.
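The two cases can be reproduced with made-up data. This is only a sketch:
the variables a and b are invented for illustration, and it uses -count-
to contrast a check on the values themselves with the actual number of
unique values, rather than re-running the original one-liner.

```stata
clear
set obs 2000

// a: only two unique values, but one of them is 2321
generate a = cond(_n == 1, 2321, 1)
// b: 2000 unique values, 0(0.01)19.99, none of them above 100
generate b = (_n - 1) / 100

// a check on the values themselves
quietly count if a > 100
display "a: observations above 100 = " r(N)
quietly count if b > 100
display "b: observations above 100 = " r(N)

// the actual number of unique values: flag the first row of each group
bysort a: generate byte first_a = _n == 1
bysort b: generate byte first_b = _n == 1
quietly count if first_a
display "a: unique values = " r(N)
quietly count if first_b
display "b: unique values = " r(N)
```

Variable a has a value above 100 but only 2 unique values, while b has
2000 unique values and none above 100.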