Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: the fastest way to check if unique values of a variable > 100
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: the fastest way to check if unique values of a variable > 100
Date: Wed, 28 Aug 2013 13:12:48 -0400
Thanks, Daniel, Joe.
I think Mata can solve this without sorting. -tab- is still very fast
to abort when there are very many values, but when the number of values
is small enough that -tab- would go on to tabulate them, our little code
can break sooner (especially if the data have many observations per
category):
* test that a variable has fewer unique values than a given number
program uniquelessthan
    version 11
    syntax varname(numeric) [if] [in], Uniq(integer)
    marksample touse
    mata: st_numscalar("rc", ult("`varlist'", "`touse'", `uniq'))
    if (rc) error 134
end

mata
real scalar ult(string scalar var, string scalar touse, real scalar unq)
{
    real colvector v, unqs
    real scalar i, ulen

    ulen = 0           // just an even simpler counter for the rows of unqs
    unqs = J(0,1,.)    // unique values seen so far
    st_view(v, ., var, touse)
    for (i=1; i<=rows(v); i++) {
        if (anyof(unqs, v[i])) continue    // value already seen
        unqs = unqs \ v[i]
        if (++ulen==unq) break             // limit reached, stop early
    }
    return(ulen==unq)
}
end
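A minimal usage sketch, assuming the program and the Mata function above have already been defined in the current session. The auto data and the two limits are only illustrative choices, not part of the original question:

```stata
sysuse auto, clear

* rep78 has 5 distinct nonmissing values, so a limit of 10 is never
* reached and the program exits silently
uniquelessthan rep78, uniq(10)

* price has 74 distinct values, so a limit of 50 is reached and the
* program exits with error 134; -capture noisily- lets a do-file continue
capture noisily uniquelessthan price, uniq(50)
display _rc
```

Checking _rc after -capture- is one way to branch on the result from a calling do-file.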
On Wed, Aug 28, 2013 at 12:40 PM, Joe Canner <[email protected]> wrote:
> The requirement that the data set be sorted before calling the Mata routine can be a significant hit with a large data set. Based on my benchmarks with an 8 million observation data set, the -tabulate- method is faster unless the number of unique values is quite high (not sure where the break-even point is but it is at least in the several hundreds).
>
> Also, this routine only works for numeric variables. For something that does both string and numeric variables you could use something like:
>
> prog unique
>     syntax varname [if] [in]
>     qui levelsof `varlist' `if' `in', local(lev)
>
>     local n = 0
>     foreach l of local lev {
>         local ++n
>         if `n' > 100 {
>             continue, break
>         }
>     }
>
>     di "`n'"
> end
>
> Since -levelsof- uses -tabulate- for numeric variables, the performance for numeric variables is similar to that of -tabulate- (for reasonable numbers of unique values). And, since -levelsof- uses a different method for string variables (basically a -bysort-), it is faster than -tabulate- when you have string variables with a large number of unique values.
>
> Regards,
> Joe
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of daniel klein
> Sent: Tuesday, August 27, 2013 5:46 AM
> To: [email protected]
> Subject: Re: st: the fastest way to check if unique values of a variable > 100
>
> As I mentioned, I did not get what you are trying to do. To me your original post sounded like you wanted to check whether any value of a variable is larger than 100. Re-reading your post now, it is (hopefully) clear that you want to check whether a given variable has more than 100 unique values. Sorry for the misunderstanding.
>
> How about this simple Mata approach?
>
> m :
> real scalar muniq(string scalar varn, real scalar brk) {
>     real colvector x
>     real scalar u, i
>
>     st_view(x, ., varn)
>
>     u = 1
>     for (i = 2; i <= rows(x); ++i) {
>         if (x[i, 1] != x[i - 1, 1]) ++u
>         if (u >= brk) break
>     }
>
>     return(u)
> }
> end
>
> Here is a timed example compared to -tabulate-
>
> // example
> sysuse auto ,clear
> expand 10000
>
> sort price
>
> timer clear
> timer on 1
> qui ta price
> di r(r)
> timer off 1
>
> timer on 2
> m : muniq("price", 74)
> timer off 2
>
> timer on 3
> m : muniq("price", 37)
> timer off 3
>
> timer list
> // end example
>
> As you see, the code should be almost as fast as -tabulate- if you are going through the maximum of possible unique values (74 in this case), and should be faster if you constrain the number of unique values to be found (to 37 in the example).
>
> Note that the Mata function currently requires the data to be -sort-ed on the respective variable to work. This needs some extra time (and one would want to integrate it into the code if one planned to make it a serious function or program), but I guess if you constrain the number of unique values to 100 you should still be faster than with -tabulate-.
>
> Best
> Daniel
> --
>
> I think your original one-liner was -cap as foo > 100 ,f-. This would check for 100 unique values only if the values can only be positive integers. Otherwise it leads to false positives (e.g. if I have only two values, but one is 2321) or false negatives (if I have 2000 values of 0(0.01)19.99 ). That's what I meant.
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/