Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: RE: How to perform very simple manipulations in large data sets more efficiently
From: "Tiago V. Pereira" <[email protected]>
To: [email protected]
Subject: st: RE: How to perform very simple manipulations in large data sets more efficiently
Date: Tue, 16 Aug 2011 10:29:44 -0300 (BRT)
Billy,
Extremely helpful tips! Thanks a lot!
Cheers!
Tiago
--
I have encountered this problem a lot, and responses from Statalist
(including Nick) helped me a few weeks ago.
Using the -summarize- command should be quite fast because it's a
built-in (machine-implemented) command, but when I have needed to
find the value of X corresponding to the smallest value of Y (as in
your example), I have typically had to do it over a grouping variable
using -by-.
This brings me to my first suggestion: if there is a way to reduce
your 10,000 repetitions to one pass by marking each part of the
dataset you need to repeat over with a grouping variable (check out
-egen- and its group() function), using -by- can solve most of your
problems in one fell swoop.
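As a rough sketch of the grouping idea, assume the repetitions can be
stacked into a single dataset and are identified by two hypothetical
variables, sim and block (neither appears in the original question).
-egen- with group() then builds one identifier per combination, and a
-by- prefix processes every group in a single pass. The toy data are
made up for illustration.

```stata
clear
* toy data: sim and block are hypothetical repetition identifiers
input sim block X Y
1 1 2 10
1 1 4 20
1 2 1 30
2 1 3 40
2 1 0 50
2 2 2 60
end

* one integer id per distinct (sim, block) combination
egen groupid = group(sim block)

* any by-able command now runs once per group in a single pass;
* counting observations is only a placeholder for the real work
by groupid, sort: gen long n_in_group = _N

list, sepby(groupid)
```

Here groupid takes the values 1 to 4, and n_in_group stands in for
whatever per-group calculation is actually needed.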
Next, the real algorithmic question here is how to identify a minimum
value. That's what Statalisters helped with a few weeks ago. egen's
min function follows your "simple approach 2": sort and take the
value of Y[1]. This is SLOW. Better to find the minimum manually.
The example below uses -by- but you can do precisely the same thing by
dropping the -by- syntax.
/* example 1 */
clonevar minY = Y
/* by groupid: replace this value of minY with the previous one if the
previous one is less or the previous one is non missing and this one
is missing. no replacements if _n == 1 because minY[0] == . always and
minY > . sometimes */
by groupid, sort: replace minY = minY[_n-1] ///
    if (minY[_n-1] < minY | (minY[_n-1] < . & minY >= .)) & _n > 1
by groupid: keep if Y == minY[_N]
Without a by-grouping you can also add the local command you had
before, as follows:
/*example 2*/
clonevar minY = Y
replace minY = minY[_n-1] ///
    if (minY[_n-1] < minY | (minY[_n-1] < . & minY >= .)) & _n > 1
keep if Y == minY[_N]
local minY = minY[_N]
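Finally, if you're sure you have no missing values in Y, you can
simplify the -replace- syntax as follows. (Stata orders missing values
above every nonmissing number, so the first comparison on its own
already replaces a missing minY when the previous value is nonmissing.)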
/* example 3 fragment simplified */
replace minY = minY[_n-1] if minY[_n-1] < minY
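To see the running-minimum idea end to end, here is a small
self-contained sketch on made-up data (the values and the groupid
layout are assumptions for illustration). It targets the original
question, the value of Y at the smallest X, so the roles of X and Y
are swapped relative to the examples above, and it includes missing X
values. The /// line continuation only works inside a do-file.

```stata
clear
* toy data: two groups, each with one missing X
input groupid X Y
1 2 10
1 . 20
1 1 30
2 3 40
2 0 50
2 . 60
end

clonevar minX = X

* running minimum within each group: carry the previous value forward
* when it is smaller, or when it is nonmissing and this one is missing;
* the _n > 1 guard skips the first observation of every group
by groupid, sort: replace minX = minX[_n-1] ///
    if (minX[_n-1] < minX | (minX[_n-1] < . & minX >= .)) & _n > 1

* the last observation of each group now holds that group's minimum
by groupid: keep if X == minX[_N]

list groupid X Y
```

After the -keep-, one observation per group should remain: Y is 30
for group 1 and 50 for group 2.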
--
I thank Stas and Nick for their helpful comments on my last query.
All the best
Tiago
--
Dear statalisters,
I have to perform extremely simple tasks, but I am struggling with the low
efficiency of my dummy implementations. Perhaps you might have smarter
ideas.
Here is an example:
Suppose I have two variables, X and Y.
I need to get the value of Y that is associated with the smallest value of X.
What I usually do is:
(1) simple approach 1
/* ------ start -------- */
sum X, meanonly
keep if X==r(min)
local my_value = Y[1]
/* ------ end -------- */
(2) simple approach 2
/* ------ start -------- */
sort X
local my_value = Y[1]
/* ------ end -------- */
These approaches are simple, and work very well for small data sets. Now,
I have to repeat that procedure 10k times, for data sets that range from
500k to 1000k observations. Hence, both procedures 1 and 2 become clearly
slow.
If you have any tips, I will be very grateful.
All the best,
Tiago