Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: How to perfom very simple manipulations in large data sets more efficiently
From: Billy Schwartz <[email protected]>
To: [email protected]
Subject: Re: st: How to perfom very simple manipulations in large data sets more efficiently
Date: Mon, 15 Aug 2011 15:45:57 -0400
I have encountered this problem a lot, and responses from Statalist
(including Nick) helped me a few weeks ago.
The -summarize- command should be quite fast because it is a
built-in (machine-implemented) command. When I have needed the value
of X corresponding to the smallest value of Y (your problem, with the
roles of X and Y swapped), though, I have usually had to do it within
groups, over a grouping variable using -by-.
This brings me to my first suggestion: if you can reduce your 10,000
repetitions to one pass by marking each part of the dataset you need
to repeat over with a grouping variable (check out -egen- and its
group() function), using -by- can solve most of your problems in one
fell swoop.
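
A minimal sketch of that setup, assuming the repetitions are stacked
in one dataset and identified by a variable called rep here (the name
and the toy data are hypothetical, not from the original question):

```stata
* Toy data: two stacked repetitions, identified by the hypothetical variable rep
clear
input str1 rep X Y
"a" 10 5
"a" 20 3
"b" 40 7
"b" 50 9
end

* egen's group() function maps each distinct value of rep to 1, 2, ...
egen groupid = group(rep)

* Any by-able command now runs once per group in a single call
by groupid, sort: generate long n_in_group = _N

* Sanity checks on the toy data
assert groupid == 1 if rep == "a"
assert groupid == 2 if rep == "b"
assert n_in_group == 2
```

The point is only that the loop over repetitions becomes a single
by-prefixed command; the command after the colon is a placeholder.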
Next, the real algorithmic question here is how to identify a minimum
value. That's what Statalisters helped me with a few weeks ago. egen's
min() function uses your "simple approach 2": sort and take the value
of Y[1]. This is SLOW. Better to find the minimum manually, with a
running minimum computed in a single pass through the data. The
example below uses -by-, but you can do precisely the same thing
without groups by dropping the -by- prefix.
/* example 1 */
clonevar minY = Y
/* Within each group, carry the running minimum forward: replace this
value of minY with the previous one if the previous one is smaller, or
if the previous one is nonmissing and this one is missing. Nothing is
replaced when _n == 1, because minY[0] is always missing. The ///
continues the command on the next line (do-files only). */
by groupid, sort: replace minY = minY[_n-1] if _n > 1 & ///
    (minY[_n-1] < minY | (minY[_n-1] < . & minY >= .))
/* minY[_N] now holds the group minimum; keep the observation(s) that
attain it */
by groupid: keep if Y == minY[_N]
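
To see the by-group version at work, here is a small self-contained
run on made-up data (the values are hypothetical, and group 1 includes
a missing Y on purpose). It is a sketch for checking the logic, not a
timing test. The /// continues the long command across lines, which
works in a do-file but not at the interactive prompt.

```stata
* Toy data (hypothetical values); group 1 includes a missing Y
clear
input groupid X Y
1 10 5
1 20 3
1 30 .
2 40 7
2 50 9
end

* Running minimum of Y within each group
clonevar minY = Y
by groupid, sort: replace minY = minY[_n-1] if _n > 1 & ///
    (minY[_n-1] < minY | (minY[_n-1] < . & minY >= .))

* The last observation of each group now carries the group minimum
by groupid: keep if Y == minY[_N]

* One observation per group survives: the one holding the smallest Y
sort groupid
assert _N == 2
assert X[1] == 20 & Y[1] == 3
assert X[2] == 40 & Y[2] == 7
list groupid X Y
```

If several observations in a group tie for the smallest Y, all of them
are kept.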
Without a by-grouping you can also add the local command you had
before, as follows:
/* example 2 */
clonevar minY = Y
replace minY = minY[_n-1] if _n > 1 & ///
    (minY[_n-1] < minY | (minY[_n-1] < . & minY >= .))
keep if Y == minY[_N]
local minY = minY[_N]
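
Mapped back onto the original question (the value of Y at the smallest
X), the same idea might look like the sketch below. The data are made
up, and my_value is the local name from the original post. Note that
-keep- discards the rest of the data, as it does in "simple approach 1".

```stata
* Toy data (hypothetical values): the smallest X is 1, where Y is 10
clear
input X Y
3 30
1 10
2 20
end

* Running minimum of X over the whole dataset, no sort needed
clonevar minX = X
replace minX = minX[_n-1] if _n > 1 & ///
    (minX[_n-1] < minX | (minX[_n-1] < . & minX >= .))

* minX[_N] is the overall minimum; keep the observation(s) attaining it
keep if X == minX[_N]
local my_value = Y[1]

display `my_value'
assert `my_value' == 10
```

Using -clonevar- keeps minX in the same storage type as X, so the
equality test in -keep if- compares like with like.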
Finally, if you're sure you have no missing values in Y, you can
simplify the -replace- command as follows:
/* example 3: simplified fragment */
replace minY = minY[_n-1] if minY[_n-1] < minY
On Mon, Aug 15, 2011 at 12:57 PM, Tiago V. Pereira
<[email protected]> wrote:
>
> I thank Stas and Nick for their helpful comments on my last query.
>
> All the best
>
> Tiago
>
> --
> Dear statalisters,
>
> I have to perform extremely simple tasks, but I am struggling with the low
> efficiency of my dummy implementations. Perhaps you might have smarter
> ideas.
>
> Here is an example:
>
> Suppose I have two variables, X and Y.
>
> I need to get the value of Y that is associated with the smallest value of X.
>
> What I usually do is:
>
> (1) simple approach 1
>
> */ ------ start --------
> sum X, meanonly
> keep if X==r(min)
> local my_value = Y[1]
> */ ------ end --------
>
> (2) simple approach 2
>
> */ ------ start --------
> sort X
> local my_value = Y[1]
> */ ------ end --------
>
> These approaches are simple, and work very well for small data sets. Now,
> I have to repeat that procedure 10k times, for data sets that range from
> 500k to 1000k observations. Hence, both procedures 1 and 2 become clearly
> slow.
>
> If you have any tips, I will be very grateful.
>
> All the best,
>
> Tiago
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/