Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Robert Picard <picard@netbox.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata? |
Date | Wed, 27 Jul 2011 16:34:23 -0400 |
You don't need to sort again to find the min or max within account date groups. Here's an example of how to find the maximum: * --------------------- begin example --------------------- version 11 clear * make up some data set obs 1000000 set seed 12345 gen account = int(10000 * runiform()) gen date = int(100 * runiform()) gen desc = int(10 * runiform()) gen random = runiform() * assume that the data is sorted by account date sort account date random set rmsg on * identify account date groups with desc == 2 generate desc2 = desc == 2 * no additional sorting needed clonevar mx = desc2 by account date: replace mx = mx[_n-1] if mx[_n-1] > mx & _n > 1 by account date: replace mx = mx[_N] * using sort is slower bysort account date (desc2): replace desc2 = desc2[_N] assert desc2 == mx * --------------------- end example ----------------------- On Wed, Jul 27, 2011 at 1:09 PM, Billy Schwartz <wkschwartz@gmail.com> wrote: > I'm wondering if there is a way to make finding max and min with -by- > fast by using Mata. I tend to work with large datasets -- around 10gb > per size -- big enough that many of the technicalities I wasn't > supposed to worry about when I first started on Stata like variable > datatypes, how frequently I read/write to disk, etc, really begin to > matter. And I have noticed more and more that what I do with much of > my time on Stata is waiting for Stata to finish sorting, usually so > that I can find a minimum or maximum value. Stata has a really fast > -sum()- function for use with -by:- but not an equivalent -max()- > function, so you have to sort and select. Sorting algorithms, though > fast, are not as fast as extrema-finding algorithms. > > For example, suppose I have panel data of bills by account and date, > and each bill has a description code for each line item on the bill > and an amount for each line item. Further, the dataset is sorted by > account date > > account date desc amount > ----------------------------------------------- > 1 1 1 5.95 > 1 1 3 2.94 > 1 2 1 5.95 > 1 2 2 9.45 > 1 2 3 3.00 > 2 3 7 6.22 > [etc] > > If I want to identify bills that contain item with description value > 2, the fastest, lowest-memory-overhead way I know to do it is > > . generate byte desc2 = desc == 2 > . bysort account date (desc2): replace desc2 = desc2[_N] > > If there were a max function that worked like the sum function (I'm > not talking about the one Stata currently has, which doesn't work like > this), I could avoid the sort, since as I said my data is already > sorted by account date, and write merely: > > . by account date: generate bye desc2 = max(desc == 2) > > Mata already has a fast (built-in) function to find max and min in a > vector, which I could use on an st_view() of my dataset. But how do I > get that to work with the by: I perform in Stata? > > -- > William Schwartz > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/