Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata?
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata?
Date: Thu, 28 Jul 2011 10:47:34 +0100

For the record, the third method below is a little slower than Robert's.
Calling -max(,)- carries some overhead compared with the direct
comparison that Robert uses. But it is still much faster than the
method that sorts again.
* third method
clonevar mx2 = desc2
by account date: replace mx2 = max(mx2[_n-1], desc2)
by account date: replace mx2 = mx2[_N]
Nick
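
As a small illustration of why the third method works, here is a sketch run on the six example rows from Billy's original post (quoted further down). The first -replace- carries a running maximum forward within each account date group; -max()- ignores the missing value that mx2[_n-1] returns for the first observation of a group. The second -replace- copies the last running maximum, which is the group maximum, to every observation. The explicit -sort- is only there because -input- does not mark the data as sorted, and the final -assert- is an added check, not part of the method.

```stata
* toy data: the six rows from the original question
clear
input account date desc amount
1 1 1 5.95
1 1 3 2.94
1 2 1 5.95
1 2 2 9.45
1 2 3 3.00
2 3 7 6.22
end

* -by- needs the sort order to be recorded; real data would already have it
sort account date, stable

generate byte desc2 = desc == 2

* running maximum within group, then spread the final value
clonevar mx2 = desc2
by account date: replace mx2 = max(mx2[_n-1], desc2)
by account date: replace mx2 = mx2[_N]

* only the bill for account 1, date 2 contains an item with desc == 2
assert mx2 == (account == 1 & date == 2)
list, sepby(account date)
```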
On Wed, Jul 27, 2011 at 9:34 PM, Robert Picard <[email protected]> wrote:
> You don't need to sort again to find the min or max within account
> date groups. Here's an example of how to find the maximum:
>
> * --------------------- begin example ---------------------
> version 11
> clear
>
> * make up some data
> set obs 1000000
> set seed 12345
> gen account = int(10000 * runiform())
> gen date = int(100 * runiform())
> gen desc = int(10 * runiform())
> gen random = runiform()
>
> * assume that the data is sorted by account date
> sort account date random
>
> set rmsg on
>
> * identify account date groups with desc == 2
> generate desc2 = desc == 2
>
> * no additional sorting needed
> clonevar mx = desc2
> by account date: replace mx = mx[_n-1] if mx[_n-1] > mx & _n > 1
> by account date: replace mx = mx[_N]
>
> * using sort is slower
> bysort account date (desc2): replace desc2 = desc2[_N]
>
> assert desc2 == mx
>
> * --------------------- end example -----------------------
>
>
> On Wed, Jul 27, 2011 at 1:09 PM, Billy Schwartz <[email protected]> wrote:
>> I'm wondering if there is a way to make finding max and min with -by-
>> fast by using Mata. I tend to work with large datasets -- around 10gb
>> per size -- big enough that many of the technicalities I wasn't
>> supposed to worry about when I first started on Stata like variable
>> datatypes, how frequently I read/write to disk, etc, really begin to
>> matter. And I have noticed more and more that what I do with much of
>> my time on Stata is waiting for Stata to finish sorting, usually so
>> that I can find a minimum or maximum value. Stata has a really fast
>> -sum()- function for use with -by:- but not an equivalent -max()-
>> function, so you have to sort and select. Sorting algorithms, though
>> fast, are not as fast as extrema-finding algorithms.
>>
>> For example, suppose I have panel data of bills by account and date,
>> and each bill has a description code for each line item on the bill
>> and an amount for each line item. Further, the dataset is sorted by
>> account date:
>>
>> account  date  desc  amount
>> ----------------------------
>>       1     1     1    5.95
>>       1     1     3    2.94
>>       1     2     1    5.95
>>       1     2     2    9.45
>>       1     2     3    3.00
>>       2     3     7    6.22
>> [etc]
>>
>> If I want to identify bills that contain an item with description value
>> 2, the fastest, lowest-memory-overhead way I know to do it is
>>
>> . generate byte desc2 = desc == 2
>> . bysort account date (desc2): replace desc2 = desc2[_N]
>>
>> If there were a max function that worked like the sum function (I'm
>> not talking about the one Stata currently has, which doesn't work like
>> this), I could avoid the sort, since as I said my data is already
>> sorted by account date, and write merely:
>>
>> . by account date: generate byte desc2 = max(desc == 2)
>>
>> Mata already has a fast (built-in) function to find max and min in a
>> vector, which I could use on an st_view() of my dataset. But how do I
>> get that to work with the by: I perform in Stata?
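
On the Mata question itself, one possible route is sketched below. It is an assumption-laden illustration, not something proposed in this thread: it builds a group identifier without sorting, then uses Mata's panelsetup() to find the first and last row of each group and max() on each block of a view. The variable names first, grp, and mx3 are made up for the example, and the toy data are the six rows from the question. Whether this beats the -by:- idioms on large data has not been timed here.

```stata
* toy data: the six rows from the question
clear
input account date desc amount
1 1 1 5.95
1 1 3 2.94
1 2 1 5.95
1 2 2 9.45
1 2 3 3.00
2 3 7 6.22
end
sort account date, stable        // real data would already be sorted

generate byte desc2 = desc == 2

* group identifier built from the existing sort order (hypothetical names)
by account date: generate byte first = _n == 1
generate long grp = sum(first)
generate byte mx3 = .

mata:
st_view(G = ., ., "grp")
st_view(D = ., ., "desc2")
st_view(M = ., ., "mx3")

// one row per group: first and last observation number
info = panelsetup(G, 1)

for (i = 1; i <= rows(info); i++) {
    a = info[i, 1]
    b = info[i, 2]
    // write the group maximum into every row of the group
    M[|a, 1 \ b, 1|] = J(b - a + 1, 1, max(D[|a, 1 \ b, 1|]))
}
end

* added check: only account 1, date 2 has an item with desc == 2
assert mx3 == (account == 1 & date == 2)
```

Because M is a view, the assignments inside the loop write straight into the Stata variable mx3, so nothing needs to be copied back afterwards.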