Re: st: Splitting Dataset - Save by unique identifier
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Splitting Dataset - Save by unique identifier
Date: Sun, 28 Oct 2012 11:21:38 +0000
We can't advise on speeding up code you don't show us or don't explain.
In general,
0. Stata is fairly fast so long as it can hold all data in memory.
What's fastest are built-in commands written in C (invisible to the
user) and/or Mata (partly visible to the user). What's slower is the
same problem approached as interpreted ado code. What's slowest of all
is writing your code to loop over observations, as in one of your
previous posts. Only rarely is that the best practical approach.
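(To make the point concrete, here is a toy illustration of my own, using the auto data shipped with Stata: an interpreted loop over observations against a single built-in command doing the same work.)
sysuse auto, clear
gen double y = .
* slowest: interpreted loop touching one observation at a time
forval i = 1/`=_N' {
    quietly replace y = price^2 in `i'
}
* fast: one built-in command operating on the whole column
gen double y2 = price^2
assert y == y2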
1. What's best for you depends on (a) how big your dataset is, (b)
what your computer set-up is, and (c) what you're doing. Even if we
knew all that, there is still a sense in which only experiments
given (a), (b), and (c) can show what's fastest for you. You will know
this!
2. That said, my visceral feeling is that reading in the same dataset
400 times can't be the best way to do something, nor can splitting a
dataset into 10000 smaller datasets.
3. You may well be aware of Blasnik's Law, even without knowing
that name. (I named this law after Michael Blasnik, who did a lot on
this list to make clear how much it can bite.)
See e.g. http://www.stata.com/statalist/archive/2007-09/msg00264.html
for an example, but I note that the term was in use at least by 2004.
Blasnik's Law is that whenever a task can be done using -if- and
equivalently using -in-, then the -in- solution will be (much) faster.
In your case anything that centres on
<some stuff> if permno == <some value>
can be very slow because Stata will just test every observation to
work out whether the -if- condition is true. This will often be faster:
sort permno <whatever>
* regardless of any irregularities in -permno-, -permgroup- will take values 1 up
egen permgroup = group(permno)
su permgroup, meanonly
local gmax = r(max)
gen long obsno = _n
forval g = 1/`gmax' {
    * first and last observations of group `g' (data are sorted by group)
    su obsno if permgroup == `g', meanonly
    local min = r(min)
    local max = r(max)
    <all operations for this group> in `min'/`max'
}
See also
SJ-7-3 st0135 . . . . . . . . . . . Stata tip 50: Efficient use of summarize
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q3/07 SJ 7(3):438--439 (no commands)
tip on using the meanonly option of summarize to more
quickly determine the mean and other statistics of a
large dataset
Directly accessible at http://www.stata-journal.com/sjpdf.html?articlenum=st0135
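As a small aside of my own illustrating that tip: -meanonly- suppresses all output and skips the variance calculation, but the mean, minimum, maximum, sum, and count are still left behind in r().
sysuse auto, clear
summarize price, meanonly
display "mean = " r(mean) "  min = " r(min) "  max = " r(max)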
Specifically, as above, I can't see that what you are imagining --
splitting into thousands of small datasets -- is a good idea, but on
how to do it: see -savesome- (SSC), a convenience command which you
would need to call in a loop. It does _not_ require reading the whole
dataset back in once you have -save-d a part of it.
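For concreteness, a minimal sketch of such a loop, untested and with illustrative filenames; I am assuming -savesome- takes a -using- filename followed by an -if- qualifier, so check -help savesome- after installing:
ssc install savesome
levelsof permno, local(ids)
foreach id of local ids {
    * write out just this stock's observations;
    * the data in memory are left as they were
    savesome using stock_`id' if permno == `id', replace
}
The point of -savesome- here is that the data in memory remain unchanged, so the large dataset need be opened only once, not once per stock.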
Nick
On Sat, Oct 27, 2012 at 10:28 PM, Tim Streibel <[email protected]> wrote:
> I have a question. I am currently computing abnormal returns in a way that implies opening a large dataset (about 2m obs.) about 400 times, which I think costs a lot of time.
>
> So my idea is to create small datasets (for each stock one dataset). Is there a way to quickly create a dataset only containing the observations of one stock (uniquely identified by Permno)?
>
> Currently my only idea is to open the large dataset, drop all obs. except the ones of one stock, and save it. But doing that for every stock forces me to open the large dataset 10,000 times, so it doesn't really save me time.
>
> Some combination of by (permno) and save would be cool.