From:    Nick Cox <njcoxstata@gmail.com>
To:      statalist@hsphsun2.harvard.edu
Subject: Re: st: Splitting Dataset - Save by unique identifier
Date:    Sun, 28 Oct 2012 11:21:38 +0000
We can't advise on speeding up code you don't show us or don't explain. In general:

0. Stata is fairly fast so long as it can hold all data in memory. What's fastest are built-in commands written in C (invisible to the user) and/or Mata (partly visible to the user). What's slower is the same problem approached as interpreted ado-code. What's slowest of all is writing your own code to loop over observations, as in one of your previous posts. Only rarely is that the best practical approach.

1. What's best for you depends on (a) how big your dataset is, (b) what your computer set-up is and (c) what you're doing. Even if we knew all that, there is still a sense in which only experiments given (a), (b) and (c) can establish what's fastest for you. You will know this!

2. That said, my visceral feeling is that reading in the same dataset 400 times can't be the best way to do something, nor can splitting a dataset into 10,000 smaller datasets.

3. You may already be aware of Blasnik's Law, even without knowing it by that name. (I named this law after Michael Blasnik, who did a lot on this list to make clear how much it can bite.) See e.g. http://www.stata.com/statalist/archive/2007-09/msg00264.html for an example, but I note that the term was in use at least by 2004.

Blasnik's Law is that whenever a task can be done using -if- and equivalently using -in-, the -in- solution will be (much) faster. In your case anything that centres on

    <some stuff> if permno == <some value>

can be very slow, because Stata will just test every observation to work out whether the -if- condition is true. This will often be faster:

    sort permno <whatever>
    * regardless of any irregularities in -permno-,
    * -permgroup- will take values 1 up
    egen permgroup = group(permno)
    su permgroup, meanonly
    local gmax = r(max)
    gen long obsno = _n
    forval g = 1/`gmax' {
        su obsno if permgroup == `g', meanonly
        local min = r(min)
        local max = r(max)
        <all operations for this group> in `min'/`max'
    }

See also

SJ-7-3  st0135  . . . . . . . . . . .  Stata tip 50: Efficient use of summarize
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q3/07   SJ 7(3):438--439                                 (no commands)
        tip on using the meanonly option of summarize to more
        quickly determine the mean and other statistics of a
        large dataset

Directly accessible at http://www.stata-journal.com/sjpdf.html?articlenum=st0135

Specifically, as above, I can't see that what you are imagining -- splitting into thousands of small datasets -- is a good idea, but on how to do it: see -savesome- (SSC), a convenience command which you would need to call in a loop. It does _not_ require reading the whole dataset back in once you have -save-d a part of it.

Nick

On Sat, Oct 27, 2012 at 10:28 PM, Tim Streibel <Tim.Streibel@gmx.de> wrote:
> I have a question. I am currently computing abnormal returns in a way
> that implies opening a large dataset (about 2m obs.) about 400 times,
> which I think costs a lot of time.
>
> So my idea is to create small datasets (one dataset for each stock).
> Is there a way to quickly create a dataset containing only the
> observations of one stock (uniquely identified by Permno)?
>
> Currently my only idea is to open the large dataset, drop all obs.
> except the ones of one stock, and save it. But doing that for every
> stock forces me to open the large dataset 10,000 times, so it doesn't
> really save me time.
>
> Some combination of by(permno) and save would be cool.
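
For what it's worth, a minimal sketch of that -savesome- loop, re-using the sort/egen/obsno set-up above. -savesome- must first be installed from SSC, the filename stub stock_ is made up for illustration, and the exact placement of the [if] [in] qualifiers should be checked against -help savesome-:

    * run once: ssc install savesome
    forval g = 1/`gmax' {
        su obsno if permgroup == `g', meanonly
        * write only this group's observations, leaving all data in memory
        savesome in `r(min)'/`r(max)' using stock_`g', replace
    }

This writes one file per permno without ever dropping or re-reading the 2m observations, which is what makes it cheaper than 10,000 rounds of use-drop-save.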
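
And because, as in point 1 above, only experiments on your own data can establish what's fastest, here is a throwaway timing harness for Blasnik's Law. The variable ret is a stand-in for whatever you actually summarize; permgroup and obsno come from the block above:

    su obsno if permgroup == 1, meanonly
    local min = r(min)
    local max = r(max)
    timer clear
    timer on 1
    quietly su ret if permgroup == 1    // -if-: scans every observation
    timer off 1
    timer on 2
    quietly su ret in `min'/`max'       // -in-: touches only this group
    timer off 2
    timer list

On a large dataset the second timer should come out far smaller; multiply the difference by 10,000 calls to see why it matters.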