st: Re: Does Blasnik's Law apply to -use-?
...
Based on a few tests, it does appear to apply. The -in- approach reduced
execution time by about 50% when selecting 100K observations from the middle of
a file with 7 million obs.
In many cases, the difference in execution speed for each command is fairly
trivial -- in my tests the difference was only about 0.8 seconds. The real
speed benefits occur when the command is executed many times in a loop using a
large dataset -- such as identifying members of each panel in a dataset with
thousands of panels. If -parmby- is similar to -statsby- then the speed benefits
will be substantial for users working with large datasets with many levels of
the -by- variable, but not very large for those with few levels or smaller
datasets.
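A minimal sketch of how such a timing comparison could be run. The file size,
the variable name x and the observation range are assumptions chosen for
illustration, not the test reported above, and the documented
-use [if] [in] using filename- form is used:

```stata
* Build a hypothetical test file (1 million obs; size and names are assumptions)
clear
set obs 1000000
generate double x = runiform()
tempfile big
save `big'

* Time loading 100K observations from the middle with -in-
timer clear
timer on 1
use in 450001/550000 using `big', clear
timer off 1

* Time the equivalent selection with -if-
timer on 2
use if _n >= 450001 & _n <= 550000 using `big', clear
timer off 2

* Report elapsed seconds for timers 1 and 2
timer list
```

Timings will vary with disk speed and file caching, so it may be worth
repeating each -use- several times before comparing.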
Michael Blasnik
of Blasnik's law ;)
----- Original Message -----
From: "Newson, Roger B" <[email protected]>
To: <[email protected]>
Sent: Wednesday, September 12, 2007 10:03 AM
Subject: st: Does Blasnik's Law apply to -use-?
I have a query re Blasnik's Law, first named in the Statalist archives
by Nick Cox at
http://www.stata.com/statalist/archive/2007-08/msg00668.html
which states that using the -in- qualifier uses less computing time than
the equivalent -if- qualifier. For instance
regress mpg weight in 53/74
uses less time than
regress mpg weight if _n>=53 & _n<=74
because Stata does not have to check every observation in the dataset in
memory the first way, but has to do so the second way. My query is: Does
Blasnik's Law apply to the -use- command? That is to say, does the
statement
use mybigdata.dta in 3959/4030
use much less computing time than the statement
use mybigdata.dta if _n>=3959 & _n<=4030
which should input the same data into the memory? I ask because, as I
understand it, Stata datasets are sequential-access files (unlike SAS
datasets which I understand are random-access, with the option of having
multiple indices), and this should imply that Stata has to read through
observations 1 to 3958 before reading observation 3959.
My motivation is that I wish to streamline the command -parmby-, which
currently processes multiple by-groups by inputting the whole dataset
repeatedly, using the -restore, preserve- command, and then dropping all
by-groups except one. I am trying to think of a better way.
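One possible approach, sketched under the assumption that the file on disk is
already sorted by the by-variable: record each by-group's first and last
observation number once, then load each group with -in-. Here auto.dta stands
in for the large dataset, foreign for the by-variable, and -regress- for the
per-group work; none of this is taken from -parmby- itself.

```stata
* Demo file: auto.dta sorted by the by-variable (stand-in for a large dataset)
sysuse auto, clear
sort foreign
tempfile big
save `big'

* One pass to record each by-group's first and last observation number
use foreign using `big', clear
generate long obsno = _n
collapse (min) first=obsno (max) last=obsno, by(foreign)
local G = _N
forvalues g = 1/`G' {
    local lo`g' = first[`g']
    local hi`g' = last[`g']
}

* Load one by-group at a time with -in- instead of reloading the whole file
forvalues g = 1/`G' {
    use in `lo`g''/`hi`g'' using `big', clear
    * Placeholder for the per-group estimation
    regress mpg weight
}
```

In this demo the second group occupies observations 53/74, the same range as
in the -regress- example above.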
Best wishes (and thanks in advance)
Roger
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/