IMHO, Michael's results can be rationalized by hypothesizing that the
-in- qualifier causes -use- to read until it gets to the beginning of
the -in- range (throwing the input away), and then to read the -in-
range (copying the input to the dataset in memory), and then to close
the file containing the dataset. This would not have much effect on the
amount of file input required to read the last 1000 observations from a
file dataset containing millions of observations, but would approximately
halve the amount of file input required to read the middle 1000
observations, and would have a spectacular effect on the time required to read
the first 1000 observations, which might then be negligible compared to
the fixed cost of opening and closing the file.
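The hypothesized read-discard-copy-close behaviour can be sketched as a cost model. This is only an illustration of my hypothesis, not of Stata's actual implementation, assuming one unit of file input per observation read:

```python
def file_input_cost(first, last):
    """Observations read from disk under the hypothesis:
    -use- reads (and discards) observations 1..first-1, then
    copies observations first..last into memory, then closes
    the file.  Each observation costs one unit of file input."""
    return last  # everything up to the end of the -in- range is read

N = 1_000_000  # hypothetical dataset size

# Last 1000 observations: nearly the whole file is still read.
print(file_input_cost(N - 999, N))                   # 1000000
# Middle 1000 observations: about half the file input.
print(file_input_cost(N // 2 - 499, N // 2 + 500))   # 500500
# First 1000 observations: negligible file input.
print(file_input_cost(1, 1000))                      # 1000
```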
If my hypothesis is correct, then using -in- to read every one of a
large number of small by-groups will approximately halve the total
required file input, compared with re-reading the whole file for each
by-group. Unfortunately, the time taken will still be
quadratic in the number of by-groups. That is to say, doubling the
number of by-groups (and keeping the average by-group size constant)
will approximately quadruple the file input, not approximately double
it. This would be different from Blasnik's law (as I have always
understood it to apply to datasets already in memory), which implies
that -statsby- can process each by-group without processing any of the
other by-groups, implying an execution time linear in the number of
by-groups. Therefore, using the -in- qualifier with -use- will not have
the spectacular effect observed earlier with -statsby-.
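The quadratic-versus-linear contrast above can be checked numerically. In this sketch (again assuming my hypothesized cost model, with k by-groups of g observations each), reading by-group i with -use in- costs i*g units of file input, so the total is g*k*(k+1)/2:

```python
def total_input_in_qualifier(k, g):
    """Total file input to read each of k by-groups of size g with
    -use in-, under the hypothesized cost model: reading by-group i
    costs i*g observations (everything up to the end of its range)."""
    return sum(i * g for i in range(1, k + 1))  # = g*k*(k+1)/2

def total_input_in_memory(k, g):
    """By contrast, processing by-groups in a dataset already in
    memory (as -statsby- does) touches each observation once:
    linear in the number of by-groups."""
    return k * g

g = 10  # observations per by-group
t1 = total_input_in_qualifier(1000, g)
t2 = total_input_in_qualifier(2000, g)
print(t2 / t1)  # close to 4: doubling k roughly quadruples the input

print(total_input_in_memory(2000, g) / total_input_in_memory(1000, g))  # 2.0
```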
The bottom-line consequence of my hypothesis appears to be that, if the
user is working for the Office of Galactic Statistics and has a by-group
for each of millions of planets, then the user should use a conventional
indexed SQL-based database to create a separate Stata dataset for each
planetary by-group, and then call -parmby- separately for each planetary
dataset (either serially or in parallel).
Is my hypothesis correct?
Best wishes
Roger