Here is a little nitty-gritty problem,
and I do know Stata solutions. My interest
is whether there are others I have missed,
and above all in views on what is most
natural as a solution, and what has fewest
possible disadvantages or side-effects. (As a secondary
detail I have a proposal for generalising
two existing Stata functions.)
I want to round down, in multiples of
some (fixed) number. For concreteness, say
I want to round -mpg- in the auto data
in multiples of 5, so that any values
10-14 get rounded to 10, any values 15-19
to 15, etc. (-mpg- is simple in that
only integer values occur; in many other
cases we clearly have fractional parts to think
about as well.)
Note that the solution is _not_ the function call
round(mpg, 5)
as this rounds to the nearest multiple
of 5, which could be either rounding up
or rounding down: often useful, but
not what I want here.
round(mpg - 2.5, 5)
seems all right, but also a little too
much like a dodge.
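To see the difference between the two calls, here is a minimal sketch, assuming the auto data shipped with Stata (loaded via -sysuse-); the variable names nearest and shifted are just illustrative.

```stata
sysuse auto, clear

* nearest multiple of 5: may go up or down
generate nearest = round(mpg, 5)

* shift by half the interval first, so that the result rounds down
generate shifted = round(mpg - 2.5, 5)

* a few values on either side of a class boundary
list mpg nearest shifted if inlist(mpg, 12, 14, 17, 18), noobs
```

For mpg of 14, nearest should be 15 while shifted is 10. Note that the shifted version relies on ties (7.5, 12.5, ...) being rounded up, which is part of why it feels like a dodge.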
Similarly, the solution could be the function call
-recode(-mpg,-40,-35,-30,-25,-20,-15,-10)
but that's a bit backward for my taste.
Note all the negative signs in the above:
negating and then negating again to reverse it are made necessary
by the fact that -recode()- uses
its numeric arguments as upper limits,
i.e. it rounds up. However, the call above is not the same as
recode(mpg,15,20,25,30,35,40,45) - 5
as with the latter values of exactly 15 20 ...
get mapped to 10 15 ... , again not what
I want.
recode(mpg,14,19,24,29,34,39,44) - 4
fixes that, but I find it a bit too
much like thinking to have to work that
out, especially on the fly, and it doesn't
generalise easily to non-integers so far
as I can see. (Subtract 4.9, or 4.99, etc.
and you could run into precision problems.)
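The three -recode()- variants can be put side by side. This is only a sketch, again assuming the auto data; r1, r2 and r3 are hypothetical variable names, and the second call is written with limits running 15(5)45.

```stata
sysuse auto, clear

* negate, recode against upper limits, negate back
generate r1 = -recode(-mpg, -40, -35, -30, -25, -20, -15, -10)

* upper limits at multiples of 5, then subtract 5
generate r2 = recode(mpg, 15, 20, 25, 30, 35, 40, 45) - 5

* upper limits one below each multiple of 5, then subtract 4
generate r3 = recode(mpg, 14, 19, 24, 29, 34, 39, 44) - 4

* compare behaviour at and around an exact multiple of 5
list mpg r1 r2 r3 if inlist(mpg, 14, 15, 16), noobs
```

Only at mpg of exactly 15 should the columns disagree: r1 and r3 give 15, but r2 gives 10.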
-egen, cut()- offers another solution:
egen ... = cut(mpg), at(10(5)45)
Being able to specify a numlist is nice here,
as compared with spelling out a comma-separated
list, but you _must_ add a limit here (45) which
will not be used; otherwise with
egen ... = cut(mpg), at(10(5)40)
your highest class will be missing (_not_ 40).
There was some discussion of this behaviour
on Statalist several months ago; although the original
authors of -cut()- (Michael Hills and David
Clayton) must have had a reason for implementing
-cut()- in this manner, which was echoed in the
adoption by Stata Corp, I don't find this behaviour
intuitive.
For some reason, I think of this 45 as like
the piece of meat the hero(ine) has to throw
to the guard dog to avoid being bitten (or
worse).
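The effect of leaving out the extra limit can be checked directly. A minimal sketch, assuming the auto data (in which the largest value of -mpg- is 41); with45 and without45 are illustrative names.

```stata
sysuse auto, clear

* extra upper limit 45 supplied: the class starting at 40 exists
egen with45 = cut(mpg), at(10(5)45)

* no extra limit: values of 40 or more fall outside every interval
egen without45 = cut(mpg), at(10(5)40)

list mpg with45 without45 if mpg >= 35, noobs
count if missing(without45)
```

The car with mpg of 41 should be classed as 40 in the first variable and as missing in the second.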
My favourite is none of these but
5 * floor(mpg/5)
Here -floor()- always rounds down to the largest integer
less than or equal to its argument. The name floor
is due to Kenneth E. Iverson, the principal architect
of APL, who introduced it in 1962.
As it happens
5 * int(mpg/5)
gives exactly the same result for -mpg- in the auto
data, but in general whenever variables may be
negative as well as positive,
interval * floor(expression/interval)
gives a more consistent classification.
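The difference only shows up with negative values, which -mpg- does not have. Here is a small sketch with made-up data (the variable x and its values are purely illustrative):

```stata
clear
input x
-7
-5
-2
0
2
5
7
end

* floor() rounds down everywhere, so every class has width 5
generate byfloor = 5 * floor(x/5)

* int() truncates towards zero, so the class around 0 is double width
generate byint = 5 * int(x/5)

list, noobs
```

With -floor()-, the value -2 goes to -5 and -7 goes to -10. With -int()-, both -2 and 2 land in class 0, so that everything strictly between -5 and 5 is lumped together.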
This solution needs a little thinking to appreciate,
but grows on one, and it has the merit that you don't need to
spell out all the possible end values (with the risk
of forgetting some or mistyping some). (-recode()-
and -egen, cut()- are not restricted to rounding
in equal intervals and of course remain useful for
more complicated problems.)
Without recapitulating the whole argument insofar
as it applies to rounding up, -floor()-'s sibling
-ceil()- (short for _ceiling_) is a nice way
of rounding up in equal intervals:
interval * ceil(expression/interval)
and is easier to work with than expressions
based on -int()-.
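A matching sketch for rounding up, assuming the auto data once more; up5 and down5 are illustrative names.

```stata
sysuse auto, clear

* round up and round down in multiples of 5
generate up5 = 5 * ceil(mpg/5)
generate down5 = 5 * floor(mpg/5)

* exact multiples of 5 stay put under both
list mpg down5 up5 if inlist(mpg, 14, 15, 16), noobs
```

Here 14 should go up to 15 and 16 up to 20, while 15 is left at 15 by both functions.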
I have written -egen- functions -down()-
and -up()- for which the calls would be (e.g.)
egen ... = down(mpg,5)
but I incline to thinking that there is
little pain and much gain in learning
how to do it with -floor()- and -ceil()-.
Any comments?
Nick
P.S. my proposal is to generalise
-floor()- so that it may take two
arguments, in which case
floor(expression, #)
is
# * floor(expression / #)
and similarly for -ceil()-.