Title | Calculating the maximum and minimum of a sequence | |
Author | Nicholas J. Cox, Durham University |
The maximum value of a variable seen so far in a sequence is the “record” to date, that is, at least when high values are hard to achieve (think of field events such as jumping or throwing). It will become apparent that the best solution to getting this is easily adapted to get the case of the minimum seen so far, the record when low values are hard to achieve (think of track events such as running).
For example, consider some numeric response y measured in a series of years, year, and get the data in year order:
. sort year
If you look at a series of values, how would you determine the series of records without Stata? One simple algorithm, in a mixture of Stata and pseudocode, runs like this:
initialize: record[1] = y[1]
The first value observed is the highest so far.
loop: forvalues this = 2(1)_N { if y[this] > record[this − 1] { record[this] = y[this] } else record[this] = record[this − 1] }
The loop runs over the integers from 2 to the number of observations _N. So we start the loop with this set to 2, and the operation becomes
if y[2] > record[1] { record[2] = y[2] } else record[2] = record[1]
and next time around the loop we set this to 3, and we get
if y[3] > record[2] { record[3] = y[3] } else record[3] = record[2]
and so on. With inequalities, here >, it is always a good idea to check whether the fraternal twin, here >=, is needed instead, but a moment’s thought shows that if the new value is the same as the record to date, we do not need to change the record. (If your problem had some extra details, such as an interest in precisely when the record was broken or equalled, you would need to take extra care.)
Now setting this up as a looping problem is, in at least one sense, a red herring, as you do not need to reach for any of Stata’s looping machinery, forvalues, foreach, or while. In fact, you definitely should not do that. We can exploit the fact that generate and replace use Stata’s sort order, made explicit in Newson (2004) and in the “FAQ: How can I replace missing values with previous or following nonmissing values?”.
The first thing we should do is initialize. We could do that by setting every value of record to y[1]
. generate record = y[1]
or just by setting the first value of record to y[1]
. generate record = y[1] in 1
Without any instructions on what to do about record[2] onwards, Stata can only set those values to numeric missing (.).
In other problems, it will matter how you start; we make sure that it does not matter here by overwriting every value from the second onwards.
You probably already know about a general way to refer to the previous observation in Stata, using the subscript _n − 1. _n stands in general for the current observation number, and so _n − 1 for the one before. We will want to replace the values of record in observations 2 onward, so we can pencil that in. One way is
. replace record = ... in 2/l
where 2/l signals observations from 2 until the last (l for last). If we had indeed previously written
. generate record = y[1] in 1
We could pencil in
. replace record = ... if record == .
or
. replace record = ... if missing(record)
A key point here is that the replace statement will implement the loop in the algorithm previously given, as replace follows the current sort order. We just need to fill in the dots (...) in the command above. The dots will be a Stata translation of
if y[this] > record[this - 1] { record[this] = y[this] } else record[this] = record[this - 1]
One way to do this is through the cond() function. cond() lets you do different things, depending on whether a specified condition is true or false, which is what we have here. Kantor and Cox (2005) give a tutorial with several examples.
Our true or false condition is
y[this] > record[this - 1]
and, as previously mentioned, this is _n in Stata terms when referring to observations. In Stata, this condition becomes
y[_n] > record[_n - 1]
But y[_n] can just be written y, as Stata by default always works on the current observation number (or, more precisely, all of them in sequence).
cond() works like this:
cond(true-or-false-condition, result if true, result if false)
With one bound we get a solution:
. generate record = y[1] in 1 . replace record = cond(y > record[_n - 1], y, record[_n - 1]) in 2/l
This looks like progress, but we are not quite there. We should worry about what happens with missing values. If you know that your data will never be missing, you need not worry.
If y[1] were missing, record[1] would be too, and the replace statement would never change our mind about the record, as nothing can be greater than missing, and missing values would get propagated down the dataset in a cascade.
Similarly, if any y in observations 2 onwards were missing, the record would be set to missing and that value propagated downwards. The second problem is easier to fix:
. replace record = cond(y > record[_n - 1] & y < ., y, record[_n - 1])
Hence, we should change y only when it is nonmissing. The first problem requires extra surgery, left as a small exercise for you to consider. (If you know about extended missing values .a through .z, you know that they complicate the story a little without changing essentials.)
However, at this point, I am going to switch to another solution, which indeed may have occurred to you much earlier. It is just tempting to look at a cond() solution, because cond() is a good general do-it-yourself function in which you spell out what you want given two possible outcomes.
The maximum of two values (the larger) in Stata is given directly by max(). This has a property which at first sight may seem surprising, but needs to be grasped. Convince yourself of it by
. display max(1,.) 1 . display max(.,.) .
In brief, max() ignores missing values. Only when all arguments supplied are missing does it show its own limited version of despair and return a missing value. If this seems surprising, it is because you have learned—sometimes painfully—that in Stata numeric missing is reckoned to be arbitrarily large, or at least bigger than any other number that Stata can hold. max() is an exception to that, or perhaps it is better to say that max() ignores missing values unless it has no alternative. Either way, max() solves, or dissolves, all our worries about missing values.
. generate record = y[1] in 1 . replace record = max(y, record[_n - 1]) in 2/l
or
. generate record = y[1] in 1 . replace record = max(y, record[_n - 1]) if missing(record)
That is it. Running through the worries again:
If record[1] is missing because y[1] is missing, the first nonmissing value encountered is still used as the first nonmissing record because max(missing, nonmissing) is always nonmissing.
If any record from observation 2 on is missing, this does not matter for the same reason. So a missing value in the sequence will not contaminate the values of record perversely.
Two complications are easy to deal with. Suppose that you have panel (longitudinal) data. How do you deal with that? The answer is just to use by:, as in something like
. by id(year), sort: generate record = y[1] if _n == 1 . by id: replace record = max(y, record[_n - 1]) if missing(record)
One key feature of by: is that subscripts like [1] and [_n − 1] are interpreted within the groups defined by by:. If you want more on that, that is another story. See the sections of the manual on by: or Cox (2002).
Finally, if you want minimums and not maximums, use min() and not max().
All that said, since 2016 and for Stata 11+, the maximum so far or the minimum so far can be calculated directly with the community-contributed program rangestat, which should be installed from SSC using ssc inst rangestat. An example you can try yourself
. webuse grunfeld, clear . rangestat (max) mvalue, int(year . 0) by(company)
gives the flavor. Here the time interval is given by the option int(year . 0). year is the variable concerned and . (system missing) and 0 are negative and positive offsets. System missing as negative offset is here interpreted as including the earliest date in the data. 0 is just zero as usual, so the interval is a specification of "so far". The option by(company) ensures that calculations are carried out separately for each panel. The help for rangestat gives detailed discussion with several worked examples.
(A tip on Stata Tips: nonsubscribers to the Stata Journal can purchase a compact collation of the first 119 tips: see www.stata.com/bookstore/stata-tips.)