Title | Generating the last date | |
Author | Nicholas J. Cox, Durham University, UK |
I have data with multiple dates. I would like to generate a new date variable containing the last date for each record. There are several missing values for these dates.
Questions like this arise in various forms, but when discussing possible answers, it is assumed that your dates are held as Stata dates. For pertinent help within Stata, start at dates and times. The background on dates in Stata is explained in more detail in [U] 24 Dealing with dates and times.
The fact that you are dealing with dates does not, for once, complicate this question. The last date is simply the maximum date. Although you should always be careful when missing values are present, you can rely on Stata's maximum functions to do the smart thing about missings. Even though in Stata the numeric missing is treated as higher than any other numeric value, the maximum is reported as missing if and only if all values are missing.
First, suppose various dates are held as distinct variables and you want to get a row-wise maximum; that is, for each observation across variables. Use egen
. egen lastdate = rowmax(date variables)
or just the maximum function (see functions for more functions) with generate
. gen lastdate = max(date variables separated by commas)
The rowmax() function is the same as rmax() in Stata 8.
In terms of the result, the choice here is immaterial. Occasionally, when using the max() function, spelling out all variable names and separating them by commas can be a little inconvenient. You might also note for future reference that egen has other functions for manipulations across variables.
Next, suppose each record in the dataset consists of one or more observations and the record for each person (company, site, whatever) is uniquely coded by an identifier, say, id. To get a maximum across groups of observations, use another egen function:
. by id: egen lastdate = max(date variable)
Here the specification by id ensures that the calculation is carried out separately for each identifier. Note also that the max() function of egen differs from the general max() function used with generate or replace.
One detail you need to take care of yourself is attaching a date format to the new last date variable. That is, the new date variable does not inherit the format of the variables from which it is calculated. Thus for a format %tdd_m_y, type
. format lastdate %tdd_m_y
The first date of several is equally easy in Stata. The first date is, as will be clear by now, the minimum date. If anything, it is easier because, in addition to an egen way with min() or rowmin(), there are approaches from first principles, most notably for groups of observations on one variable date:
. by id (date), sort: gen firstdate = date[1]
To unpack this command line, Stata sorts on id first and then within id on date. Then, for each distinct id, the new variable firstdate is calculated as the first value, which is the minimum.
The equivalent for last date calculations,
. by id (date), sort: gen lastdate = date[_N]
will not be what you want: any missing values will end up as the last date for the ids for which they occur. (Recall that any numeric missing is treated as higher than any other numeric value, and so gets sorted to the end.) That can be fixed, but most users will find egen with max() more convenient here.
P.S. There is a tutorial on by:, _n, and _N manipulations in (Cox 2002), which explains, among other things, the special interpretation of [1] and [_N] under by:.