Title: Identifying runs of consecutive observations in panel data
Authors: Nicholas J. Cox, Durham University, UK; Vince Wiggins, StataCorp
I have panel data with some gaps. I want to look systematically at runs of consecutive observations, especially the length of the longest run in each panel. How do I do this?
As so often happens, there is a direct solution to this problem making use of Stata’s built-in features, and a canned convenience program that encapsulates some of the basic tricks in the neighborhood. We will describe both approaches.
Stata’s jargon of panel data borrows one of many possible terminologies. Depending on your field, you may prefer to think in terms of each patient, firm, country, station, site, or whatever else it is for which you have each separate time series. For more background, see help tsset or [TS] tsset.
First, suppose that you have tsset your panel data by some command like
. tsset id time
This command declares the basic structure of your data with a panel identifier id and a time variable time to Stata’s time-series commands. It also allows you full use of appropriate features, including time-series operators, ensuring, in particular, that they work properly when there are gaps in observations.
Suppose, for example, that we have observations for one panel with times
1, 2, 3, 5, 6, 7, 8, 9, 11, 12
Then we have three runs of consecutive observations
1, 2, 3     5, 6, 7, 8, 9     11, 12
and the longest has length 5. There are gaps before the observations with times 5 and 11.
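To follow along, the example panel above can be entered directly. This is only a sketch: the identifier value 1 is an arbitrary choice, and the variable names id and time simply match those used in the tsset call shown earlier.

```stata
* Enter the example panel: times 1 to 12 with 4 and 10 absent
clear
input id time
1 1
1 2
1 3
1 5
1 6
1 7
1 8
1 9
1 11
1 12
end

* Declare the panel identifier and the time variable
tsset id time
```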
Here is a complete solution from first principles, which we will unpack in a moment:
. gen run = .
. by id: replace run = cond(L.run == ., 1, L.run + 1)
. by id: egen maxrun = max(run)
The main idea is to exploit the fact that if there is a gap before any observation, as before the observations with times 5 and 11 above, then
L.varname
is missing for any numeric variable you care to specify. It’s also true that L.varname[1] is always treated as missing. Since there is no observation before the first, Stata certainly has no idea about its contents. (Or perhaps, Stata uncertainly has no idea....)
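A quick way to see this behavior is to flag the observations whose lagged time is missing. The sketch below rebuilds the example panel in a compact way; the variable name newrun is purely illustrative.

```stata
* Rebuild the example: times 1 to 12 with 4 and 10 dropped
clear
set obs 12
gen id = 1
gen time = _n
drop if inlist(time, 4, 10)
tsset id time

* L.time is missing wherever the previous time point is absent:
* at the first observation and immediately after each gap
gen byte newrun = missing(L.time)
list time newrun, sep(0)

* Only the observations at times 1, 5, and 11 should be flagged
assert newrun == inlist(time, 1, 5, 11)
```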
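We generate a new variable run containing missing values so that it will exist for our next step. Then we replace run with the rule, implemented by a call to the cond() function: if the previous value of run is missing, run is set to 1 (a new run starts); otherwise, run is set to the previous value of run plus 1 (the current run continues).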
Here we can rely on Stata to generate or replace observations in the current sort order (and, moreover, for any use of time-series operators to work, the data must be in tsset order). See, for example, Newson (2004) or the FAQ entitled How can I replace missing values with previous or following nonmissing values?.
Anyway, given times
1, 2, 3     5, 6, 7, 8, 9     11, 12
the rule replaces the variable run with values
1, 2, 3     1, 2, 3, 4, 5     1, 2
because counting restarts after a gap. The by id: in
. by id: replace run = cond(L.run == ., 1, L.run + 1)
flags that we do this separately for each panel. In fact, the by id: is for our benefit rather than Stata’s, as the right-hand side is automatically calculated separately for each panel. That is, L.run means the previous value of run for this panel whenever panels have been specified. However, it is important that we specify by id: within
. by id: egen maxrun = max(run)
as egen takes no automatic account of separate panels. Clearly, we could look at other properties of panel lengths using egen or other commands, depending on what was of interest.
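As one possible illustration of such other properties, the sketch below applies the solution to the example panel and then counts the runs in each panel and numbers them. The names nruns and runid are assumptions made for this example, not part of the original solution.

```stata
* Rebuild the example: times 1 to 12 with 4 and 10 dropped
clear
set obs 12
gen id = 1
gen time = _n
drop if inlist(time, 4, 10)
tsset id time

* The solution from the text
gen run = .
by id: replace run = cond(L.run == ., 1, L.run + 1)
by id: egen maxrun = max(run)

* Number of runs in each panel: every run starts with run == 1
by id: egen nruns = total(run == 1)

* Running index of the current run within each panel
by id: gen runid = sum(run == 1)

list time run runid nruns maxrun, sep(0)

* For the example panel: three runs, the longest of length 5
assert maxrun == 5 & nruns == 3
```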
The general case of time series with separate panels also collapses nicely to the special case of one panel, in which we need not bother to specify any panel identifier. The commands
. tsset time
. gen run = .
. replace run = cond(L.run == ., 1, L.run + 1)
produce a new variable recording sequence in run. Later,
. egen maxrun = max(run)
would work fine, but the new variable would contain the same constant in every observation, so looking at run directly, say, with summarize, would be better.
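For a single series, a minimal sketch of that approach might look as follows, using the same example times and reading the maximum from the results that summarize leaves behind in r(max).

```stata
* One series only: times 1 to 12 with 4 and 10 dropped
clear
set obs 12
gen time = _n
drop if inlist(time, 4, 10)
tsset time

gen run = .
replace run = cond(L.run == ., 1, L.run + 1)

* The maximum of run is the length of the longest run
summarize run, meanonly
assert r(max) == 5
display "longest run: " r(max)
```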
A community-contributed program, tsspell, which may be downloaded using ssc, can solve this problem and several others based on subdividing time series. (If ssc does not work in your Stata, see the FAQ: Stata 7: I see references to the findit and ssc commands on Statalist, but my Stata does not recognize these commands. What should I do?.)
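Assuming a Stata with a working ssc command and an internet connection, installation would typically be:

```stata
* Download and install tsspell from SSC, then open its help file
ssc install tsspell
help tsspell
```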
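tsspell examines the data, which must be tsset time series, to identify spells or runs, which are contiguous sequences defined by some condition. tsspell generates three new variables: one identifying distinct spells, one giving the sequence of each observation within its spell, and one indicating whether an observation is the last in its spell (0 or 1).
By default, these variables will be called _spell, _seq, and _end.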
If the data are panel data, all operations are automatically performed separately within panels.
There are four ways of defining spells in tsspell.
First, given
tsspell varname
a new spell starts whenever varname changes. Strictly, the condition is
(varname != L.varname) | (_n == 1)
Here the condition _n == 1 is protection against the possibility that the first value is missing.
Second, a new spell starts whenever some condition defining the first observation in a spell is true. A spell ends just before a new spell starts. Such a condition may be specified by the fcond() option. Spells started by earthquakes, eruptions, accidents, revolutions, elections, births, or other traumatic events may often be defined in this general way.
The problem of runs (or spells) of consecutive observations is an example. A new spell starts whenever L.varname is missing, which, as said, works for the first observation as well.
. tsspell, f(L.time == .)
sets up the spells, after which maximum length of run is calculated as before:
. by id: egen maxrun = max(_seq)
For the example above, the result is indicated by
. list time _spell _seq _end maxrun

     +--------------------------------------+
     | time   _spell   _seq   _end   maxrun |
     |--------------------------------------|
  1. |    1        1      1      0        5 |
  2. |    2        1      2      0        5 |
  3. |    3        1      3      1        5 |
  4. |    5        2      1      0        5 |
  5. |    6        2      2      0        5 |
     |--------------------------------------|
  6. |    7        2      3      0        5 |
  7. |    8        2      4      0        5 |
  8. |    9        2      5      1        5 |
  9. |   11        3      1      0        5 |
 10. |   12        3      2      1        5 |
     +--------------------------------------+
Although in this example we have results for only one panel, other panels would be treated separately.
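The whole tsspell route can be sketched from start to finish as below. It assumes tsspell is already installed, and it adds one step that is not in the text above: recording the length of each individual spell in a variable here called length (an illustrative name).

```stata
* Assumes tsspell is installed (ssc install tsspell)
* Rebuild the example: times 1 to 12 with 4 and 10 dropped
clear
set obs 12
gen id = 1
gen time = _n
drop if inlist(time, 4, 10)
tsset id time

* A new spell starts wherever the previous time point is absent
tsspell, f(L.time == .)
by id: egen maxrun = max(_seq)

* Length of each spell, repeated on every observation in that spell
egen length = max(_seq), by(id _spell)

* Show one line per spell: its last observation, flagged by _end
list id time _spell length if _end, sep(0)
```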
Third, spells are defined by some condition being true for every observation in the spell. A spell ends when that condition becomes false. Such a condition may be specified by the cond() option.
Fourth, a special but useful case of the previous kind is
cond(varname > 0 & varname < .)
That is, values of varname are positive (but not missing). Given daily data, spells of rain are defined by there being some rainfall every day. As a convenience, such conditions may be specified by pcond(varname), or more generally, pcond(expression).
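A small sketch of the rainfall case follows. The data are made up for illustration, the variable names day and rain are assumptions, and tsspell must already be installed. It also relies on the tsspell help file's description that observations outside any spell receive 0 in _seq.

```stata
* Assumes tsspell is installed; rainfall values are invented for illustration
clear
input day rain
1 0
2 1.2
3 0.4
4 0
5 3.1
6 2.0
7 0.7
8 .
9 0.5
end
tsset day

* Spells are runs of days with positive, nonmissing rainfall
tsspell, pcond(rain)
list day rain _spell _seq _end, sep(0)

* With _seq equal to 0 outside spells, its maximum is the
* length in days of the longest wet spell
summarize _seq, meanonly
display "longest wet spell (days): " r(max)
```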
We will wrap up by mentioning other rules applied by tsspell:
For other examples applying tsspell, please see its help file.