Kit is right that this code is wrong:
one parenthesis was misplaced, which is
a big deal in this case. It should be,
I think,
(1) gen spell = sum(L.time == .)
(2) bysort firm spell : gen length = _N
(3) bysort firm (length time) : keep if spell == spell[_N]
Sorry about that.
Perhaps this code looks (a) bizarre enough and
also (b) dependent enough on some very Stataish features
to deserve a longer explanation.
(0) We're presupposing panel data that have been
-tsset-.
(1) L.time == . is a necessary and sufficient
criterion for the start of a consecutive spell
of observations (same panel, and time goes up by
1 from one to the next), as there is no
observation in the data for the -time- just
before the first observation in each such spell.
Helpfully, this applies also to the very
first observation in each panel, so we've
no worries about boundary cases (or spill-overs
from one panel to the next). This is
official Stata magic, given the way that
-tsset- and operators like -L.- work.
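To see that behaviour directly, here is a small sketch.
The data are made up, and the variable -lagtime- exists
only for this illustration.

```stata
* Toy panel data: firm 1 has a gap at time 3
clear
input firm time
1 1
1 2
1 4
2 1
2 2
end
tsset firm time

* L.time is missing whenever there is no observation for
* the previous time in the same panel: here at firm 1,
* time 1 (first observation); firm 1, time 4 (after the
* gap); and firm 2, time 1 (no spill-over from firm 1)
gen lagtime = L.time
list, sepby(firm)
```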
L.time == .
evaluates to 1 if true and 0 if false,
so if you look at the results of that,
the spells for one panel look something like this,
with gaps inserted just for emphasis,
1
0
0
0

1
0
0

1
and the cumulative -sum()- makes that
1
1
1
1

2
2
2

3
so that the consecutive series
now are helpfully identified in blocks.
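A sketch that reproduces the pattern above on made-up data
(one firm, observed at times 1-4, 6-8 and 10). The
intermediate variable -start- is only for illustration; the
one-line version in (1) does the same thing.

```stata
* Toy data: three spells, of lengths 4, 3 and 1
clear
input firm time
1 1
1 2
1 3
1 4
1 6
1 7
1 8
1 10
end
tsset firm time

* 1 at the first observation of each spell, 0 otherwise
gen byte start = L.time == .

* the running sum numbers the spells 1, 2, 3
gen spell = sum(start)

list, sepby(spell)
```

Note that -sum()- keeps running across panels, so with
several firms the spell numbers do not restart at 1 for
each firm. That does no harm, as all we need is that each
spell has its own identifier.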
(2) The length of each spell is just
the number of observations in it. _N
counts within groups defined by -by:-
in this instance.
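As a sketch on the same kind of made-up data (times 1-4,
6-8 and 10 for a single firm), -length- should come out as
4 for the first spell, 3 for the second and 1 for the third.

```stata
* Toy data: three spells, of lengths 4, 3 and 1
clear
input firm time
1 1
1 2
1 3
1 4
1 6
1 7
1 8
1 10
end
tsset firm time
gen spell = sum(L.time == .)

* _N under -by:- is the number of observations in each
* group, here each spell within each firm
bysort firm spell : gen length = _N

list, sepby(spell)
```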
(3) If we -sort- the longest spell
to the end of each panel, then the
example above will get mapped to
3

2
2
2

1
1
1
1
and our criterion for keeping
a spell is then just that -spell-
has the same value as the very
last value for each panel.
Not evident in this explanation,
but included in the code, is
sorting the _latest_ longest
spell to the end if there are
two or more spells of the same
maximum length.
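A sketch of the tie-breaking on made-up data: firm 1 has
two spells of length 2 (times 1-2 and 4-5) and one of
length 1 (time 7); firm 2 has a single spell of length 3.

```stata
* Toy data with a tie for the longest spell in firm 1
clear
input firm time
1 1
1 2
1 4
1 5
1 7
2 1
2 2
2 3
end
tsset firm time
gen spell = sum(L.time == .)
bysort firm spell : gen length = _N

* Within each firm, sort on length and then on time, so
* that the last observation belongs to the latest of the
* longest spells; keep only that spell
bysort firm (length time) : keep if spell == spell[_N]

* Firm 1 keeps times 4 and 5; firm 2 keeps times 1 to 3
list, sepby(firm)
```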
(4) [not here] This needs some
thought just in case there are
observations for which -time- is missing.
I haven't done that.
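One cautious possibility, which is only a sketch and not
a worked-out answer to (4), is to check first that the
problem does not arise. The data are made up.

```stata
* Toy data; firm and time as in the code above
clear
input firm time
1 1
1 2
1 4
end

* -assert- stops with an error if any observation has
* missing -time-, so the spell code is never run on data
* for which its behaviour has not been thought through
assert !missing(time)
tsset firm time
```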
P.S. What's the difference between this
code and the buggy code Kit used?
bysort firm length (time) : keep if spell == spell[_N]
keeps the (latest) longest spell of
each distinct length for each firm. Given "each
distinct length", "longest" is redundant as
a criterion, but I (and also Kit) get what we asked for.
bysort firm (length time) : keep if spell == spell[_N]
keeps the (latest) longest spell for each firm.
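As a check, here is a sketch that runs both versions on
the nine observations for firm 10404 that Kit lists below.
-preserve- and -restore- are used only so that both
versions start from the same data.

```stata
clear
input npermno year ita
10404 1987 .0511182
10404 1989 .0159272
10404 1990 .0455364
10404 1992 .0097333
10404 1993 .0231792
10404 1995 .0534575
10404 1996 .0622322
10404 2002 .0337511
10404 2003 .0296446
end
tsset npermno year
gen spell = sum(L.year == .)
bysort npermno spell : gen length = _N

* Buggy version: -length- is a -by- variable, so the latest
* spell of each distinct length (1 and 2) is kept:
* years 1987, 2002 and 2003
preserve
bysort npermno length (year) : keep if spell == spell[_N]
list npermno year ita
restore

* Corrected version: -length- is only a sort key, so just
* the latest longest spell is kept: years 2002 and 2003
bysort npermno (length year) : keep if spell == spell[_N]
list npermno year ita
```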
The following FAQ by Vince Wiggins and another
says more on L.time == . as a criterion
for the start of a spell of consecutive
observations.
http://www.stata.com/support/faqs/data/panel.html
That then goes through another solution. It
uses -egen-, which in refined Stata circles is
about as stylish as waltzing with muddy boots on.
P.P.S. the reference to "Red Sox" sounds like
an allusion to some local sporting trivium.
The FAQ says
"Statalist is an international list.
Please explain details that may make sense
only in your own corner of the world."
Nick
Kit Baum wrote:
> Nick said
>
> More instructive in some ways is to
> do it from scratch, with no use of user
> add-ons. Something like
>
> gen spell = sum(L.time == .)
> bysort firm spell : gen length = _N
> bysort firm length (time) : keep if spell == spell[_N]
>
> Nice, except that it does not work:
>
> |---------------------------|
> 358. | 10404 1987 .0511182 |
> 359. | 10404 2002 .0337511 |
> 360. | 10404 2003 .0296446 |
> |---------------------------|
>
> This firm has the original observations
>
> +---------------------------+
> | npermno year ita |
> |---------------------------|
> 358. | 10404 1987 .0511182 |
> 359. | 10404 1989 .0159272 |
> 360. | 10404 1990 .0455364 |
> 361. | 10404 1992 .0097333 |
> 362. | 10404 1993 .0231792 |
> 363. | 10404 1995 .0534575 |
> 364. | 10404 1996 .0622322 |
> 365. | 10404 2002 .0337511 |
> 366. | 10404 2003 .0296446 |
> +---------------------------+
>
> By coincidence, I have been working during the last 24 hours on an
> ado-file that does this "keep longest streak", but does it listwise for
> an entire variable list, as is required by some matrix software (that
> is, we need to generate the longest streak for which NONE of these
> variables are missing). It also deals with the case, as above, where
> the longest streak is tied; as an earlier posting suggests, the latest
> streak should be retained (which is what my code does). I'm pretty sure
> that it works, but I have given it to Nick to see if he can break it.