Nisha Malhotra posted a panel data problem which
attracted a flurry of overlapping answers.
An acceptable solution should emerge from them
once Nisha has sorted out whether the jump
should take place at or after the first
action, and what is appropriate for the very
first value in a panel (for which previous
conditions are unknown, at least to Stata).
I want to expand on a point arising from that
thread which is much more general and can bite
you (and you won't always notice). Let's abstract
to a structure with a panel identifier -id- and
a time variable -time-.
The problem is with code like this:
. sort id
. by id : gen <whatever>
which Stata 7 users can happily telescope to
. bysort id : gen <whatever>
The way this arises is that
(1) you want to do something separately
for each panel
and
(2) you know that Stata requires a prior
-sort- for that, so you oblige. (More
than courtesy here: it's the law.)
What's tricky is that the code often
should be
. sort id time
. by id : gen <whatever>
or the equivalent
. bysort id (time) : gen <whatever>
-- whenever, that is, you also want
observations within each panel to be in
time order. Even when within-panel order
is irrelevant to what you want, as when, say, you are
computing means, sorting on -time- as well rarely does any harm.
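As a minimal sketch of why within-panel order matters, here is a toy panel whose values I have made up and entered out of time order on purpose. The variable -x- and the new variable -prev_x- are hypothetical names used only for illustration. A subscript such as [_n-1] picks up whatever observation happens to precede the current one, so it only means "previous time" if the panel is in time order.

```stata
* Toy panel: 2 ids x 3 times, deliberately entered out of time order.
* All values are invented for illustration.
clear
input id time x
1 3 30
1 1 10
2 2 200
1 2 20
2 1 100
2 3 300
end

* Sort by id, and by time within id, in a single step,
* then take the value from the previous observation in each panel.
bysort id (time) : gen prev_x = x[_n-1]

* The first observation in each panel has no predecessor,
* so prev_x is missing there.
list id time x prev_x, sepby(id)
```

With only -bysort id- in place of -bysort id (time)-, the same command would still run without complaint, but -prev_x- would not be guaranteed to hold the value from the previous time.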
What underlies all this is the literal-mindedness
of Stata, which does what you say, not what
you mean. Given the instruction
. sort id
Stata will be satisfied with _any_ ordering
of observations for which -id- is sorted,
and there are usually lots of possibilities,
as some combinatorial calculations will confirm.
Stata does not care about any other point.
Indeed, having done what you asked, it sits there
smirking.
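To put a number on "lots of possibilities": if nothing else constrains the order, a panel with n observations can be arranged in n! ways, and the panels vary independently, so the count of orderings consistent with -sort id- is the product of those factorials. The panel sizes below (3 panels of 5 observations each) are assumed purely for illustration, giving (5!)^3 = 120^3 = 1,728,000.

```stata
* Number of orderings that satisfy -sort id- alone when each of
* 3 panels has 5 observations: (5!)^3. Panel sizes are hypothetical.
* round() guards against tiny floating-point error in exp(lnfactorial()).
display %12.0fc round(exp(lnfactorial(5)))^3
```

Only one of those orderings per data set is the one in time order within every panel.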
Now it is often the case in practice that panel data
will come in order of -id- and then -time-,
or will be left that way after a previous command.
And, increasingly, it is a standard
that Stata commands should not
change the -sort- order of your
data unless you explicitly specify
that or it is among the purposes
of a command. So no harm may ensue.
But -- as said, and this is the crunch --
Stata makes absolutely no promises about
order of observations within each block defined
by -id- (or within any other varlist
given as argument to -sort-). So there
is a possibility that operations dependent
on within-panel order will give incorrect results.
In the problem here, operations based on
the -sum()- function are a case in point.
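A small sketch of the -sum()- case, again with invented data. The names -x-, -u-, -cum_risky- and -cum_safe- are hypothetical, and -runiform()- assumes a reasonably recent Stata. The rows are scrambled first to mimic data that are no longer in time order. What -sort id- then leaves within each panel is not something to rely on, which is exactly the point.

```stata
* Toy panel with invented values.
clear
input id time x
1 1 1
1 2 2
1 3 3
2 1 10
2 2 20
2 3 30
end

* Scramble the rows to mimic data no longer in time order.
set seed 12345
gen double u = runiform()
sort u

* Sorted on id only: within-panel order is whatever Stata leaves.
sort id
by id : gen cum_risky = sum(x)

* Sorted on id and on time within id: the running sum
* accumulates in time order.
bysort id (time) : gen cum_safe = sum(x)

* Compare the two running sums side by side.
list id time x cum_risky cum_safe, sepby(id)
```

Both variables use the same values of -x- within each panel. They can disagree at intermediate observations whenever the two sort orders differ within a panel.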
With panel data there is another and
in many ways a better approach. -tsset-
your data and use time-series operators.
Then given some initial
. tsset id time
any later
. tsset
will automatically return panel data
to the correct sort order, so that
. by id: ...
is then guaranteed to work on the
correct within-panel order. In
addition, Stata refuses to
do calculations based on operators
such as L. unless the data are in the
correct sort order, which gives
you a safety catch. Nor, with
operators like L., do you
need to specify separate
calculations within panels:
that is done automatically
once the data have been -tsset- as panel data.
-sum()-, however, has nothing
to do with time series as such. It
long predates specific time series
syntax in Stata and indeed stands outside
that framework.
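Putting the -tsset- route together, as a sketch under the same assumptions as before (invented data; -x-, -u-, -lag_x- and -cum_x- are hypothetical names; -runiform()- assumes a reasonably recent Stata):

```stata
* Toy panel with invented values.
clear
input id time x
1 1 1
1 2 2
1 3 3
2 1 10
2 2 20
2 3 30
end

* Declare the panel structure once.
tsset id time

* Lag within panel: no -by- prefix is needed, and the first
* observation in each panel gets a missing value.
gen lag_x = L.x

* Suppose some later step disturbs the sort order.
set seed 12345
gen double u = runiform()
sort u

* A bare -tsset- puts the data back in id, time order ...
tsset

* ... so an order-dependent calculation under -by- is safe again.
by id : gen cum_x = sum(x)

list id time x lag_x cum_x, sepby(id)
```

The bare -tsset- is the step doing the work for -sum()-: the function itself knows nothing about the time variable and just accumulates in whatever order it finds.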
Nick
[email protected]