st: Selection of panels
A problem that has arisen repeatedly in some recent threads is selection
of panels according to whether something did or did not happen during
the history of each panel. For "panels" read in general "blocks of
observations", but I will keep pretty close to the specific questions
posted.
Example 1 --------------------------------------------------------------
Richard Pitman wanted to select patients who were within -class- 3 at
-week- 0. (Assume an identifier -id-.)
------------------------------------------------------------------------
Phil Ryan and Maarten Buis had a neat -reshape- solution, but it leaves
two problems in its wake. Often, you would want to go back to the
original dataset for the next analysis. Also, the next time you have a
similar problem, you have the same need to restructure the dataset.
Recourse to -reshape- means a lot of back and forth in this situation,
so there is scope for another solution.
First, focus on selection
... if class == 3 & week == 0
The essence of the problem is that this condition selects observations,
not panels. But this condition is where to start. Consider a variable
gen byte select = class == 3 & week == 0
This is 1 when -class == 3 & week == 0- and 0 otherwise. All we need do
is spread any 1s to the other observations in the same panel, and that
is one line:
by id (select), sort : replace select = select[_N]
Sorting within panels according to -select- pushes any occurrences of 1
to the end, and we thus overwrite any 0s by 1s whenever there is a 1
within the panel.
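As a check on the logic, here is a minimal sketch with made-up data
(three hypothetical patients; the values are invented purely for
illustration). Patient 2 reaches class 3 only after week 0, and so
should not be selected.

```stata
* Toy data: three patients, three weeks each (values are made up)
clear
input id week class
1 0 3
1 1 2
1 2 2
2 0 1
2 1 3
2 2 3
3 0 3
3 1 3
3 2 1
end

* Flag the qualifying observations ...
gen byte select = class == 3 & week == 0

* ... then spread any 1 to every observation in the same panel
by id (select), sort : replace select = select[_N]

* Put the data back in time order before listing
sort id week
list, sepby(id)
```

Patients 1 and 3 should show -select- equal to 1 in every observation,
and patient 2 should show 0 throughout.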
So
... if select
now selects panels. We can do this even more concisely:
by id, sort : egen select = max(class == 3 & week == 0)
... if select
The step which most people seem to find hardest here is realising that
-egen, max()-, like many other -egen- functions, can feed on an
expression, which can be more complicated than a variable name.
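To illustrate -egen- functions feeding on expressions, here is a small
sketch with made-up data. The variable name -n_class3- is just an
assumption for the example; -total()- is used as a second -egen-
function that accepts an expression.

```stata
* Toy data: three patients, two weeks each (values are made up)
clear
input id week class
1 0 3
1 1 2
2 0 1
2 1 3
3 0 3
3 1 3
end

* max() of a true-or-false expression:
* 1 if the condition is ever true within the panel, 0 otherwise
bysort id : egen select = max(class == 3 & week == 0)

* total() of a true-or-false expression:
* the number of weeks each patient spent in class 3
bysort id : egen n_class3 = total(class == 3)

sort id week
list, sepby(id)
```

Note that patient 2 has -n_class3- of 1 but -select- of 0, because the
week in class 3 was not week 0.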
Example 2 --------------------------------------------------------------
Sara Khan wanted to keep the longest spell of employment if it was
followed by an -exit- of 1 on the next -wave- (or the end of the panel).
------------------------------------------------------------------------
Scott Merryman provided a neat solution, although it seemed to hinge on
an assumption that the first observation in each panel couldn't have an
-exit- of 1.
-egen- also offers a solution here.
Sara's -longspl- is an indicator which is 1 within the longest spell and
0 otherwise. The end (last -wave-) of the longest spell is thus
bysort id (wave) : egen end = max(cond(longspl, wave, .))
Sorting on -wave- within panel isn't crucial at this point, but will be
needed for the next step.
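Whether the next observation has an -exit- of 1 (or lies beyond the end
of the panel) is then
bysort id (wave) : gen next_exit = exit[end + 1] > 0
Repeating the sort costs little and guards against -egen- having changed
the sort order.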
You can then select by
keep if longspl & next_exit
The code here is fairly compressed, so you might like some extra
comment. Skim or skip ahead if all is obvious.
For Sara's two examples, -end- will be 6 and 12, and so -exit[end + 1]-
will be -exit[7]- and -exit[13]-. (You shouldn't be surprised that an
expression including a variable name can appear in a subscript. That
is on all fours with constructs like -varname[_n - 1]- and
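-varname[_n + 1]-.) Both values count as greater than 0: -exit[7]-
because it is 1, and -exit[13]- because it doesn't exist and so is
returned by Stata as missing, which Stata treats as larger than any
number. Subscripting under -by:- is within panel, and I'm assuming that
Sara's -wave-s run 1, 2, 3, ... with no gaps, so that -wave- equals the
observation number within each panel.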
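The steps above can be tried on a small made-up dataset. This is only a
sketch: the three panels and their values are invented, and -longspl- is
typed in directly rather than computed. Waves run from 1 with no gaps,
as assumed in the text.

```stata
* Toy data (made up): longspl marks the longest spell in each panel
clear
input id wave longspl exit
1 1 1 0
1 2 1 0
1 3 1 0
1 4 0 1
2 1 0 0
2 2 1 0
2 3 1 0
2 4 1 0
3 1 1 0
3 2 1 0
3 3 0 0
3 4 0 1
end

* Last wave of the longest spell, copied to every observation in the panel
bysort id (wave) : egen end = max(cond(longspl, wave, .))

* Is the observation just after the spell an exit, or beyond the panel?
* (a subscript beyond the panel returns missing, and missing > 0 is true)
bysort id (wave) : gen byte next_exit = exit[end + 1] > 0

list, sepby(id)
keep if longspl & next_exit
```

Panel 1 is kept because its spell is followed by an exit, panel 2
because its spell runs to the end of the panel, and panel 3 is dropped
because the observation after its spell has -exit- of 0.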
bysort id (wave) : egen end = max(cond(longspl, wave, .)) (1)
is a twist on the more obvious
bysort id (wave) : egen end = max(wave) if longspl (2)
or even
bysort id (wave) : egen end = max(longspl * wave) (3)
(2) leaves missings in observations that do not satisfy the -if-. That
can be fixed, but (1) is more general and gets you there directly. (3)
would work fine in this problem because the biggest
1 * wave
is always going to be bigger than the biggest
0 * wave.
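A side-by-side comparison may help. The sketch below uses made-up data
and the hypothetical names -end1-, -end2- and -end3- for the results of
(1), (2) and (3) respectively.

```stata
* Toy data (made up): two panels, three waves each
clear
input id wave longspl
1 1 1
1 2 1
1 3 0
2 1 0
2 2 1
2 3 1
end

* (1) cond() inside max(): result in every observation of the panel
bysort id (wave) : egen end1 = max(cond(longspl, wave, .))

* (2) if qualifier: result only where longspl is 1, missing elsewhere
bysort id (wave) : egen end2 = max(wave) if longspl

* (3) product: works here because every wave is positive
bysort id (wave) : egen end3 = max(longspl * wave)

sort id wave
list, sepby(id)
```

In the listing, -end1- and -end3- should agree everywhere, while -end2-
should be missing in the observations with -longspl- of 0.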
------------------ aside: use of -egen, max(cond()) and min(cond())-
Only very recently have I tumbled to the trickery possible with (1).
This realisation is expressed accessibly in recent updates to two FAQs:
How can I drop spells of missing values at the beginning and
end of panel data? (with Gary Longton)
http://www.stata.com/support/faqs/data/dropmiss.html
How can I identify first and last occurrences systematically in
panel data?
http://www.stata.com/support/faqs/data/firstoccur.html
---------------------------------------------------------
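As a small illustration of the -min(cond())- counterpart, here is a
sketch with made-up data. It is not taken from the FAQs; the names
-first_wave-, -last_wave- and -naive_first- are assumptions for the
example. It also shows why the product trick in (3) does not carry over
to minima.

```stata
* Toy data (made up): two panels, three waves each
clear
input id wave longspl
1 1 1
1 2 1
1 3 0
2 1 0
2 2 1
2 3 1
end

* First and last waves of the longest spell
bysort id : egen first_wave = min(cond(longspl, wave, .))
bysort id : egen last_wave  = max(cond(longspl, wave, .))

* The product form fails for minima: any observation outside the
* spell contributes 0 * wave = 0, which is then the minimum
bysort id : egen naive_first = min(longspl * wave)

sort id wave
list, sepby(id)
```

Here -first_wave- should be 1 and 2 for the two panels, whereas
-naive_first- should be 0 for both.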
Example 3 --------------------------------------------------------
Another of Sara's problems was to -drop- panels in which the person
started out unemployed. Again, a panel is to be selected or not
selected according to one observation in it.
-------------------------------------------------------------------
David Kantor (and, in essence, Sergiy Radyakin) had an excellent solution
bysort id (wave): drop if empstat[1] == "not emp"
This is much more direct than
bysort id (wave) : egen select = max(empstat[1] == "not emp")
drop if select
but the last shows that the -egen, max(<condition>)- technique can cope
with various different challenges.
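If a flag is wanted rather than an immediate -drop-, one variant is to
build it with -generate- under -by:-, where subscripts refer to
positions within the panel. This is only a sketch with made-up data, and
-start_unemp- is a hypothetical variable name.

```stata
* Toy data (made up): three people, two waves each
clear
input id wave str8 empstat
1 1 "not emp"
1 2 "emp"
2 1 "emp"
2 2 "not emp"
3 1 "emp"
3 2 "emp"
end

* Flag every observation of a panel whose first wave is "not emp"
bysort id (wave) : gen byte start_unemp = empstat[1] == "not emp"

list, sepby(id)
drop if start_unemp
```

Person 1 is dropped entirely. Person 2 is kept, because the spell of
unemployment comes after the first wave.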
Working through the possibilities, for each panel
by id : egen, max(<condition>)   gives 1 if the condition ever occurs
                                       0 if it never occurs
by id : egen, min(<condition>)   gives 1 if the condition always occurs
                                       0 if it does not always occur
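The any and all cases can be seen together in a minimal sketch. The data
are made up, and -employed-, -any_emp- and -all_emp- are hypothetical
names chosen for the example.

```stata
* Toy data (made up): employed is 1 or 0 in each wave
clear
input id wave employed
1 1 1
1 2 1
2 1 0
2 2 1
3 1 0
3 2 0
end

* 1 if the person was ever employed, 0 if never
bysort id : egen any_emp = max(employed == 1)

* 1 if the person was always employed, 0 if not always
bysort id : egen all_emp = min(employed == 1)

sort id wave
list, sepby(id)
```

Person 1 is both ever and always employed, person 2 is ever but not
always employed, and person 3 is neither.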
Thinking about these threads made me wonder whether an FAQ on the topic
would be a good idea. After a search, I realised that one already
exists!
How do I create a variable recording whether any members of a group (or
all members of a group) possess some characteristic?
http://www.stata.com/support/faqs/data/anyall.html
This gives rise to various wry reflections:
0. There is not much hope if people who have written documentation can't
remember that they have written it.
1. There is so much support available -- several hundred FAQs on
www.stata.com, the UCLA website, etc. -- that many people just don't
have the time or inclination even to scan to see what is available. Or,
it is very difficult to keep it all in mind!
2. There is a real keyword problem. That FAQ is not written about
panels, and would not be indicated by -search panel, faq-. But the Stata
solution there is the same as the Stata solution for panels here (apart
from the latter including a need to keep an eye on time order too).
Nick