Paul Seed
>
> Taggert J Brooks' problem <[email protected]>
> can be approached in a different way that gets round
> the missing data / incomplete panel issue.
>
> > I have a panel data set on individual
> > countries and I
> > want to create a new variable. The variable is going to be
> > the ratio of a
> > variable I already have in my dataset, we'll call it P. I
> > want to create a
> > variable we'll call X where x=p/p(i==72). That is to say I
> > want to deflate
> > all the values of p, by the value of p for country 72, for
> > the corresponding
> > year.
>
> Assume the data is saved in myfile.dat.
>
> use country year p using myfile if country == 72, clear
> drop country
> rename p p72
> mmerge year using myfile , type(1:n)
> gen x = p/p72
>
> x will be missing only if p72 == 0, p ==. or p72 == .
>
There are clearly different ways to solve this
problem, but it _can_ be done without file
manipulations, keeping all data in place.
There is, as far as I can see, no missing
data or incomplete panel issue, but I should have
explained that more clearly in earlier postings.
Scott Merryman posted a solution using an
example with a well-behaved data set. His example
exploited the structure of that data set and
was clearly ad hoc and designed to show an
approach, and not a general solution.
Generalisation is easy using the white magic
of by:.
Scott and I are preparing a FAQ on the question,
and here is part of a draft:
I have panel data. One panel (country, company, person,
whatever) serves as a reference panel. How do I relate
variables for other panels to that reference panel,
say as a ratio (this panel's value for this time) /
(reference panel's value for this time)?
Answer:
1. Identify the variable of interest and reference panel
To fix ideas, we will work with a panel data set which
may be downloaded from the Stata website:
. use http://www.stata-press.com/data/r7/grunfeld.dta, clear
This data set includes economic data on 10 anonymous companies for
20 years, 1935-54. The data have been -tsset- and the panel variable
is
-company- and the time variable is -year-. Formally setting the data
as
time series using -tsset- is not essential to what follows,
but it is a good idea anyway.
The company designated 1 has highest -mvalue- throughout this period,
so we will use this as the reference company to illustrate technique,
and work with -mvalue- for illustration.
Selecting that company's values is easy:
. gen mvalue_ref = mvalue if company == 1
Let us emphasise here that the natural consequence of that
command is to put missing values into the observations
of -mvalue_ref- for other companies.
2. Spread the reference panel's values to other panels
The next step, and the one that at first sight appears trickiest,
is to spread those values to other companies. Let us imagine the
results of a -sort- first on -year- and then within -year- on
-mvalue_ref-.
For example, the first value of -year- is 1935, and within all the
values for 1935 the ten companies would be sorted according to
-mvalue_ref-. Now at most one value for -mvalue_ref- for 1935 will
be non-missing, the value for company 1; all others will be
missing, as just mentioned. The result of a -sort- on numeric
values is always to place missing values last, after any
non-missing values. In short, for each year, any non-missing
value will always end up as the first value. We can exploit
this fact to spread the reference panel's values to other panels:
. bysort year (mvalue_ref) : replace mvalue_ref = mvalue_ref[1]
This statement does a lot at once, which can be unpacked as follows:
- sort on -year- and then within -year- on -mvalue_ref-
- for each block of data for each year, replace -mvalue_ref- by
the first value it takes in that block. (Under -by varlist:-,
subscripts such as [1] are interpreted within each block,
not with reference to the data set as a whole.)
For more on -bysort-, see the manuals, or a tutorial
at Stata Journal 2(1), 86-102 (2002).
3. Calculate the desired quantity
Now we are home and dry, and can calculate whatever we want, say
. gen mvalue_sc = mvalue / mvalue_ref
or the logarithm of that, or a difference.
4. What could go wrong with this approach?
This example data set appears to be in excellent condition.
There are no missing values in the data set, and it is complete
with all companies represented for all years. But two things
often go wrong with other panel data sets.
First, suppose that -mvalue- were in fact missing for 1935
and company 1. Then all 10 values of -mvalue_ref- would be
born missing for 1935, and remain so after -sort-ing and
then -replace-ment. Some missing value would get sorted to
first position within 1935 (it does not matter which as
they are all identical) and a missing would get overwritten
by a missing. This seems fair.
Second, suppose -mvalue- for 1935 and company 1 were in fact
omitted from the data set. Then all 9 values of -mvalue_ref-,
for companies 2 through 10, would be born missing for 1935,
and the result would resemble that in the previous situation,
although for a different reason.
Thus the two main problems have reasonable consequences and
do not affect observations except for the year to which they apply.
Nick
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/