The two representations of the husbands-and-wives data above are called the wide and the long forms, and the choice between forms arises in a variety of contexts. Consider these two datasets:
|
|
Both datasets record the same data, but they organize the data differently. Whenever you deal with data where there are repeated observations, the organizational issue arises, and there is no right answer to the question.
. generate gro = (inc81-inc80)/inc80 . regress gro sex
. xtset id . xtreg inc sex, be
The statistical commands in my examples are used for illustration—what is important is realizing that sometimes I want one form and sometimes the other.
Given either of the above datasets, Stata can make the other, and Stata can convert it back again. You already know one way to do this because, conceptually, there is no difference between the 2-years-of-income example and the previous husbands-and-wives data example.
In fact, Stata has a command to make switching between forms easier, and this command will deal not just with two members within a group (such as husbands and wives or one year and another), but with many. First, we need some jargon:
|
|
A variable is called a "within-group constant", or just a "constant", if its value does not vary within a group. Variables id and sex are constants.
A variable is called a "within-group variable", or just a "variable", if its value varies within a group. In the wide form, the within-group variables are separate dataset variables; we have variables inc80 and inc81. In the long form, within-group variables are single dataset variables; we have variable inc, and another variable—the "grouping variable" (year)—serves to identify the groups.
We can get from one form to the other easily with the reshape command. Let me first redraw the wide and long data. Think of the data as a collection of observations x_ij.
|
|
The information reshape needs is the identity of the "i" variable(s), the identity of the "j" variable(s), and the identity of the "x_ij" variable(s). The syntax is easy.
. reshape long inc, i(id) j(year) (goes from wide to long) . reshape wide inc, i(id) j(year) (goes from long to wide)
After the reshape long or the reshape wide, we specify the "x_ij" variable name (when in long form) or variable stub name (when in wide form): inc.
The i() option identifies each logical observation—the i subscript, id. (Think in terms of the data in wide form.)
The j() option identifies the name of the grouping variable—the j subscript, year. Stata figures out the values contained in the grouping variable (80 and 81) from either the variable names when in wide form or the variable values when in long form.
Note that we do not specify the sex variable. With
reshape, the unspecified variables should be constant
within each level of the i() variables. If this is not true,
reshape will give you an informative error message. If this is not true,
reshape
will give you an informative error message.
Type the following:
. webuse reshape2, clear . list
+----------------------------------+ | id sex inc80 inc81 inc82 | |----------------------------------| 1. | 1 0 5000 5500 6000 | 2. | 2 1 2000 2200 3300 | 3. | 3 0 3000 2000 1000 | 4. | 2 0 2400 2500 2400 | +----------------------------------+. reshape long inc, i(id) j(year)
There is an error in the wide form of the data, and when I typed
reshape long inc, i(id) j(year)
, I got
. reshape long inc, i(id) j(year)
(note: j = 80 81)
variable id does not uniquely identify the observations
Your data are currently wide. You are performing a reshape long. You
specified i(id) and j(year). In the current wide form, variable id should
uniquely identify the observations. Remember this picture:
long wide
+---------------+ +------------------+
| i j a b | | i a1 a2 b1 b2 |
|---------------| <--- reshape ---> |------------------|
| 1 1 1 2 | | 1 1 3 2 4 |
| 1 2 3 4 | | 2 5 7 6 8 |
| 2 1 5 6 | +------------------+
| 2 2 7 8 |
+---------------+
Type reshape error for a list of the problem observations.
r(9);
Stata saw that there was an error—id
was not unique for each
observation. We can find out more with the reshape
error
command.
. reshape error
(j = 80 81 82)
i (id) indicates the top-level grouping such as subject id.
The data are currently in the wide form; there should be a single
observation per i.
2 of 4 observations have duplicate i values:
+----+
| id |
|----|
2. | 2 |
3. | 2 |
+----+
(data now sorted by id)
reshape
found that the observations that were not unique.
To fix the data so that they can be reshaped, we can type
. replace id = 4 if id==2 & sex==0
(1 real change made). reshape long inc, i(id) j(year)
(j = 80 81 82) Data Wide -> Long ----------------------------------------------------------------------------- Number of observations 4 -> 12 Number of variables 5 -> 4 j variable (3 values) -> year xij variables: inc80 inc81 inc82 -> inc -----------------------------------------------------------------------------. list
+------------------------+ | id year sex inc | |------------------------| 1. | 1 80 0 5000 | 2. | 1 81 0 5500 | 3. | 1 82 0 6000 | 4. | 2 80 1 2000 | 5. | 2 81 1 2200 | |------------------------| 6. | 2 82 1 3300 | 7. | 3 80 0 3000 | 8. | 3 81 0 2000 | 9. | 3 82 0 1000 | 10. | 4 80 0 2400 | |------------------------| 11. | 4 81 0 2500 | 12. | 4 82 0 2400 | +------------------------+
After we have used reshape
once, Stata understands the
structure, and if the data are currently long, we can simply type
. reshape wide
(j = 80 81 82) Data Long -> Wide ----------------------------------------------------------------------------- Number of observations 12 -> 4 Number of variables 4 -> 5 j variable (3 values) year -> (dropped) xij variables: inc -> inc80 inc81 inc82 -----------------------------------------------------------------------------. list
+----------------------------------+ | id inc80 inc81 inc82 sex | |----------------------------------| 1. | 1 5000 5500 6000 0 | 2. | 2 2000 2200 3300 1 | 3. | 3 3000 2000 1000 0 | 4. | 4 2400 2500 2400 0 | +----------------------------------+
and the data are switched to the wide form. On the other hand, if the data are currently wide, we can type
. reshape long
(j = 80 81 82) Data Wide -> Long ----------------------------------------------------------------------------- Number of observations 4 -> 12 Number of variables 5 -> 4 j variable (3 values) -> year xij variables: inc80 inc81 inc82 -> inc -----------------------------------------------------------------------------. list
+------------------------+ | id year inc sex | |------------------------| 1. | 1 80 5000 0 | 2. | 1 81 5500 0 | 3. | 1 82 6000 0 | 4. | 2 80 2000 1 | 5. | 2 81 2200 1 | |------------------------| 6. | 2 82 3300 1 | 7. | 3 80 3000 0 | 8. | 3 81 2000 0 | 9. | 3 82 1000 0 | 10. | 4 80 2400 0 | |------------------------| 11. | 4 81 2500 0 | 12. | 4 82 2400 0 | +------------------------+
and the data are switched to the long form.
Once we have given the definitions, we
can switch back and forth by typing reshape wide
and
reshape long
for the remainder of our session
without redefining the groups, variables,
and constants.
Both the manual (see [D] reshape) and the online help (see help reshape
) show this example with three years of data, and
there is no reason you cannot use reshape
with four, five, or
more years of data, nor are you limited to one within-group variable. In some
other example, we might have typed
. reshape long inc hours wksue, i(id) j(year)
to go from wide to long form. Say that year
takes on values from
80 to 88 and there are some additional unspecified variables (sex
,
age
, and ownshome
).
Thus, reshape long inc hours wksue, i(id) j(year)
says that the variables are
Wide form | Long form |
---|---|
id | id |
sex, age, ownshome | sex, age, ownshome |
year | |
inc80, inc81, ..., inc88 | inc |
hours80, hours81, ..., hours88 | hours |
wksue80, wksue81, ..., wksue88 | wksue |
There are other more advanced features of reshape
that
you can learn about in the manual (see [D] reshape). Most cases
can be handled with the simple syntax I have illustrated.