Home  /  Learn  /  NetCourses  /  Sample lesson NetCourse 101
The following is a 3-page extract from the 32 pages of Lecture 4, NetCourse® 101: Introduction to Stata.

Sample lesson from NetCourse 101

Wide versus long data

The two representations of the husbands-and-wives data above are called the wide and the long forms, and the choice between forms arises in a variety of contexts. Consider these two datasets:

Wide form
id sex inc80 inc81
1 0 5000 5500
2 1 2000 2200
3 0 3000 2000
 
Long form
id year sex inc
1 80 0 5000
1 81 0 5500
2 80 1 2000
2 81 1 2200
3 80 0 3000
3 81 0 2000

Both datasets record the same data, but they organize the data differently. Whenever you deal with data where there are repeated observations, the organizational issue arises, and there is no right answer to the question.

  • Suppose that I want to look at income growth by sex. The wide form makes that easy to see. I might, for instance, type
    . generate gro = (inc81-inc80)/inc80
    
    . regress gro sex
    
  • Suppose that I want to study income level by sex. Now it is the long form that more easily shows the answer to the question. I might type
    . xtset id
    
    . xtreg inc sex, be
    

The statistical commands in my examples are used for illustration—what is important is realizing that sometimes I want one form and sometimes the other.

Given either of the above datasets, Stata can make the other, and Stata can convert it back again. You already know one way to do this because, conceptually, there is no difference between the 2-years-of-income example and the previous husbands-and-wives data example.

In fact, Stata has a command to make switching between forms easier, and this command will deal not just with two members within a group (such as husbands and wives or one year and another), but with many. First, we need some jargon:

Wide form
constants variables
id sex inc80 inc81
group 1--> 1 0 5000 5500
group 2--> 2 1 2000 2200
group 3--> 3 0 3000 2000
Long form
constant "the" grouping variable constant variable
id year sex inc
group 1--> 1 80 0 5000
1 81 0 5500
group 2--> 2 80 1 2000
2 81 1 2200
group 3--> 3 80 0 3000
3 81 0 2000

A variable is called a "within-group constant", or just a "constant", if its value does not vary within a group. Variables id and sex are constants.

A variable is called a "within-group variable", or just a "variable", if its value varies within a group. In the wide form, the within-group variables are separate dataset variables; we have variables inc80 and inc81. In the long form, within-group variables are single dataset variables; we have variable inc, and another variable—the "grouping variable" (year)—serves to identify the groups.

We can get from one form to the other easily with the reshape command. Let me first redraw the wide and long data. Think of the data as a collection of observations x_ij.

Wide form
i   x_ij
id sex inc80 inc81
1 0 5000 5500
2 1 2000 2200
3 0 3000 2000
 
Long form
i j   x_ij
id year sex inc
1 80 0 5000
1 81 0 5500
2 80 1 2000
2 81 1 2200
3 80 0 3000
3 81 0 2000

The information reshape needs is the identity of the "i" variable(s), the identity of the "j" variable(s), and the identity of the "x_ij" variable(s). The syntax is easy.

. reshape long inc, i(id) j(year)   (goes from wide to long)

. reshape wide inc, i(id) j(year)   (goes from long to wide)

After the reshape long or the reshape wide, we specify the "x_ij" variable name (when in long form) or variable stub name (when in wide form): inc.

The i() option identifies each logical observation—the i subscript, id. (Think in terms of the data in wide form.)

The j() option identifies the name of the grouping variable—the j subscript, year. Stata figures out the values contained in the grouping variable (80 and 81) from either the variable names when in wide form or the variable values when in long form.

Note that we do not specify the sex variable. With reshape, the unspecified variables should be constant within each level of the i() variables. If this is not true, reshape will give you an informative error message. If this is not true, reshape will give you an informative error message. Type the following:

. webuse reshape2, clear

. list

     +----------------------------------+
     | id   sex   inc80   inc81   inc82 |
     |----------------------------------|
  1. |  1     0    5000    5500    6000 |
  2. |  2     1    2000    2200    3300 |
  3. |  3     0    3000    2000    1000 |
  4. |  2     0    2400    2500    2400 |
     +----------------------------------+

. reshape long inc, i(id) j(year)

There is an error in the wide form of the data, and when I typed reshape long inc, i(id) j(year), I got

. reshape long inc, i(id) j(year)
(note:  j = 80 81)
variable id does not uniquely identify the observations
    Your data are currently wide.  You are performing a reshape long.  You
    specified i(id) and j(year).  In the current wide form, variable id should
    uniquely identify the observations.  Remember this picture:

         long                                wide
        +---------------+                   +------------------+
        | i   j   a   b |                   | i   a1 a2  b1 b2 |
        |---------------| <--- reshape ---> |------------------|
        | 1   1   1   2 |                   | 1   1   3   2  4 |
        | 1   2   3   4 |                   | 2   5   7   6  8 |
        | 2   1   5   6 |                   +------------------+
        | 2   2   7   8 |
        +---------------+
    Type reshape error for a list of the problem observations.
r(9);

Stata saw that there was an error—id was not unique for each observation. We can find out more with the reshape error command.

. reshape error
(j = 80 81 82)

i (id) indicates the top-level grouping such as subject id.

The data are currently in the wide form; there should be a single
observation per i.

2 of 4 observations have duplicate i values:

     +----+
     | id |
     |----|
  2. |  2 |
  3. |  2 |
     +----+

(data now sorted by id)

reshape found that the observations that were not unique. To fix the data so that they can be reshaped, we can type

. replace id = 4 if id==2 & sex==0
(1 real change made)

. reshape long inc, i(id) j(year)
(j = 80 81 82)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                4   ->   12
Number of variables                   5   ->   4
j variable (3 values)                     ->   year
xij variables:
                      inc80 inc81 inc82   ->   inc
-----------------------------------------------------------------------------

. list

     +------------------------+
     | id   year   sex    inc |
     |------------------------|
  1. |  1     80     0   5000 |
  2. |  1     81     0   5500 |
  3. |  1     82     0   6000 |
  4. |  2     80     1   2000 |
  5. |  2     81     1   2200 |
     |------------------------|
  6. |  2     82     1   3300 |
  7. |  3     80     0   3000 |
  8. |  3     81     0   2000 |
  9. |  3     82     0   1000 |
 10. |  4     80     0   2400 |
     |------------------------|
 11. |  4     81     0   2500 |
 12. |  4     82     0   2400 |
     +------------------------+

After we have used reshape once, Stata understands the structure, and if the data are currently long, we can simply type

. reshape wide
(j = 80 81 82)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations               12   ->   4
Number of variables                   4   ->   5
j variable (3 values)              year   ->   (dropped)
xij variables:
                                    inc   ->   inc80 inc81 inc82
-----------------------------------------------------------------------------

. list

     +----------------------------------+
     | id   inc80   inc81   inc82   sex |
     |----------------------------------|
  1. |  1    5000    5500    6000     0 |
  2. |  2    2000    2200    3300     1 |
  3. |  3    3000    2000    1000     0 |
  4. |  4    2400    2500    2400     0 |
     +----------------------------------+

and the data are switched to the wide form. On the other hand, if the data are currently wide, we can type

. reshape long
(j = 80 81 82)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                4   ->   12
Number of variables                   5   ->   4
j variable (3 values)                     ->   year
xij variables:
                      inc80 inc81 inc82   ->   inc
-----------------------------------------------------------------------------

. list

     +------------------------+
     | id   year    inc   sex |
     |------------------------|
  1. |  1     80   5000     0 |
  2. |  1     81   5500     0 |
  3. |  1     82   6000     0 |
  4. |  2     80   2000     1 |
  5. |  2     81   2200     1 |
     |------------------------|
  6. |  2     82   3300     1 |
  7. |  3     80   3000     0 |
  8. |  3     81   2000     0 |
  9. |  3     82   1000     0 |
 10. |  4     80   2400     0 |
     |------------------------|
 11. |  4     81   2500     0 |
 12. |  4     82   2400     0 |
     +------------------------+

and the data are switched to the long form.

Once we have given the definitions, we can switch back and forth by typing reshape wide and reshape long for the remainder of our session without redefining the groups, variables, and constants.

Both the manual (see [D] reshape) and the online help (see help reshape) show this example with three years of data, and there is no reason you cannot use reshape with four, five, or more years of data, nor are you limited to one within-group variable. In some other example, we might have typed

. reshape long inc hours wksue, i(id) j(year)

to go from wide to long form. Say that year takes on values from 80 to 88 and there are some additional unspecified variables (sex, age, and ownshome).

Thus, reshape long inc hours wksue, i(id) j(year) says that the variables are

Wide form Long form
id id
sex, age, ownshome sex, age, ownshome
year
inc80, inc81, ..., inc88 inc
hours80, hours81, ..., hours88 hours
wksue80, wksue81, ..., wksue88 wksue

There are other more advanced features of reshape that you can learn about in the manual (see [D] reshape). Most cases can be handled with the simple syntax I have illustrated.