Title | Connecting points within groups |
Author | Nicholas J. Cox, Durham University, UK |
To connect points with straight lines on a two-way graph, specify graph’s connect() option, for example, graph y x, connect(l).
The l inside the parentheses is called the connect style, and connect(l) is probably the most common. It connects all the points shown on the graph, joining them according to the current sort order of the data. The connect() option may be abbreviated all the way down to c(), so most people would type graph y x, c(l).
Understand that c(l) connects the points in the order of the data. If the data are time series in time order, this gives a line graph showing successive changes, say, from year to year. If the data are in some other order, c(l) may be useful for showing trajectories in the space defined by any two variables.
Another connect style is c(L): it joins points if and only if successive values of the x variable (on the horizontal axis) are in ascending order. To be precise, this does include cases where values of the x variable are constant. This may sound like a rather special case, but c(L) can be very useful for ensuring that points are joined only in groups.
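To see the difference between the two connect styles, a minimal sketch with made-up data may help. It assumes Stata 8 or later, where the styles are passed to twoway line as connect(l) and connect(L). The variable names, values, and graph names are hypothetical.

```stata
* Hypothetical data: x rises from 1 to 3, then drops back to 1
clear
input x y
1 1
2 3
3 2
1 4
2 5
end

* connect(l): joins every point in data order, including the jump from x=3 to x=1
twoway line y x, connect(l) name(direct, replace)

* connect(L): joins successive points only while x is not decreasing,
* so the line breaks between x=3 and x=1
twoway line y x, connect(L) name(ascending, replace)
```

Comparing the two graphs shows one continuous line under connect(l) and two separate segments under connect(L).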
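The general recipe for connecting points within groups consists of three steps: compute the minimum of x within each group, sort the groups in descending order of that minimum (with x ascending within each group), and draw the graph with c(L).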
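These three steps might be written in Stata as follows. This is a sketch under stated assumptions, not a definitive listing: the toy data are invented, x0 is simply a convenient name for the group minimum, and the graph command uses the twoway line syntax of Stata 8 or later rather than the older graph y x, c(L) form used in this FAQ.

```stata
* Toy data (hypothetical): three groups observed over different x ranges
clear
input group x y
1 1 2.0
1 2 2.5
1 3 3.1
2 4 1.0
2 5 1.8
3 2 4.0
3 3 4.2
end

* Step 1: minimum of x within each group
egen x0 = min(x), by(group)

* Step 2: groups with the largest minimum first; x ascending within each group
gsort -x0 group x

* Step 3: join points only while x is not decreasing
twoway line y x, connect(L)
* In the older syntax used in this FAQ: graph y x, c(L)
```

With these data, a plain sort group x would place x=3 (end of group 1) just before x=4 (start of group 2), and c(L) would join the two groups. After the gsort, x falls at every change of group.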
Say you wish to plot y versus x, connecting the points, by group. That is, you want (x,y) plotted and the points connected for group 1, (x,y) plotted and the points connected for group 2, etc., but you do not want the points in one group connected to the points in another. Were you to type graph y x, c(l), you would obtain a graph with all the points connected. Were you to sort the data by group and type graph y x, c(l) by(group), you would obtain nearly what you want, but you would obtain separate graphs for each group. Let's assume you want all the points in one graph. Type sort group x followed by graph y x, c(L).
Below we explain why this often works and why it sometimes does not, and we show why computing each group's minimum x (egen x0 = min(x), by(group)), sorting with gsort -x0 group x, and then typing graph y x, c(L) is a better solution.
We also show how to draw other graphs with distinct line segments.
Let’s look at some examples, modeled closely on questions that arose on Statalist.
I have panel data on the weights of several hundred babies at different ages. I want a plot in which each individual is represented by a distinct connected line.
Here are my data
age (weeks)   baby_id   weight (kg)
     2          123         20
     3          123         24
     4          123         28
    ...         ...         ...
     2          654         19
     3          654         23
     4          654         27
    ...         ...         ...
(yes, these are hefty babies).
To reiterate, I want to plot weight against age, by baby_id, connecting the points for each baby.
If you are lucky, you will need to type no more than sort baby_id age followed by graph weight age, c(L).
Here we put the data in order of babies and, within each baby, of age. Then we connect the points from left to right. This will work if the youngest age of each baby is younger than the oldest age of the baby that precedes it, because age then decreases as we pass from one baby to the next and c(L) does not join the two points.
In real data, however, there might be some problems if babies drop out of or enter a study in the middle. Suppose that, after sorting our data, the last observation on one baby (baby 888) and the first observation on the next (baby 889) are
age (weeks)   baby_id   weight (kg)
    21          888         45
    24          889         34
Baby 889 is older (24 weeks) at its youngest than the preceding baby at its oldest (21 weeks). Stata will draw a line connecting these two babies because variable age is increasing.
The way around this is to order the babies so that this does not happen. Let’s call age0 the youngest age at which each baby is observed. Then we want to order the babies so that the babies with the largest values of age0 occur first in the data. Doing that will ensure that, when we proceed from one baby to the next, age decreases, which in turn will prevent c(L) from connecting the points of different babies.
Obtaining the earliest age (minimum value of x) is easy with egen: egen age0 = min(age), by(baby_id).
Putting the data in order so that the babies with the largest age0 occur first is also easy: gsort -age0 baby_id age.
gsort is a variation on Stata’s sort command; it allows us to put the data in ascending or descending order. We specify -age0 to obtain descending order on age0.
Now we are ready to draw our graph. Putting this all together, we compute age0, sort with gsort, and then type graph weight age, c(L).
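A minimal sketch of the whole sequence for the babies, assuming Stata 8 or later (twoway line in place of the older graph command). Only the 21-week row for baby 888 and the 24-week row for baby 889 come from the example above; the other rows are invented so that the sketch runs on its own.

```stata
* Partly hypothetical data: rows other than (21, 888, 45) and (24, 889, 34)
* are made up for illustration
clear
input age baby_id weight
19 888 43
20 888 44
21 888 45
24 889 34
25 889 35
26 889 36
end

* Youngest age at which each baby is observed
egen age0 = min(age), by(baby_id)

* Babies with the largest age0 first; ages ascending within each baby
gsort -age0 baby_id age

* Baby 889 (age0 = 24) now precedes baby 888 (age0 = 19), so age drops
* from 26 to 19 between babies and connect(L) leaves a break there
twoway line weight age, connect(L)
* In the older syntax used in this FAQ: graph weight age, c(L)
```

Had the data been sorted with a plain sort baby_id age, age would rise from 21 to 24 between the two babies and their lines would be joined.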
In other cases, we might need to sort even more carefully using both the minimum and the maximum age recorded for each baby.
For the general problem, if we want to graph y versus x, connecting the points within the group, the solution is the same three steps: egen x0 = min(x), by(group), then gsort -x0 group x, then graph y x, c(L).
Unfortunately, even this code is not bulletproof if we have the following situation, illustrated yet again by baby weights.
The random noise will be at most 0.005 and at least -0.005. For presentation purposes, we need to work at the axis titles as well.
I have time data with gaps. Data should have been measured regularly, but there are some observations with missing values (somebody was sick, we lost the record, whatever).
time   var1   var2
   1      3      4
   2      4      5
   3      5      6
   4      .      .
   5      6      7
   6      6      6
   7      .      .
   8      5      5
   9    4.5    5.2
graph var1 var2 time, c(l) draws lines boldly jumping across the gaps. Instead, I want an honest graph showing breaks.
If we could define a group variable that tied together contiguous observations, this would be the same problem as the one we just handled.
Here is how we make that variable:
Define the groups. We can set up a counter with gen block = sum(var1 == .).
var1 == . is 1 if var1 is missing and 0 if var1 is present. As we sum them, we get
time   var1   var2   block
   1      3      4       0
   2      4      5       0
   3      5      6       0
   4      .      .       1
   5      6      7       1
   6      6      6       1
   7      .      .       2
   8      5      5       2
   9    4.5    5.2       2
Every time we find a new missing value, the counter jumps by 1. But notice that for this to work properly, it is essential that the data are sorted by time. Our data were already sorted, but to be safe, we would type sort time before creating the counter.
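Now that we have a group variable, we follow our generic solution, which is egen x0 = min(x), by(group), then gsort -x0 group x, then graph y x, c(L).
In this case, group=block, x=time, and we have two y variables, var1 and var2. Substituting, we would type egen x0 = min(time), by(block), then gsort -x0 block time, then graph var1 var2 time, c(L).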
So, the complete solution is to sort by time, create block, and then apply the three steps with group=block and x=time.
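Put together, the sequence might look like the sketch below. It uses the example data from the question and assumes Stata 8 or later, so the final command is written with twoway line; x0 is just a convenient name for the block minimum.

```stata
* Example data from the question, with gaps at time 4 and time 7
clear
input time var1 var2
1 3   4
2 4   5
3 5   6
4 .   .
5 6   7
6 6   6
7 .   .
8 5   5
9 4.5 5.2
end

* The running sum below is only meaningful in time order
sort time

* Counter that increases by 1 at every missing value of var1
generate block = sum(var1 == .)

* Generic recipe with group = block and x = time
egen x0 = min(time), by(block)
gsort -x0 block time

* Join points only while time is not decreasing, one line per y variable
twoway (line var1 time, connect(L)) (line var2 time, connect(L))
* In the older syntax used in this FAQ: graph var1 var2 time, c(L)
```

After the gsort, the blocks appear in the order 2, 1, 0, so time falls from 9 to 4 and from 6 to 1 at the block boundaries, and no line is drawn across either gap.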