I don't understand the reluctance to -reshape-. I am going to assume
that you do that.
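For the wide layout shown in the question (start1 stop1 ... start40 stop40, one row per id), the reshape might look like the sketch below. The stubs start and stop and the new variable name activity are taken from the question's example; any other variables are assumed to be constant within id.

```stata
* Sketch: go from one row per id to one row per id-activity pair.
* Assumes id uniquely identifies rows in the wide data.
reshape long start stop, i(id) j(activity)
```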
Your example suggests the following code:
tokenize 0.5 17.5 24.5 44.5 64.5 81
qui forval i = 1/5 {
local j = `i' + 1
gen grp_`i' = max(min(stop, ``j'') - max(start, ``i''), 0) ///
if start < . & stop < .
}
l
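As a check on the expression inside the loop: -tokenize- puts the six cut points into the local macros `1' to `6', so ``i'' is first resolved to, say, `2' and then to 17.5. The lines below, meant as a rough illustration to run from a do-file, work through activity 4 of id 1 (start 22, stop 28) by hand.

```stata
* Overlap of [22, 28] with the cut points 17.5-24.5 and 24.5-44.5
display max(min(28, 24.5) - max(22, 17.5), 0)   // 2.5
display max(min(28, 44.5) - max(22, 24.5), 0)   // 3.5
* No overlap with 0.5-17.5: the difference is negative, so max(., 0) gives 0
display max(min(28, 17.5) - max(22, 0.5), 0)    // 0
```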
Here are the results:
. l
+------------------------------+
| id activity start stop |
|------------------------------|
1. | 1 1 6 15 |
2. | 1 2 22 25 |
3. | 1 3 15 16 |
4. | 1 4 22 28 |
5. | 1 5 30 . |
|------------------------------|
6. | 1 6 . . |
7. | 2 1 53 69 |
8. | 2 2 69 79 |
+------------------------------+
. tokenize 0.5 17.5 24.5 44.5 64.5 81
. qui forval i = 1/5 {
2. local j = `i' + 1
3. gen grp_`i' = max(min(stop, ``j'') - max(start, ``i''), 0) ///
if start < . & stop < .
4. }
. l
     +----------------------------------------------------------------------+
     | id   activity   start   stop   grp_1   grp_2   grp_3   grp_4   grp_5 |
     |----------------------------------------------------------------------|
  1. |  1          1       6     15       9       0       0       0       0 |
  2. |  1          2      22     25       0     2.5      .5       0       0 |
  3. |  1          3      15     16       1       0       0       0       0 |
  4. |  1          4      22     28       0     2.5     3.5       0       0 |
  5. |  1          5      30      .       .       .       .       .       . |
     |----------------------------------------------------------------------|
  6. |  1          6       .      .       .       .       .       .       . |
  7. |  2          1      53     69       0       0       0    11.5     4.5 |
  8. |  2          2      69     79       0       0       0       0      10 |
     +----------------------------------------------------------------------+
Nick
Thomas Speidel
I am attempting to compute several time points to calculate the
interval (years) between the start and the end of an activity and to
assign that interval to its relevant age group. For example, given
the following dataset:
id activity start stop
1 1 6 15
1 2 22 25
1 3 15 16
1 4 22 28
1 5 30 .
1 6 . .
2 1 53 69
2 2 69 79
I am trying to derive the following:
id  activity  start  stop  grp_0_17  grp_1~24  grp_2~44  grp_4~64  grp_6~81
 1         1      6    15         9         0         0         0         0
 1         2     22    25         0       2.5        .5         0         0
 1         3     15    16         1         0         0         0         0
 1         4     22    28         0       2.5       3.5         0         0
 1         5     30     .         0         0         1         0         0
 1         6      .     .         .         .         .         .         .
 2         1     53    69         0         0         0      11.5       4.5
 2         2     69    79         0         0         0         0        10
The age groups are:
[0.5, 17.5]
[17.6, 24.5]
[24.6, 44.5]
[44.6, 64.5]
[64.6, 81]
If the dataset were in long format as above, it would not be terribly
hard. What complicates things slightly is that an interval may
need to be split correctly when it spans two or more age
groups. However, my data are in wide format (one row per id),
which makes it a nightmare even to check or troubleshoot my code (I
have 40 activities per id), and the dataset is so large that I am
reluctant to reshape it.
This is what the dataset above would look like:
id  start1  stop1  start2  stop2  start3  stop3  start4  stop4  start5  stop5  start6  stop6
 1       6     15      22     25      15     16      22     28      30      .       .      .
 2      53     69      69     79       .      .       .      .       .      .       .      .
-The activities do not necessarily follow a temporal sequence (e.g. the
3rd observation above).
-Although the example does not show it, every id has exactly 40
activities, even though many of them may be completely missing.
-Whenever a start is present but its corresponding stop is missing (as
in the 5th observation above), it means that at the time of the study the
person was still performing that activity, hence stop would be taken from a
variable called ageref. If start==ageref, then the interval would be
approximated as 1 year.
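The rule in the last point could be sketched as follows for data in long format. This is only one possible reading: ageref is the variable named above, stop_adj is a hypothetical helper name, and treating "approximated as 1 year" as stop = start + 1 is an assumption. The cut points are the ones used in the reply.

```stata
* Sketch only: fill in a missing stop from ageref when start is present
gen stop_adj = stop
replace stop_adj = ageref if missing(stop) & !missing(start)
* Assumption: a zero-length ongoing activity is counted as 1 year
replace stop_adj = start + 1 if missing(stop) & !missing(start) & start == ageref

* Same overlap calculation as in the reply, using stop_adj instead of stop
tokenize 0.5 17.5 24.5 44.5 64.5 81
qui forval i = 1/5 {
    local j = `i' + 1
    gen grp_`i' = max(min(stop_adj, ``j'') - max(start, ``i''), 0) ///
        if start < . & stop_adj < .
}
```

If grp_1 to grp_5 already exist from an earlier run, they would need to be dropped first, since -generate- will not overwrite existing variables.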
I would appreciate any feedback on how to best tackle this problem.