Sandu Cojocaru <[email protected]> asked,
> I'm having trouble generating a variable that for each member i equals
> sum(Cj-Ci) over all Cj>Ci where i and j are members of the same group.
> Here's an example of the data setup - I'm trying to calculate
> `outcome_var'.
> For row 1 outcome_var=0, for row 3 = (200-100)+(300-100) = 300...and so on...
>
> group_id member_id C outcome_var
> 1 1 300 0
> 1 2 200 100
> 1 3 100 300
> 2 1 150 50
> 2 2 200 0
> 2 3 100 150
> 2 4 50 300
> 3 1 and so on...
This question has already been answered elegantly by Martin Weiss
<[email protected]>. His answer was,
> clear*
>
> input byte(group_id member_id) C
> 1 1 300
> 1 2 200
> 1 3 100
> 2 1 150
> 2 2 200
> 2 3 100
> 2 4 50
> end
>
> compress
> list, noo sepby(group_id)
>
> bys group_id (C): /*
> */ gen diff=C[_n+1]-C[_n]
> bys group_id: gen num=_N-_n
> bys group_id (num): /*
> */ gen outcome_var=sum(diff*num)
> sort group_id member_id
>
> drop diff num
> list, noo sepby(group_id)
I'm about to give a different answer. Sometimes one needs to create a
variable that is a complicated combination of values in different
observations. There is always a way to do it in Stata, but sometimes
the solution is elusive and one wishes one could just loop across
the observations and make the calculation directly, even if that solution
is inefficient. I want to show how to do that using Mata.
The basic recipe is
1. Enter Mata:
. mata
: _
2. Create individual Mata variables that are views onto each of the
relevant Stata variables. In the above, the relevant Stata variables
are group_id and C, so create Mata variables of the same
names:
: st_view(group_id=., ., "group_id")
: st_view(C=., ., "C")
: _
3. Go back to Stata and create the desired new variable, filled
with missing values. Create a view onto that, too. In this
example, the desired new variable is outcome_var:
: end
. gen outcome_var = .
. mata
: st_view(outcome_var=., ., "outcome_var")
4. Loop in Mata to fill in the new variable.
Before showing the solution to Sandu's problem, let me show how
this works in an easier example.
An easy example
----------------
We want to create new variable newx equal to x+1. We could do this
in Stata by typing
. gen newx = x + 1
Alternatively, we could achieve the same result by typing,
. gen newx = .
. mata
: st_view(x=., ., "x")
: st_view(newx=., ., "newx")
: for (i=1; i<=st_nobs(); i++) {
:     newx[i] = x[i] + 1
: }
: end
Try it. The result after typing all that Mata code will be the same
as -gen newx = x + 1-.
Note the use of Mata function st_nobs() to obtain the number of
observations in the dataset.
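To try this without a dataset of your own, here is a minimal sketch written as a do-file, so the "." and ":" prompts are omitted. The five-observation dataset and the variable name check are made up for illustration; the Mata loop itself is the one shown above.

```stata
clear
set obs 5
gen x = _n                      // made-up data: x = 1, 2, ..., 5
gen newx = .                    // the new variable, filled with missing

mata:
st_view(x=., ., "x")            // view onto Stata variable x
st_view(newx=., ., "newx")      // view onto the new variable
for (i=1; i<=st_nobs(); i++) {
    newx[i] = x[i] + 1          // writes through the view into Stata
}
end

gen check = x + 1               // the ordinary Stata way
assert newx == check            // both approaches agree
```

The final -assert- is silent when every observation matches and stops with an error otherwise.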
Panel data (by)
---------------
Panel data adds complication. Pretend we wanted to code the Mata
equivalent to
. by group: gen newx = x + 1
I know the -by group:- prefix adds nothing to the statement, but
at this point I want to keep the example simple.
The equivalent Mata code is,
. gen newx = .
. mata
: st_view(group=., ., "group")
: st_view(x=., ., "x")
: st_view(newx=., ., "newx")
: obs = panelsetup(group, 1)
: for (g=1; g<=rows(obs); g++) {
:     for (i=obs[g,1]; i<=obs[g,2]; i++) {
:         newx[i] = x[i] + 1
:     }
: }
: end
In the above code, I assume the data are already sorted by group.
Note the line
: obs = panelsetup(group, 1)
If we had two groups -- it wouldn't matter if they were numbered 1 and 2
or 6*_pi and 9 -- and we had three observations in the first group
and five in the second, matrix obs would contain
1 3
4 8
The first row states the observation numbers corresponding to the first
group (1 through 3); the second row states the observation numbers
corresponding to the second (4 through 8). The matrix has two rows
because there are two groups. The matrix always has two columns.
See -help mata panelsetup()-.
In the loop that follows, the outer loop (g) loops across the by
groups. The inner loop (i) loops across the observations within
the group.
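To see the matrix that panelsetup() returns without loading any data, here is a small sketch for the two-group case described above (three observations in the first group, five in the second). The vector group is typed in by hand purely for illustration; with real data it would be the view created by st_view().

```stata
mata:
// made-up, already-sorted group identifiers: 3 obs, then 5 obs
group = (1 \ 1 \ 1 \ 2 \ 2 \ 2 \ 2 \ 2)

// one row per group: first and last observation number of that group
obs = panelsetup(group, 1)
obs

// the result has rows (1, 3) and (4, 8)
assert(obs == (1, 3 \ 4, 8))
end
```

The group values themselves do not matter, only where they change, which is why the data must be sorted by group before calling panelsetup().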
Putting it all together; the solution to Sandu's problem
--------------------------------------------------------
Here is the solution to Sandu's problem:
. sort group_id
. gen outcome_var = .
. mata:
: st_view(group_id=., ., "group_id")
: st_view(C=., ., "C")
: st_view(outcome_var=., ., "outcome_var")
: obs = panelsetup(group_id, 1)
: for (g=1; g<=rows(obs); g++) {
:     for (i=obs[g,1]; i<=obs[g,2]; i++) {
:         sum = 0
:         for (j=obs[g,1]; j<=obs[g,2]; j++) {
:             if (C[j]>C[i]) sum = sum + (C[j]-C[i])
:         }
:         outcome_var[i] = sum
:     }
: }
: end
Note the line
if (C[j]>C[i]) sum = sum + (C[j]-C[i])
That line is coded almost exactly as Sandu stated the problem:
He requested the sum(Cj-Ci) over all Cj>Ci where i and j are members
of the same group.
In the code above, the outer loop (g) loops over group_id. The next
loop (i) loops over the members of the group. The inner loop (j)
also loops over the members of the group so that we obtain all
combinations of i and j.
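For anyone who wants to check the loop against the numbers in Sandu's posting, here is a sketch written as a do-file (no prompts). It enters the seven example observations together with the outcome_var values Sandu listed, stored in a variable I have called expected (that name is made up), runs the same triple loop, and compares. The accumulator is named total here rather than sum; that is only a cosmetic change.

```stata
clear
input byte(group_id member_id) int(C expected)
1 1 300   0
1 2 200 100
1 3 100 300
2 1 150  50
2 2 200   0
2 3 100 150
2 4  50 300
end

sort group_id
gen outcome_var = .

mata:
st_view(group_id=., ., "group_id")
st_view(C=., ., "C")
st_view(outcome_var=., ., "outcome_var")
obs = panelsetup(group_id, 1)
for (g=1; g<=rows(obs); g++) {
    for (i=obs[g,1]; i<=obs[g,2]; i++) {
        total = 0
        for (j=obs[g,1]; j<=obs[g,2]; j++) {
            // add Cj-Ci for every j in the same group with Cj>Ci
            if (C[j]>C[i]) total = total + (C[j]-C[i])
        }
        outcome_var[i] = total
    }
}
end

assert outcome_var == expected   // matches the values in the question
sort group_id member_id
list, noobs sepby(group_id)
```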
Martin's solution executes more quickly than the above solution. I
tried both solutions on 5,000 groups, each with 100 members. Martin's
solution ran in 2.28 seconds. Mine took 35 seconds! That's not so
much Mata's fault as mine. My solution is not clever; I performed
the -if (C[j]>C[i]) sum = sum + (C[j]-C[i])- statement 50,000,000 times!
So what? My solution was not clever and neither did it depend on me
being clever. I wonder which one of us had a solution to this problem
sooner? I just plugged into the recipe:
1. Enter Mata.
2. Create individual Mata variables that are views onto each of the
relevant Stata variables.
3. Go back to Stata and create the desired new variable, filled
with missing values. Create a view onto that, too.
4. Loop in Mata to fill in the new variable.
The only new code I wrote was for (4), and that read
: for (g=1; g<=rows(obs); g++) {
:     for (i=obs[g,1]; i<=obs[g,2]; i++) {
:         sum = 0
:         for (j=obs[g,1]; j<=obs[g,2]; j++) {
:             if (C[j]>C[i]) sum = sum + (C[j]-C[i])
:         }
:         outcome_var[i] = sum
:     }
: }
-- Bill
[email protected]