Mirko <[email protected]> writes,
> I have prepared the following example.
> Given this data:
>
> id region educ age income
> 1 2 1 25 5
> 1 2 1 26 5
> 2 2 1 29 8
> 2 2 1 30 8
> 3 2 1 32 11
> 3 2 1 33 11
> 4 1 1 40 5
> 4 1 1 41 5
> 5 1 2 37 8
> 5 1 2 38 8
> 6 1 2 42 9
> 6 1 2 43 9
>
> I need to create the variable yref, that is the average income of a
> reference group defined by the same level of education, the same
> region and the same cohort (i.e., people who are 5 years younger and 5
> years older).
I think I have a solution. Mirko gave answers for this small dataset
and, in two places, my solution does not give the same results as Mirko
does, but I'm hoping Mirko did the pencil-and-paper calculation wrong.
If I made a mistake, finding it should not be too difficult.
My approach is to create ref.dta containing region, educ, age, and yref.
I said to myself, if I had that dataset, then getting what Mirko wants
would be easy:
. use master // Mirko's original dataset
. sort region educ age
. merge region educ age using ref, nokeep
So now let's think about creating ref.dta.
Every observation in Mirko's original dataset appears in 11 cohort
groups, age-5, age-4, ..., age, age+1, age+2, ..., age+5.
So let's start with the original dataset and make a dataset eleven times
bigger (each original observation appears in 11 different cohorts). Then we
can collapse the big dataset into averages.
Here is my solution to the make-the-giant-dataset problem:
. use master, clear
. rename age cohort
. drop id
. drop in f/l
. save sofar, emptyok replace
. forvalues i= -5(1)5 {
. use master, clear
. drop id
. replace age = age + `i'
. rename age cohort
. append using sofar
. save sofar, replace
. }
I'm sure sofar.dta could have been made more efficiently.
I suspect someone is going to point out that we could -expand- the data,
then -sort-, and then do some very clever -replace-. Above is what occurred
to me, however.
Now that I have sofar.dta, I can make ref.dta from it:
. sort region educ cohort
. by region educ cohort: gen yref = sum(income)/_n
. by region educ cohort: keep if _n==_N
. drop income
. rename cohort age
. sort region educ age
. save ref, replace
and now I'm read to do the last step:
. use master
. sort region educ age
. merge region educ age using ref, nokeep
I have a do-file below my signature. Here's the output it produced:
. list
+----------------------------------------------+
| id region educ age income yref |
|----------------------------------------------|
1. | 1 2 1 25 5 6.5 |
2. | 1 2 1 26 5 6.5 |
3. | 2 2 1 29 8 8 |
4. | 2 2 1 30 8 8 |
5. | 3 2 1 32 11 9.5 |
|----------------------------------------------|
6. | 3 2 1 33 11 9.5 |
7. | 4 1 1 40 5 5 |
8. | 4 1 1 41 5 5 |
9. | 5 1 2 37 8 8.333333 | <--
10. | 5 1 2 38 8 8.5 |
|----------------------------------------------|
11. | 6 1 2 42 9 8.5 |
12. | 6 1 2 43 9 8.666667 | <--
+----------------------------------------------+
My program differs from Mirkos expectation in two observations.
-- Bill
[email protected]
------------------------------------------------------------------------------
clear
input id region educ age income
1 2 1 25 5
1 2 1 26 5
2 2 1 29 8
2 2 1 30 8
3 2 1 32 11
3 2 1 33 11
4 1 1 40 5
4 1 1 41 5
5 1 2 37 8
5 1 2 38 8
6 1 2 42 9
6 1 2 43 9
end
sort region educ age
save master, replace
use master, clear
rename age cohort
drop id
drop in f/l
describe
save sofar, emptyok replace
forvalues i= -5(1)5 {
use master, clear
drop id
replace age = age + `i'
rename age cohort
append using sofar
save sofar, replace
}
sort region educ cohort
by region educ cohort: gen yref = sum(income)/_n
by region educ cohort: keep if _n==_N
drop income
list
rename cohort age
sort region educ age
save ref, replace
use master, clear
merge region educ age using ref, nokeep
sort id educ age
list
------------------------------------------------------------------------------
<end>
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/