I have now tried to do the first step of the raking.
I have 15 age groups and 67 geographic groups (simply based on the zip
codes).
I tried to do the raking first with a smaller number of geographic groups
(10) but the results were more accurate with all groups.
The variables I have are:
age = continuous variable containing the age of the subject at the time of
sampling
dist_study = continuous variable containing the distance from the individual
to me.
age_grp = categorical variable - 15 age strata.
geo_grp = zip code
quest = 1 if individual returned a filled out questionnaire
pop = 1 if individual was amongst the 4975 in the original sample (all of
course had pop=1)
sample = 1 for each finally included subject.
The do file looks like this:
*************
*To get data from the original population
tabstat age
tabstat dist_study
*Raking starts by generating totals in each age group and geographical group
egen tot_age_grp = count(pop),by(age_grp)
egen tot_age_grp_q = count(pop) if quest==1, by(age_grp)
egen tot_geo_grp = count(pop),by(geo_grp)
egen tot_geo_grp_q = count(pop) if quest==1, by(geo_grp)
*Initial weight is generated
gen weight1x = (tot_age_grp / tot_age_grp_q)
keep if quest==1
*(reducing the dataset to 3743 men)
survwgt rake weight1x, ///
by(age_grp geo_grp) ///
totvars(tot_age_grp tot_geo_grp) ///
gen(weight2x)
svyset [pweight=weight2x], strata(age_grp)
*Description
svydes
*Now we estimate the average age in the 4975 men from the 3743 men
svymean age
*Now we estimate the average distance to travel to get to me for the 4975 men
*based on the 3743 men
svymean dist_study
*These are the actual numbers for the 3743 men.
tabstat age
tabstat dist_study
******************
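For anyone who has not used raking before, here is a minimal sketch of what the rake step amounts to: the weights are rescaled to match the age-group totals, then the geographic totals, and this is repeated until both sets of totals are met. The data and the names w and sw are made up for illustration, and I am not claiming that survwgt works internally in exactly this way.

```stata
* Toy illustration of raking (iterative proportional fitting).
* All data and variable names below are hypothetical.
clear
input age_grp geo_grp tot_age_grp tot_geo_grp
1 1 60 50
1 2 60 50
1 1 60 50
2 1 40 50
2 2 40 50
end

* start from equal weights (the real do-file starts from weight1x)
gen double w = 1

forvalues iter = 1/25 {
    * step 1: rescale so the weights sum to the age-group totals
    bysort age_grp: gen double sw = sum(w)
    by age_grp: replace w = w * tot_age_grp / sw[_N]
    drop sw
    * step 2: rescale so the weights sum to the geographic totals
    bysort geo_grp: gen double sw = sum(w)
    by geo_grp: replace w = w * tot_geo_grp / sw[_N]
    drop sw
}

* the weighted counts in each margin should now match the targets
tabstat w, by(age_grp) stat(sum)
tabstat w, by(geo_grp) stat(sum)
```

A fixed number of passes is used here only to keep the sketch short; a real implementation would stop once the weights no longer change by more than some tolerance. None of this is part of the run reported in this message.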
The output from Stata 8 is:
. *************
. tabstat age
variable | mean
-------------+----------
age | 66.6695
------------------------
. tabstat dist_study
variable | mean
-------------+----------
dist_study | 25.90153
------------------------
.
.
. egen tot_age_grp = count(pop),by(age_grp)
. egen tot_age_grp_q = count(pop) if quest==1, by(age_grp)
(1232 missing values generated)
.
. egen tot_geo_grp = count(pop),by(geo_grp)
. egen tot_geo_grp_q = count(pop) if quest==1, by(geo_grp)
(1232 missing values generated)
.
. gen weight1x = (tot_age_grp / tot_age_grp_q)
(1232 missing values generated)
.
. keep if quest==1
(1232 observations deleted)
. *(reducing the dataset to 3743 men)
. survwgt rake weight1x, ///
> by(age_grp geo_grp) ///
> totvars(tot_age_grp tot_geo_grp) ///
> gen(weight2x)
.
. svyset [pweight=weight2x], strata(age_grp)
pweight is weight2x
strata is age_grp
.
. svydes
pweight:  weight2x
Strata:   age_grp
PSU:      <observations>

                                      #Obs per PSU
  Strata                      ----------------------------
 age_grp     #PSUs      #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       346       346         1       1.0         1
       2       333       333         1       1.0         1
       3       304       304         1       1.0         1
       4       297       297         1       1.0         1
       5       284       284         1       1.0         1
       6       275       275         1       1.0         1
       7       249       249         1       1.0         1
       8       246       246         1       1.0         1
       9       231       231         1       1.0         1
      10       209       209         1       1.0         1
      11       212       212         1       1.0         1
      12       210       210         1       1.0         1
      13       184       184         1       1.0         1
      14       174       174         1       1.0         1
      15       189       189         1       1.0         1
--------  --------  --------  --------  --------  --------
      15      3743      3743         1       1.0         1
.
. svymean age
Survey mean estimation

pweight:  weight2x                              Number of obs    =      3743
Strata:   age_grp                               Number of strata =        15
PSU:      <observations>                        Number of PSUs   =      3743
                                                Population size  =      4975

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
     age |   66.66605    .0067455    66.65283    66.67928    .0092211
------------------------------------------------------------------------------
. svymean dist_study
Survey mean estimation

pweight:  weight2x                              Number of obs    =      3742
Strata:   age_grp                               Number of strata =        15
PSU:      <observations>                        Number of PSUs   =      3742
                                                Population size  = 4973.7235

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
dist_s~y |   25.90772    .3139459     25.2922    26.52325     1.01731
------------------------------------------------------------------------------
.
. tabstat age
variable | mean
-------------+----------
age | 66.5895
------------------------
. tabstat dist_study
variable | mean
-------------+----------
dist_study | 25.93867
------------------------
.
end of do-file
As one can see, the true average age amongst the 4975 men is 66.6695 years.
Using raking and svymean, Stata estimates the average age amongst the 4975
men from the information on the 3743 men to be 66.66605 years.
The two agree to within about 0.003 years, which is perhaps not surprising
since age group was one of the raking margins.
Now let us look at the distance to travel. We raked on zip codes, which are
not equivalent to distances, but despite that the results are quite amazing.
We know the average distance to travel is 25.90153 km.
After raking, and basing the results on the 3743 men, Stata estimates the
distance to be 25.90772 km.
Strikingly similar. The unweighted means amongst the 3743 are not as
close (66.5895 years and 25.93867 km), but really not that far off.
The differences will be far greater when raking the 600.
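To make explicit how a weighted estimate like the ones above differs from the plain tabstat mean, here is a small sketch of the point estimate as the weighted sum divided by the sum of the weights. The three observations, the weights and the names wy, num and den are invented for the example; only the point estimate is illustrated, not the survey standard errors.

```stata
* Toy illustration of a weighted mean: sum(w*y) / sum(w).
* Data and names are hypothetical.
clear
input age w
60 1.0
65 1.5
70 2.0
end

* numerator: weighted sum of the variable
gen double wy = w * age
quietly summarize wy
scalar num = r(sum)

* denominator: sum of the weights
quietly summarize w
scalar den = r(sum)

display "weighted mean   = " num/den

* the same point estimate via analytic weights
quietly summarize age [aweight = w]
display "aweight check   = " r(mean)

* the unweighted mean, for comparison
quietly summarize age
display "unweighted mean = " r(mean)
```

In this toy the older subjects carry larger weights, so the weighted mean is pulled above the unweighted one; in my data the raking weights play the same role for the nonrespondents.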
I will now go on.