
Re: SV: st: Survey - raking - calibration - post stratification - calculating weights

From	Steven Samuels <[email protected]>
To	[email protected]
Subject	Re: SV: st: Survey - raking - calibration - post stratification - calculating weights
Date	Sun, 7 Dec 2008 13:20:33 -0500

Correction: "by age" in the code examples should have been "by agex".

On Dec 7, 2008, at 11:01 AM, Steven Samuels wrote:

Kristian, raking on two or more variables, with the totals coming from different populations, is easy.
1. Create the initial weight1 = N/n, with "population" N and sample n in age groups, as Stas and I suggested in the previous email.
2. Then create categorized variables for age, medicin, and smoke. You will create counts for these categories (tot_agex, tot_medicin, tot_smoke) from the control percentages, but with a "population size" of 10,000 for each variable.
2.1 Age: These will be numbers based on percentages in the original 5,000 men, though it would be *much* better to base them on the Danish Census data. (If I were a journal reviewer, I would not accept a publication that did not do this unless there was a very good reason.) The data source (5,000 men or census) is known as the "external" or "control" population for age.
I would suggest you create a variable with fewer than 15 categories, as too many categories can prevent the raking algorithm from working. I will call the variable agex.
You must compute the percentages of observations in each category of agex externally and merge them into the 600-man data set.
For example, suppose that in the control population, the first few categories of agex have the following percentages:
agex    pct_agex    tot_agex (= 100 x pct_agex, rounded to the nearest whole number)
1           8.23         823
2          10.41        1041
etc.
Total     100.00      10,000
Important: If the totals do not add to 10,000, then adjust the counts of the largest few categories so that they do.
You can add tot_agex by hand to the 600-man data set, or create it externally and merge it in.
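One way to build and merge the agex control totals is sketched below. It is only an illustration: the first two percentages come from the example above, while categories 3 and 4, their percentages, and the toy 600-row sample are made up so that the block runs on its own in a do-file. The -merge- syntax shown (m:1, assert(), nogenerate) assumes Stata 11 or later.

```stata
* Control table: one row per agex category.
* Rows 1 and 2 are from the example above; rows 3 and 4 are hypothetical.
clear
input byte agex double pct_agex
1  8.23
2 10.41
3 40.00
4 41.36
end

* Convert percentages to counts out of 10,000.
generate long tot_agex = round(100 * pct_agex)

* The control counts must add to exactly 10,000.
quietly summarize tot_agex, meanonly
assert r(sum) == 10000

tempfile agex_totals
save `agex_totals'

* Toy stand-in for the 600-man data set (hypothetical agex values).
clear
set obs 600
generate byte agex = 1 + mod(_n, 4)

* Attach tot_agex to every sample observation; stop if any category
* fails to match.
merge m:1 agex using `agex_totals', assert(match) nogenerate
```

With real data, the toy sample would be replaced by the actual 600-man file, and the assert(match) option would flag any agex category present in one file but not the other.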
2.2. For medicin, do the same kind of categorization, but base the percentages on the 3,750-man data set. Here I assume that medicin has three categories.
medicin    pct_medicin    tot_medicin
1                30.23           3023
2                45.86           4586
3                23.93           2393
Total           100.02         10,002
The original totals must be adjusted so that they add up to exactly 10,000. In this case, for example, I would subtract 1 from the totals for the two largest groups: 3023 -> 3022 and 4586 -> 4585.
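The adjustment can also be done in code rather than by hand. The sketch below uses the three medicin percentages from the table above and moves the largest categories by 1 each until the counts add to 10,000. It assumes the rounding discrepancy is no larger than the number of categories, and the local macro names (diff, k, step) are just illustrative. Run it from a do-file.

```stata
* medicin percentages from the table above.
clear
input byte medicin double pct_medicin
1 30.23
2 45.86
3 23.93
end

* Rounded counts out of 10,000: 3023, 4586, 2393 (sum 10,002).
generate long tot_medicin = round(100 * pct_medicin)

* diff is the amount still to be added (negative if the sum overshoots).
quietly summarize tot_medicin, meanonly
local diff = 10000 - r(sum)

* Move the abs(diff) largest categories by 1 each, in the direction of diff.
if `diff' != 0 {
    local k = abs(`diff')
    local step = sign(`diff')
    gsort -tot_medicin
    replace tot_medicin = tot_medicin + `step' in 1/`k'
    sort medicin
}

* Confirm the totals now add to exactly 10,000.
quietly summarize tot_medicin, meanonly
assert r(sum) == 10000
list medicin pct_medicin tot_medicin
```

For these inputs the result matches the manual adjustment: 3022, 4585, and 2393.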
2.3. You can also do the same with smoking: create smoke categories and tot_smoke as the totals in each, which add to exactly 10,000. In fact, if the number of smoking and medicin combinations is small (say 3 x 3 = 9), you can create a combined variable, with the percentages in each.
med_smok     pct_med_smok    tot_med_smok
1
2
3
..
9
If you do this, then you do not need the separate medicin and smoke margins.
3. Rake the three control variables (agex, medicin, smoke) simultaneously.
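**************************CODE BEGINS**************************
* Rake weight1 to the agex, medicin, and smoke margins.
* by() lists agex (not age), per the correction at the top of this message.
survwgt rake weight1, ///
       by(agex medicin smoke) ///
       totvars(tot_agex tot_medicin tot_smoke) ///
       gen(weight2)
***************************CODE ENDS***************************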
Or, with a combined med_smok margin.
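**************************CODE BEGINS**************************
* Rake weight1 to the agex margin and the combined med_smok margin.
* by() lists agex (not age), per the correction at the top of this message.
survwgt rake weight1, ///
       by(agex med_smok) ///
       totvars(tot_agex tot_med_smok) ///
       gen(weight2)
***************************CODE ENDS***************************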
(Note the comma in the first line, which was missing from my previous post.) Rarely will you need more than the default 10 iterations in -survwgt rake-. If you do, the program will issue an error message. You can increase the number by adding a -maxrep- option at the end, e.g. "maxrep(100)".
If the number of sample observations in any control cell (agex, medicin, smoke, or med_smok) is too small, then the program may not converge or will take a long time. In that case, you will need to merge sparse adjacent categories. Suppose, for example, that you start out with 9 medicin-smoke combinations, but two of them have few observations among the 600 men in the final sample. Then merge these into adjacent categories and create a new 7-category variable.
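Checking for sparse cells and merging them might look like the sketch below. Everything in it is hypothetical: the toy data, the choice of categories 8 and 9 as the sparse ones, the decision to fold both into category 7, and the new variable name med_smok7.

```stata
* Toy sample of 600 with a 9-category med_smok in which categories
* 8 and 9 are sparse (all values hypothetical).
clear
set obs 600
generate byte med_smok = 1 + mod(_n, 7)
replace med_smok = 8 in 1/3
replace med_smok = 9 in 4/5

* Inspect the sample count in each control cell.
tabulate med_smok

* Fold the sparse categories 8 and 9 into category 7.
recode med_smok (8 9 = 7), generate(med_smok7)
tabulate med_smok7
```

The control totals have to be collapsed the same way: the total for a merged category is the sum of the totals of the categories folded into it, so that the new margin still adds to exactly 10,000.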
4. Finally: -svyset- your data and run Stata's survey programs:

svyset _n [pweight=weight2], strata(age_gp)
Here "age_gp" is your original age variable with 15 categories. You can probably omit the strata option at no loss. Be sure that if you want estimates for subpopulations, you use the -subpop- option and not an "if" qualifier.
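As an illustration of the -subpop- advice, here is a small sketch on simulated data. The variables smoker and y, the seed, and the weights are all invented for the example; only the -svyset- line follows the one given above. It assumes a Stata version with runiform() and rnormal().

```stata
* Toy data standing in for the raked 600-man sample (all hypothetical).
clear
set seed 2008
set obs 600
generate byte age_gp = 1 + mod(_n, 15)
generate double weight2 = 10 + 10 * runiform()
generate byte smoker = runiform() < 0.3
generate double y = rnormal(50, 10)

* Declare the design: each man is his own sampling unit, weighted by weight2.
svyset _n [pweight = weight2], strata(age_gp)

* Estimate within a subpopulation; smoker must be a 0/1 indicator.
svy, subpop(smoker): mean y
```

With subpop(), every observation stays in the variance calculation and only the estimate is restricted to smokers. An "if" qualifier would drop the non-smokers before the design is taken into account, which can give incorrect standard errors.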
-Steven



On Dec 7, 2008, at 4:52 AM, Kristian Wraae wrote:
Thanks Stas & Steven
What I would like to do is calibrate on some of the measures from the first questionnaire.
I have data on 3,750 men from that first questionnaire, and I would like to transform my 600-man population into my 5,000-man population so that the distribution of chronic diseases and medication is the same as we would expect it to be in the 5,000-man population.

I know how the 5,000 men differ from the 3,750 men regarding age and
geography. There was a slight effect of age, but geography was not
important for non-responders. So adjusting for age is really the only thing
needed at this step.

Then I know how the 600 differ from the 3,750 men. The 600 are better
educated, smoke less, and exercise more; they are also slightly less
prone to chronic diseases and slightly younger.
So I'd like to weight each of the 600 men so that I can compensate for education, smoking, physical activity, chronic diseases (and medication, but they are closely related, so I think I'll just adjust for medication as it is the most precise measure), and age.

So if I want to adjust for those, how do I go about that?
I can see that the code below will adjust on age and geography, since those data are present through the two steps, but the more detailed information on smoking, health, and lifestyle is only present in step two.
I don't know the tot_medgb (medicin) or tot_smokegp (smoking) amongst the 5,000 but only amongst the 3,750.
That is, how do I incorporate the two steps into the raking? Or should I use the post-stratification command instead, since I know these data on the individual level?
As I see it, running two rakings one after the other (one for step 1 and one for step 2) would risk changing what has been done in the first raking.
I might be stupid, but I don't really see how I can do this using the code below.

Also, how many variables is it advisable to rake on?

Thank you for your help
Kristian



-----Original message-----
From: [email protected]
[mailto:[email protected]] On behalf of Steven Samuels
Sent: Sunday, December 07, 2008 6:43 AM
To: [email protected]
Subject: Re: SV: st: Survey - raking - calibration - post stratification - calculating weights


--

Stas, I am envious of statisticians who draw samples from those
lists.  This is a double sample and I agree with your advice: give
everyone the weight for their age stratum:
                           weight1 = N_i/n_i
where "N" denotes population and "n" denotes sample size.  Kristian
apparently thinks of the 5,000 person sample as his "population"; the
figure that he linked to does not show the initial sampling step at
all. He may not have access to  the one-year census counts. If he
does not, I suggest that he use the N's from the 5,000.  I  suggest
below that he also form  geographic categories and rake those, with
population counts, if possible, otherwise with counts from the
5,000.  I roughly calculate that with 5,000 in the first phase
sample, bias in estimates and in standard errors will be small.
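
A minimal sketch of the stratum weight, assuming a variable N_agegp that holds the population (or first-phase) count for each age stratum. The three strata and their counts are invented for the example; with real data N_agegp would come from the census or from the 5,000.

```stata
* Toy second-phase sample: 600 men in 3 age strata (hypothetical sizes).
clear
set obs 600
generate byte agegp = 1 + mod(_n, 3)

* N_agegp: population (or first-phase) count for each age stratum.
* These three counts are made up and add to 5,000.
generate long N_agegp = cond(agegp == 1, 1500, cond(agegp == 2, 1700, 1800))

* weight1 = N_i / n_i, where n_i (= _N within stratum) is the sample count.
bysort agegp: generate double weight1 = N_agegp / _N

* The weights should add back to the population size.
quietly summarize weight1, meanonly
assert abs(r(sum) - 5000) < 1e-6
```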

Kristian, here is how to simultaneously match the age distribution
and the geographic distribution of the final sample to your
population. (This is called "sample balancing" or "raking".)  Form
age groups (agegp) and geographical groupings (geogp) and get the
population counts (or percentages, see below) in each cell.

**************************CODE BEGINS**************************
* tot_agegp = total for population in participant age group (agegp)
* tot_geogp = total for population in participant geographical group (geogp)
**************************************************************
* The comma after weight1 is required (see the follow-up above).

survwgt rake weight1, ///
       by(agegp geogp) ///
       totvars(tot_agegp tot_geogp) ///
       gen(weight2)
***************************CODE ENDS***************************
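
For intuition about what a raking routine does, here is a rough sketch of the iteration written with built-in commands only. It is not the -survwgt- implementation: the toy data and control totals are invented, the base weights are simply equal, and the loop runs a fixed 10 passes with no convergence test.

```stata
* Toy sample of 600 with unequal age and geography margins (hypothetical).
clear
set obs 600
generate byte agegp = cond(_n <= 300, 1, cond(_n <= 500, 2, 3))
generate byte geogp = cond(mod(_n, 3) == 0, 1, 2)

* Control totals on a "population" of 10,000 (made-up numbers).
generate long tot_agegp = cond(agegp == 1, 3000, 3500)
generate long tot_geogp = cond(geogp == 1, 4000, 6000)

* Start from equal base weights.
generate double weight1 = 10000 / _N
generate double weight2 = weight1

* Each pass rescales the weights so that one margin matches its totals,
* then the other; repeating the passes brings both margins into line.
forvalues pass = 1/10 {
    foreach v in agegp geogp {
        bysort `v': egen double cur = total(weight2)
        quietly replace weight2 = weight2 * tot_`v' / cur
        drop cur
    }
}

* Check that both weighted margins now match the control totals.
foreach v in agegp geogp {
    bysort `v': egen double chk = total(weight2)
    assert abs(chk - tot_`v') < 1e-6
    drop chk
}
```

Because only the margins are matched, the age-by-geography cells themselves are never forced to particular totals, which is why raking needs far fewer control numbers than full cross-classification.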


Raking can present problems, so I suggest that you read
http://www.abtassociates.com/presentations/raking_survey_data_2_JOS.pdf
If you cannot get population counts, perhaps you can get population
percentages, multiply by 10 or 100, and round to the nearest whole number
(e.g. 5.12% = 51 or 512), so that the population "size" is 1,000 or 10,000.
For estimating means and proportions, these will yield nearly the
same results as actual population counts. The Denmark census counts
or percentages might be available only in larger age categories than
the ones you used to draw the sample: say (60-64, 65-69, 70-74). If
so, use those for the raking calculations.

If you have, say, four geographical categories, you may be tempted to
use  4 x 15 =60 stratification combinations.  However, with only 600
people in the final sample, the numbers in individual cells will be
too small for reliable estimation.

Theory for double sampling can be found in WG Cochran, 1973, Sampling
Techniques, pp 117-119, 327-334,  or in most other texts.
Unfortunately, raking will not completely solve the problem of non-
response.

-Steven

On Dec 6, 2008, at 11:19 PM, Stas Kolenikov wrote:
Steven,

you might be shocked, but people in Nordic countries do have their
population completely enumerated. Putting NJC's hat on :)), let me
remind you that this is an international list, and different countries
have different standards of how they collect and store their official
data. Denmark has a register with an equivalent of SSN that makes it
possible to combine the data three ways from economic, medical and
social perspectives. That's a survey statistician's and a
microeconometrician's dream... and they actually do have the capacity of
drawing SRS. That is, the first 5000 were SRS of the population, and
then Kristian continued with a stratified second-phase sampling.

I would probably just give everybody the weight = # in age group
across Denmark (in some meaningfully defined period of the study) / #
in age group in the sample. If you treat sample groups as
non-response adjustment cells, that's what this will probably boil
down to after multiplication of three or so fractions.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
