Re: st: How to set calibrated weights
From: Steve Samuels <[email protected]>
To: [email protected]
Subject: Re: st: How to set calibrated weights
Date: Wed, 24 Oct 2012 09:41:40 -0400
Veronica:
I looked through the NIDS web site information for Wave 2 and finally resorted to a Google search for "NIDS svyset".
That turned up "Introduction to Wave 1 Data May 2012" at: http://www.nids.uct.ac.za/home/index.php?/Nids-Documentation/documents.html
It contains the statement:
"In Stata the recommended svyset command is svyset [pw= w1_wgt], strata(w1_hhdc) psu( w1_hhcluster)."
This is incorrect syntax. The proper syntax for -svyset- would be
*************************************************
svyset w1_hhcluster [pw= w1_wgt], strata(w1_hhdc)
************************************************
Now you have to find the equivalent w2_ variables. The clue to the PSU is that it takes on 400 unique values. It might be w2_hhcluster, but it could be w2_hhgeo, which you picked out as a cluster variable.
So the correct statement is likely to be either:
*************************************************
svyset w2_hhcluster [pw= w2_wgt], strata(w2_hhdc)
svydes
************************************************
OR
*************************************************
svyset w2_hhgeo [pw= w2_wgt], strata(w2_hhdc)
svydes
************************************************
The one for which -svydes- shows #Units = 400 is correct. If both show 400 units, see which one reproduces Table 2 of the "Introduction to Wave 1 Data May 2012". The following code can help do this:
*************************************
egen t_geo = tag(w2_hhgeo)
egen t_cluster = tag(w2_hhcluster)
tab w2_gc_prov if t_geo
tab w2_gc_prov if t_cluster
*************************************
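The 400-value clue can also be checked directly before running -svyset-. The lines below are a minimal sketch, assuming that both w2_hhcluster and w2_hhgeo exist in the Wave 2 file (the first name is only a guess); one_cluster and one_geo are made-up names used for illustration.

```stata
* Sketch only: assumes w2_hhcluster and w2_hhgeo both exist in the data.
* tag() marks exactly one observation per distinct non-missing value.
egen one_cluster = tag(w2_hhcluster)
egen one_geo     = tag(w2_hhgeo)

* Each count is the number of distinct values of that variable;
* the Wave 1 design report describes 400 PSUs.
count if one_cluster
count if one_geo
```

A count of 400 for only one of the two variables would point to that variable as the PSU.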
So it is up to you to do the detective work and to study survey design.
Good luck.
Steve
On Oct 21, 2012, at 5:05 AM, Veronica Galassi wrote:
Dear Steve,
Thank you very much for your time.
This is the quote from the document describing the sampling
methodology (Methodology: Report on NiDS Wave 1, page 9). This
technical document and the one explaining how weights have been built
can be found here:
http://www.nids.uct.ac.za/home/index.php?/Nids-Documentation/technical-papers.html.
"A stratified, two-stage cluster design was employed to be included in
the base wave. In the first stage, 400 PSUs where included from Stats
SA's 2003 Master Sample of 3,000 PSUs...A PSU is defined as a
geographical area that consists of at least one Enumeration Area (EA)
or several EAs from the 2001 census...In some cases it has been
necessary to add EAs to the original EA to meet the requirement of a
minimum of 74 households per PSU."
I tried to contact the organisation responsible for the survey asking
for more info regarding the PSU but they did not come back to me. The
reason why I called the clusters "cluster 1" and "cluster 2" is just
to distinguish them from each other. In the above-mentioned document
there is no clear reference to province and geographical type being
cluster 1 and 2. Looking at the variables in the dataset and reading
the documents, I deduced they were the two clusters in question.
This is what I typed when I tried not to specify the PSU:
"svyset [pw=w2_wgt], strata ( w2_gc_dc)|| w2_hhgeo|| w2_gc_prov"
And this is the error I got back (r198): "invalid use of _n;
observations can only be sampled in the final stage".
Yes, I tried to set the weights with the statement "svyset w2_gc_prov
[pw = w2_wgt], strata(w2_gc_dc) || w2_hhgeo", followed by svydes.
This is the output:
                                 #Obs per Unit
                           --------------------------
 Stratum   #Units     #Obs      min     mean      max
-------- -------- -------- -------- -------- --------
       1       1*      234      234    234.0      234
       2       1*      469      469    469.0      469
       3       1*      363      363    363.0      363
       4       1*      214      214    214.0      214
       5       1*      280      280    280.0      280
       6       1*      307      307    307.0      307
       7       1*      183      183    183.0      183
       8       1*      315      315    315.0      315
       9       1*      302      302    302.0      302
      10       1*      210      210    210.0      210
      12       1*      204      204    204.0      204
      13       1*      431      431    431.0      431
      14       1*      296      296    296.0      296
      15       1*      425      425    425.0      425
      16       1*      209      209    209.0      209
      17       1*      222      222    222.0      222
      18       1*      265      265    265.0      265
      19       1*      153      153    153.0      153
      20       1*      173      173    173.0      173
      21       1*      638      638    638.0      638
      22       1*      455      455    455.0      455
      23       1*      651      651    651.0      651
      24       1*      443      443    443.0      443
      25       1*      573      573    573.0      573
      26       1*      405      405    405.0      405
      27       1*      206      206    206.0      206
      28       1*      478      478    478.0      478
      29       1*      388      388    388.0      388
      30       1*      511      511    511.0      511
      31       1*      375      375    375.0      375
      32       1*      359      359    359.0      359
      33       1*      328      328    328.0      328
      34       1*      245      245    245.0      245
      35       1*      317      317    317.0      317
      36       1*      440      440    440.0      440
      37       1*      278      278    278.0      278
      38       1*      442      442    442.0      442
      39       1*      376      376    376.0      376
      40       1*      154      154    154.0      154
      42       1*      347      347    347.0      347
      43       1*      400      400    400.0      400
      44       1*      236      236    236.0      236
      76       2       374      124    187.0      250
      81       2       237       50    118.5      187
      82       2       187        3     93.5      184
      83       2       384       73    192.0      311
      84       2       205        2    102.5      203
      88       2       233       14    116.5      219
     171       1*      474      474    474.0      474
     275       1*      403      403    403.0      403
     572       1*      665      665    665.0      665
     773       1*      285      285    285.0      285
     774       1*      505      505    505.0      505
-------- -------- -------- -------- -------- --------
      53       59    18252        2    309.4      665
                      3703 = #Obs with missing values in the
                  --------   survey characteristics
                     21955
After setting the weights in this way, I tried to compute some
descriptive statistics by typing: "svy: mean (tot_grem_k) if
tot_grem_k>0 & w2_a_cgprv1!=10"
I got back the mean, but the standard errors were missing. Stata
returned the following note: "Note: missing standard error
because of stratum with single sampling unit." This matches the
single-unit strata (marked 1*) in the table above.
I hope this clarifies the sampling methodology a bit.
Thank you so much for your precious help, I am learning a lot from
your comments!!!
Kind regards,
Veronica
2012/10/20 Steve Samuels <[email protected]>:
>>
>> On Oct 20, 2012, at 5:08 AM, Veronica Galassi wrote:
>>
>> Dear Steve,
>>
>> Thank you very much for your kind reply and the useful references!
>> Your answer actually clarified many other doubts I had.
>>
>> Your intuition that my post-stratified weights are calibrated is
>> correct. Unfortunately, I checked again the documents explaining the
>> sampling methodology and there the PSU is simply defined as a
>> geographic area containing more than 74 dwellings. Therefore I expect
>> the number of PSU to be high (around 3,000) whereas I only have 9
>> provinces and 4 geographical types in my survey. This implies that
>> none of my cluster variables can be the PSU.
>
> You still haven't persuaded me. I'd have to see the quote from the study
> documents. Or, better, post a link to them if they are online. You'd
> better figure out what role, if any, the cluster variables have in the
> design. Why did you name them "cluster 1" and "cluster 2"?
>> However, if I got your point, it does not really matter which PSU I
>> indicate when conducting descriptive statistics. Is it correct?
>
> No, it is not. It is scientifically irresponsible to publish estimates
> of descriptive statistics without indications of uncertainty (SEs, CIs).
>
>> For
>> this reason, I also tried not to indicate any PSU but Stata gave me
>> back the error: "invalid use of _n; observations can only be sampled
>> in the final stage".
> See FAQ Section 3.3, first sentence.
>
>> To cut it short, do you still believe I can use the statement "svyset
>> w2_gc_prov [pw = w2_wgt], strata(w2_gc_dc) || w2_hhgeo" you previously
>> indicated to set my calibrated weights? (In my case I cannot use the
>> fpc option).
>
> I don't know, because you have not yet correctly described the sampling
> design. As an aside, have you even tried the statement, which assumed
> that w2_gc_prov is the PSU? When you do, follow it by -svydes-.
>
>>
> 2012/10/20 Steve Samuels <[email protected]>:
>> Veronica,
>>
>> The PSU variable is not missing. It is the sampling unit at the first
>> stage of sampling and it's one of your cluster variables, probably
>> "cluster 1" (check). Your statement that one must know the PSU variable
>> to use probability weights is also incorrect. One can get proper
>> weighted estimates, though not standard errors, without knowing the PSU.
>>
>> I'm not sure what's wrong with your -concat- statement. I would have
>> used "egen combination = group()". For it to have worked, the value of
>> the "post-stratification weight" would have to be the population count
>> for each combination of the three variables.
>>
>> If the "post-stratification" weights are not integers, they are probably
>> "calibration" weights that have already adjusted the probability
>> weights. In that case, further post-stratification is likely to be
>> superfluous. You would then use the "post-stratification weight" in place of
>> the probability weights. All weights should be
>> described in the study documents (though usually not the "codebook"). If
>> they are not, then contact the organization that did the study for
>> details.
>>
>> If sampling was without replacement at one or more stages,
>> you could use the fpc() option for those stages. In practice,
>> it makes a difference only for the first stage.
>>
>> In any case, one guess at a -svyset- statement (assuming the
>> "post-stratification weight" is a "calibration" weight) is:
>> *************************************************************
>> svyset w2_gc_prov [pw = w2_wgt], strata(w2_gc_dc) || w2_hhgeo
>> **************************************************************
>>
>> But I could be wrong, depending on how w2_wgt was calculated.
>>
>> Before proceeding, I suggest that you learn more about sampling or take
>> a survey course. I gave some references in:
>> http://www.stata.com/statalist/archive/2012-09/msg01058.html.
>> The Stata survey manual is also a very good resource, though the section on
>> post-stratification is skimpy.
>>
>> Steve
>>
>>
>> On Oct 19, 2012, at 1:57 PM, Veronica Galassi wrote:
>>
>> Dear Statalisters,
>>
>> I am writing you concerning the application of calibrated weights to
>> my dataset for the computation of descriptive statistics only.
>>
>> The dataset I am working on collects information at household and
>> individual level and comes from a stratified, two-stage clustered
>> sample. The following are the variables I have:
>> - probability weights: w2_dwgt
>> - strata: w2_gc_dc
>> - cluster 1: w2_gc_prov
>> - cluster 2: w2_hhgeo
>> - post-stratified weights: w2_wgt
>> - age intervals: w2_age_intervals
>> - gender: w2_best_gen
>> - population group: w2_best_race
>>
>> In order to set the probability weights using the command svyset, I
>> need the psu variable. As you may have noticed, this variable is
>> missing and this makes it impossible for me to set pweights.
>> In addition, from a couple of previous statalist conversations ( see
>> in particular: http://www.ats.ucla.edu/stat/stata/faq/svy_stata_post.htm
>> and http://www.stata.com/statalist/archive/2012-02/msg00584.html), I
>> understood that:
>> - when using calibrated weights I still have to set pweights and
>> specify the original strata and clusters
>> - In order to apply calibrated weights I need to know the characteristics
>> on the basis of which the sample has been post-stratified (in my case
>> age intervals, gender and population groups).
>>
>> Therefore, I tried to set my post-stratified weights using the
>> following command:
>> "svyset [pw=w2_dwgt], strata (w2_gc_dc) poststrata (w2_age_intervals
>> w2_best_gen w2_best_race) postweight(w2_wgt)"
>> which did not work because in Stata the poststrata must be mutually
>> exclusive and thus only one variable can be specified.
>>
>> In order to overcome this problem, I tried to generate a variable
>> which is a combination of the three characteristics by using the
>> command
>> "egen combination=concat( w2_age_intervals w2_best_race w2_best_gen),
>> format (float)".
>> However, this command generated a variable containing only missing
>> values and for this reason Stata gave me back the error:
>> "option postweight() requires option poststrata()".
>> The only way to make Stata set the post-calibrated weight was by using
>> the command
>> "svyset, poststrata (combination) postweight(w2_wgt)" with combination
>> being a string variable. However I am scared that this command is not
>> complete.
>>
>> At this point, I would really appreciate any hint on what I am doing
>> wrong and how to proceed to set my post-stratified weights.
>>
>> Many thanks for your help!
>>
>> Kind regards,
>>
>> Veronica Galassi
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/