Many (perhaps most) social survey datasets come with non-integer
weights, reflecting a mix of the sampling scheme (e.g. one person per
household randomly selected), sometimes non-response adjustments, and
sometimes calibration/grossing factors too. Increasingly, in the name of
confidentiality, data depositors are reluctant to identify too much
about the sampling points, so PSU identification is not always
possible [and hence svy approaches in Stata are not really practicable].

At present, Stata will let you use some types of weights, some of the
time, on some types of command, and the logic of which combinations are
allowed is hard to fathom. I appreciate that a simple-minded application
of weights will give you incorrect confidence intervals. But at present
Stata makes it difficult to get even the right point estimates in these
circumstances.

Here's a very simple example of what can happen, based on a simple
indicator variable and a simple weight.
. list

     +------------+
     | male   wgt |
     |------------|
  1. |    0   1.5 |
  2. |    0   1.2 |
  3. |    1    .7 |
  4. |    1   1.1 |
  5. |    0    .7 |
     |------------|
  6. |    1    .8 |
     +------------+
. su male [w=wgt] /// So summarize defaults to aweights.
(analytic weights assumed)

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
        male |       6  6.00000006    .4333333   .5428321          0          1
. tab1 male [w=wgt] /// tab1 defaults to frequency weights, not allowed
(frequency weights assumed)
may not use noninteger frequency weights
r(401);
. tab1 male [iw=wgt] /// tab1 disallows iweights
iweight not allowed
r(101);
. tab1 male [aw=wgt] /// tab1 disallows aweights
aweight not allowed
r(101);
. table male [w=wgt] /// table defaults to freq weights, too
(frequency weights assumed)
may not use noninteger frequency weights
r(401);
. table male [aw=wgt] /// aweights gives you the "wrong" answers, through rounding off to integers

----------------------
     male |      Freq.
----------+-----------
        0 |          3
        1 |          3
----------------------
. table male [iw=wgt] /// iweights give you the "right" answers

----------------------
     male |      Freq.
----------+-----------
        0 |        3.4
        1 |        2.6
----------------------
. tab male [w=wgt]
(frequency weights assumed)
may not use noninteger frequency weights
r(401);
. tab male [aw=wgt] /// aweights with tab gives "right" answers

       male |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |3.400000002       56.67       56.67
          1 |2.599999998       43.33      100.00
------------+-----------------------------------
      Total |          6      100.00
. tab male [iw=wgt] /// iweights with tab gives "right" answers, but with different rounding!

       male |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 | 3.40000004       56.67       56.67
          1 | 2.60000002       43.33      100.00
------------+-----------------------------------
      Total | 6.00000006      100.00
. log close

Again, I am not sure of the logic behind some of these differences, for
what are perhaps the simplest of commands.
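
For anyone wanting to check which of the outputs above are the "right" point estimates, the arithmetic can be redone by hand outside Stata. The following is only a minimal sketch in Python: the six (male, wgt) pairs are copied from the listing, while the variable names and the choice of language are illustrative assumptions. It also assumes that the Std. Dev. shown by summarize under aweights rescales the weights to sum to the number of observations and divides by n - 1, which does reproduce the logged figure.

```python
from math import sqrt

# (male, wgt) pairs copied from the listing in the log.
data = [(0, 1.5), (0, 1.2), (1, 0.7), (1, 1.1), (0, 0.7), (1, 0.8)]

n = len(data)
total_w = sum(w for _, w in data)

# Weighted "frequencies": the sum of the weights within each category.
freq = {}
for male, w in data:
    freq[male] = freq.get(male, 0.0) + w

# Weighted percentages and the weighted mean of the indicator.
pct = {k: 100 * v / total_w for k, v in freq.items()}
mean_male = freq[1] / total_w

# Assumed aweight-style standard deviation: weights rescaled to sum to n,
# weighted sum of squares divided by n - 1.
scaled = [(m, w * n / total_w) for m, w in data]
sd_male = sqrt(sum(w * (m - mean_male) ** 2 for m, w in scaled) / (n - 1))

print("weighted freq:", {k: round(v, 4) for k, v in freq.items()})
print("weighted pct: ", {k: round(v, 2) for k, v in pct.items()})
print("mean:", round(mean_male, 7), "sd:", round(sd_male, 7))
```

On these six rows the weighted counts come to 3.4 and 2.6 (56.67% and 43.33%), which is what the iweight runs of table and tab report, and not the 3 and 3 that table produces with aweights.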

I doubt there is much call for an nw (naive weight) option? But
otherwise, for some analyses, one is reduced to multiplying and/or
rounding off the weights to get the point estimates that the data
depositors/creators tell you that you should be getting (i.e. the ones
in their report), such as:
gen wgt2=wgt*10
compress
. tab1 male [w=wgt2] /// Right proportions, wrong 'bases'
(frequency weights assumed)

-> tabulation of male

       male |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         34       56.67       56.67
          1 |         26       43.33      100.00
------------+-----------------------------------
      Total |         60      100.00
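
Why the multiply-by-10 workaround gives the right proportions but the wrong bases can be seen from the arithmetic alone: a constant factor cancels out of every percentage but not out of the counts. The sketch below is an illustration in Python rather than Stata; the helper name weighted_table and its scale argument are hypothetical, and the data are the six rows from the earlier listing.

```python
# (male, wgt) pairs copied from the earlier listing.
data = [(0, 1.5), (0, 1.2), (1, 0.7), (1, 1.1), (0, 0.7), (1, 0.8)]


def weighted_table(rows, scale=1.0):
    """Return (weighted counts, percentages) with every weight times `scale`."""
    counts = {}
    for category, w in rows:
        counts[category] = counts.get(category, 0.0) + w * scale
    total = sum(counts.values())
    percents = {k: 100 * v / total for k, v in counts.items()}
    return counts, percents


counts1, pct1 = weighted_table(data)              # original weights
counts10, pct10 = weighted_table(data, scale=10)  # weights multiplied by 10

# The percentages agree; the bases differ by exactly the scaling factor,
# so dividing the scaled counts by 10 recovers the intended bases.
bases_back = {k: v / 10 for k, v in counts10.items()}

print("pct (original):", {k: round(v, 2) for k, v in pct1.items()})
print("pct (x10):     ", {k: round(v, 2) for k, v in pct10.items()})
print("bases (x10):   ", {k: round(v, 1) for k, v in counts10.items()})
print("bases / 10:    ", {k: round(v, 1) for k, v in bases_back.items()})
```

So the scaled table is usable for percentages, but any base read off it (34, 26, 60) has to be divided by the factor by hand before it is reported.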
Surely there should be something better than this?
Steve
Date: Wed, 10 Mar 2004 23:29:46 -0500
From: Richard Williams <[email protected]>
Subject: Re: st: non-integer frequencies?
At 09:49 PM 3/10/2004 -0600, ACHINTYA RAY wrote:
>Sample surveys oftentimes provide weights to convert sample estimates into
>representative population figures. Sometimes such frequency weights are not
>integers (For example, National Health and Nutrition Examination Survey
>III). It seems that Stata can only deal with integer frequency weights. Is
>there a solution? The best that I can do right now is to take the nearest
>integer to the non-integer frequencies. This method seems rather adhoc. Any
>help will be deeply appreciated.
I think iweights will work, at least if the command allows the use of
iweights. e.g. I just tried
. sum income

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      income |     500       27.79   8.973491          5       48.3
. sum income [fw=1.2]
may not use noninteger frequency weights
r(401);
. sum income [iw=1.2]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      income |     500         600       27.79   8.971993          5       48.3

However, remember that, for purposes of statistical inference, the
numbers you get are wrong.
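
One way to see why is to look at how the Std. Dev. moved between the two runs above. The sketch below, in Python rather than Stata, rests on an assumption about the formulas: that the unweighted standard deviation divides the sum of squares by n - 1, and that the iweighted one divides the weighted sum of squares by the sum of the weights minus 1, i.e. it treats the 600 as if it were the sample size. The three input figures are taken from the output above; the variable names are illustrative.

```python
from math import sqrt

n = 500                    # Obs, the same in both summarize runs
w = 1.2                    # the constant weight used in [iw=1.2]
sd_unweighted = 8.973491   # Std. Dev. from the unweighted run

sum_w = n * w              # the "Weight" column of the iweighted run

# Recover the unweighted sum of squared deviations from the reported SD.
ss = sd_unweighted ** 2 * (n - 1)

# Assumed iweight formula: weighted sum of squares over (sum of weights - 1).
sd_iweighted = sqrt(w * ss / (sum_w - 1))

print("sum of weights:", round(sum_w, 6))
print("implied iweighted SD:", round(sd_iweighted, 6))
```

Under that assumption the implied value agrees with the 8.971993 in the iweighted output to the precision shown. That fits the warning: the weights are being counted as if they were 600 observations when only 500 were actually sampled.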
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/