Re: st: Variance estimation with clusters
---
I really like Austin's idea of using weights that represent
person-years. Either weighting scheme will run into trouble if the
probability that a worker is observed in year 2, given observation in
year 1, is correlated with analysis variables. Suppose, for example,
one is studying occupational health. If some workers leave their jobs
before the year 2 survey because of health problems, those with two
years of data will be healthier. This is the well-known 'healthy
worker effect'.
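One rough way to look for this kind of selective attrition is to model, among year-1 records, whether the worker is seen again in year 2. The sketch below assumes a long-format file with hypothetical variables person_id, year (coded 1 or 2), health, lnwage and estab_id; none of these names come from the survey itself.

```stata
* Hypothetical variable names throughout; adapt to the actual survey file.
* Flag workers who have a year-2 record (1 = observed in year 2, 0 = not)
bysort person_id: egen byte in_both = max(year == 2)

* Among year-1 records, does re-observation depend on analysis variables?
logit in_both health lnwage if year == 1, vce(cluster estab_id)
```

Clearly nonzero coefficients would suggest that the two-year sample differs systematically from the year-1 sample, though a check like this cannot detect selection on unobserved characteristics.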
-Steven
On Nov 8, 2007, at 11:00 AM, Austin Nichols wrote:
Steven makes some good points. I have a slightly different take:
1. Use the -fpc- option, but understand what it means. Imagine you
"sampled" w/o replacement 100% of establishments and workers in a
population; with the fpc's, all standard errors would be zero. This
is as it should be; the svy SEs in a regression using the population
are zero, because svy SEs represent deviations around the population
value (not Fisher-Neyman notions of deviations about what might have
been observed in the population with a different random sprinkling of
regressors on individuals).
2. svy + panel = trouble. If you want to run a fixed-effect
regression, consider -areg- which allows pweights that vary over time
and a -cluster- option.
3. I would use the time-specific weights which measure the number of
person-years each observation represents in the population of workers
in the two years. The population is then not people, but people*time.
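A minimal sketch of points 2 and 3 together, assuming hypothetical variables lnwage, tenure, year, person_id, estab_id and a time-specific person-year weight pw_py (none of these names appear in the original data), and assuming each worker stays in one establishment so that persons are nested within clusters. The factor-variable notation i.year needs Stata 11 or later.

```stata
* Hypothetical variable names; pw_py is assumed to vary by year.
* Person fixed effects are absorbed; standard errors are clustered
* on establishment.
areg lnwage tenure i.year [pweight = pw_py], absorb(person_id) vce(cluster estab_id)
```

In older releases the clustering option is written cluster(estab_id) rather than vce(cluster estab_id).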
On 11/8/07, Steven Joel Hirsch Samuels <[email protected]> wrote:
--
Maury:
I would only add to Austin's good advice:
1. If you are doing regressions and hypothesis tests, do not use the
fpc terms. Imagine you had studied 100% of establishments and
workers in a population; with the fpc's, all standard errors would be
zero.
2. Stata's panel data and multi-level model -xt- commands will not
respond to -svyset-. For panel data analysis, the options
accommodating the survey design vary by command.
3. You should probably use the survey weights from year 1; but the
study documentation may have other advice. Obviously these weights
will not sum to the population size in either year 1 or year 2. If
the survey deliberately over-sampled a class of workers which is the
subject of your analysis (e.g. you wish to compare a minority to a
majority group, and the survey over-sampled the minority group), you
should probably ignore the survey weights altogether.
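To see how much the weighting choice in point 3 matters in practice, one could fit the same model with and without weights and compare. This is only a sketch: lnwage, minority, tenure, pw1 (the year-1 survey weight) and estab_id are hypothetical names, not variables from the survey.

```stata
* Hypothetical variable names; both models cluster on establishment.
* Unweighted fit
regress lnwage minority tenure, vce(cluster estab_id)
estimates store unw

* Fit using the year-1 survey weights
regress lnwage minority tenure [pweight = pw1], vce(cluster estab_id)
estimates store wtd

* Coefficients and standard errors side by side
estimates table unw wtd, se
```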
-Steven
On Nov 8, 2007, at 10:16 AM, Austin Nichols wrote:
Maury Gittleman <[email protected]>:
Just clustering on establishment is probably sufficient.
You can also specify two levels of clustering with -svyset- e.g.
webuse stage5a
svyset su1 [pweight=pw], fpc(fpc1) || su2
where su1 is your establishment id, fpc1 the number of distinct
employees in both years, and su2 is a person id.
Usually the second level of clustering is largely irrelevant. But
not always...
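* one level of clustering (su1 only)
svyset su1 [pweight=pw], fpc(fpc1) strata(strata)
svy: reg yreg x?
est sto c1lev
* two levels of clustering (su1, then su2 within su1)
svyset su1 [pweight=pw], fpc(fpc1) strata(strata) || su2, fpc(fpc2)
svy: reg yreg x?
est sto c2lev
* compare the stored results (esttab is part of the -estout- package from SSC)
esttab *, mti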
On 11/8/07, Gittleman, Maury - BLS <[email protected]> wrote:
Hello,
I have a question concerning Stata's approach to estimating standard
errors in the presence of clustered survey data. The survey I'm using
collects information on individual wages by first selecting
establishments at random and then collecting information on multiple
workers within each establishment. So it is clear that, when I'm
running regressions, I need to cluster on establishment.
My question arises when I use two years of data from the same survey.
For about 4/5 of the individuals, there will be data for two years, and
I would expect that the correlation between the errors for any given
individual will be higher than the correlation between the errors for
two different individuals at the same establishment. My thinking is
that I still want to define clusters by establishments, as the variance
estimation is said to be robust to any arbitrary intra-cluster
correlation.
Is this the right way to go or is there an alternative approach that
might be superior?
Thanks very much.
Maury
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
Steven Samuels
[email protected]
18 Cantine's Island
Saugerties, NY 12477
Phone: 845-246-0774
EFax: 208-498-7441