Hi,
Consider the following problem: I want to regress age on sex, but my
dataset was collected from four sites, so I'd like to control for site. I
could do:
xi: reg age i.sex i.site
But graphical examination suggests that the residual variance differs
from site to site. What's the solution?
I've done some research already, and it seems that if I use either -vwls-
with the sd option, or -reg- with the aweight option, I'll be able to get
round this to a certain extent. The problem is I'll first need to estimate
the variance in another way, most likely by obtaining the residuals from
an OLS regression first.
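
To make the two-step idea concrete, here is a minimal sketch of what I have in mind. The variable names ehat, ehat2, sig2 and sd_site are my own, made up for illustration, and I am assuming one common variance per site, estimated as the plain mean of the squared OLS residuals within that site (no degrees-of-freedom correction).

```stata
* Step 1: ordinary least squares, keeping the residuals
xi: regress age i.sex i.site
predict double ehat, residuals

* Step 2: estimate one residual variance per site
* (assumption: simple within-site mean of squared residuals)
gen double ehat2 = ehat^2
egen double sig2 = mean(ehat2), by(site)

* Step 3a: weighted regression, analytic weights inversely
* proportional to the estimated variance
xi: regress age i.sex i.site [aweight = 1/sig2]

* Step 3b: alternatively, -vwls- with the estimated standard deviation
gen double sd_site = sqrt(sig2)
xi: vwls age i.sex i.site, sd(sd_site)
```

As far as I understand, -vwls- treats the supplied standard deviations as known, so its standard errors need not match those from -reg- with aweights even though the coefficients should agree.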
Besides being rather long-winded, this approach is theoretically not
optimal, because the variance estimates come from the unweighted fit
rather than the weighted regression. Still, if this is the best way to go about
the problem, I'll probably use it. One question is: Can the use of aweight
be readily extended to more complicated models such as -glm- or -xtgee- to
account for heterogeneity in variances? If so, how?
One of the great features of Stata is its robust option in many estimation
commands. In theory, in normal linear regression it replaces the
variance matrix of the errors (e) with an empirical one based on the
residuals. I foresee that one solution to my problem would be to create a
variance matrix that is halfway between the OLS one and this empirical one,
that is, one in which the squared residuals are averaged within each group
(site). One problem with the robust option is that if a subgroup is too
small, it often gives rubbish estimates of the standard errors. I wonder
whether averaging within site could help with that too. Has anyone done any
methodological investigation into this technique?
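
For reference, these are the existing robust fits I would compare any such estimator against. This is only a sketch of the built-in options, not of the site-averaged estimator itself, which as far as I know has no ready-made command. I am assuming a Stata version that accepts vce(); older versions take robust and hc3 as plain options instead.

```stata
* Huber/White sandwich estimator: each observation contributes
* its own squared residual to the variance matrix
xi: regress age i.sex i.site, vce(robust)

* HC3 variant, which inflates each squared residual using its leverage
xi: regress age i.sex i.site, vce(hc3)
```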
Yours,
Tim Mak
PS: This query has also been posted on allstat.