Dear Laurel, Nick, Roger, Joao Pedro, Giorgio and Todd,
Many thanks for your comments. To put things in perspective, the presenter
was studying new maize varieties and sought to identify socio-economic
factors that may explain the adoption of these varieties. All respondents in
her sample grew some maize (traditional, improved or both), so her dependent
variable was the area under improved varieties (which would then be handled
easily in a censored regression framework or, better still, as a corner
solution outcome). However, she argued that the area allocated needed to be
adjusted for total area under maize: a farmer with 1 acre of maize who
allocates 0.5 acres to the new varieties should not be treated, in terms of
adoption, the same as a farmer with 10 acres of maize who also allocates 0.5
acres. Hence the dependent variable was area under new maize / total maize
area, i.e. a proportion.
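To make the adjustment concrete, here is a minimal sketch of the arithmetic in
Python. The two farmers and the function name adoption_share are hypothetical
and are not taken from the presenter's data.

```python
# Hypothetical farmers: (acres under new maize, total acres under maize).
farmers = {"small": (0.5, 1.0), "large": (0.5, 10.0)}


def adoption_share(new_area, total_area):
    """Share of the maize area planted to new varieties (between 0 and 1)."""
    return new_area / total_area


shares = {name: adoption_share(*areas) for name, areas in farmers.items()}

# Same absolute area under new maize, very different adoption shares.
for name, share in shares.items():
    print(name, share)
```

Both farmers plant the same absolute area to new maize, but the small farmer
has committed half of the maize land and the large farmer only a twentieth.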
From Laurel's email, it would follow that all the independent variables
should also be divided by the maize area, while Nick's email points out
(correctly) that, although the dependent variable lies between 0 and 1, OLS
does not guarantee that the predicted values of y will lie between 0 and 1
(which is one of the main arguments against the Linear Probability Model).
Roger points to a binary dependent variable; however, the dependent variable
here is not quite binary. Joao Pedro suggests something that the presenter
actually did. I still need to think through Giorgio's suggestion, and I am
going to read the paper suggested by Todd.
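Nick's point about predicted values can be illustrated with a small sketch in
Python. The data are made up purely for illustration (one regressor x, and a
proportion y between 0 and 1), and the fit uses the closed-form OLS formulas
for a single regressor rather than any particular package.

```python
# Made-up data: x is a single regressor, y is a proportion in [0, 1].
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.05, 0.10, 0.40, 0.70, 0.95, 1.00]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form OLS slope and intercept for one regressor.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Fitted values at the observed x, and those falling outside [0, 1].
fitted = [intercept + slope * xi for xi in x]
out_of_range = [f for f in fitted if f < 0.0 or f > 1.0]
print(out_of_range)
```

With these numbers the fitted line dips below 0 at the smallest x and rises
above 1 at the largest x, even though every observed y is a valid proportion.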
In the light of the "added flesh" to the problem, I would appreciate your
comments on the best way to proceed (for example, would just including the
total maize area as one of the independent variables be a sufficient
control?).
If the Y-variable is a proportion rather than a binary variable, then you
can still use either -regress- with Huber variances, or -glm- with identity
link and binomial family, or even -glm- with log link and binomial family
if you want multiplicative effects. The -glm- command will warn you that
your Y-variable is not binary, but will still do as it is asked. The main
problem with homoskedastic (equal-variance) linear regression is that, if
the Y-variable is a proportion, then the conditional variance is not likely
to be independent of the conditional mean, because proportions sampled from
a distribution with a mean near 0.5 can vary more than proportions sampled
from a distribution with a mean near 0 or 1. The -family- option of -glm-
simply optimises the estimation under a particular assumption about the
mean-variance relationship, in order to minimise the width of the
confidence intervals if that assumption is true. If you also use the
-robust- option, then your standard errors will still be consistent, even
if you do not guess the mean-variance relationship right first time. I
myself would probably not simply use area under new maize as the Y-variable
and area under total maize as an X-variable, because I would expect the
effect of total maize area on area under new maize to be multiplicative
rather than additive.
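The mean-variance argument above can be sketched numerically. The snippet
below, in Python, tabulates the variance function that a binomial family
assumes, V(mu) = mu * (1 - mu). The function name binomial_variance and the
grid of means are illustrative choices, and real area proportions need not
follow this relationship exactly, which is where robust variances help.

```python
def binomial_variance(mu):
    """Variance function assumed by a binomial family: V(mu) = mu * (1 - mu)."""
    return mu * (1.0 - mu)


# Illustrative grid of conditional means for a proportion.
means = (0.05, 0.25, 0.50, 0.75, 0.95)
variances = {mu: binomial_variance(mu) for mu in means}

for mu, var in variances.items():
    print(f"mean {mu:.2f} -> variance function {var:.4f}")
```

The variance function peaks at a mean of 0.5 and shrinks symmetrically
towards 0 and 1, so an equal-variance regression is unlikely to fit a
proportion outcome well across its whole range.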