I am trying to refresh my statistical knowledge and, at the same time, my
experience with Stata (I currently run version 6, but will upgrade to 8).
Now I desperately need some guidance on the best way to approach an analysis
of a data set I have.
The data come from a study where 19 pregnant sows (16 with infected
foetuses, 3 infected with healthy foetuses) were tested for antibodies,
expressed as optical density (OD), repeatedly during gestation. Each sow
was tested 8 times.
Basically I want to know if OD and the day in gestation (DIG) when the sow
was tested can predict whether the foetus is infected or not. I anticipate
that there may be an interaction between OD and DIG so I also want to
include an interaction term.
Typing:
logistic inf od*dig, cluster(id) asis
only gives estimates for the main effects (od and dig) and not for the
effect modifier that I wanted to test. Why is this?
However, if I first create an interaction term with
generate od_dig = od*dig
and then type
logistic inf od*dig, cluster(id) asis
Stata somehow identifies od_dig as something I want in the model.
Something I had not expected, but OK.
However, I also have a multicollinearity problem (high correlation between
the interaction term and the main effects), so I tried to reduce it by
centering od (creating a new variable od2 = od - mean(od)) and building a
new interaction term (od2_dat = od2*dig).
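In Stata terms the centering amounts to something like the sketch below. It
assumes od and dig are already in memory and that od2 and od2_dat do not yet
exist. Since mean() is not available inside generate, the mean is taken from
r(mean) after summarize; the correlate line at the end is only a suggested
check, not something from my original session.

```stata
* Sketch only: assumes od and dig are in memory, as in the commands above
quietly summarize od
* centre od on its overall mean (left behind by summarize in r(mean))
generate od2 = od - r(mean)
* interaction between centred od and day in gestation
generate od2_dat = od2*dig
* suggested check: correlations among main effects and the interaction
correlate od2 dig od2_dat
```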
Then at some point I ran my first version again:
logistic inf od*dig, cluster(id) asis
and got the message:
Note: od dropped due to collinearity.
Note: od2_dat dropped due to collinearity.
Hmm. I didn't ask for od2_dat to be in the model. How come it was included?
(I understand that I probably lack some vital Stata info here, so please
excuse me!)
Then I typed:
logistic kdpi od2 andigat od2_dat, cluster(id) asis
which appears to work.
Apart from this interaction confusion, I am still uncertain whether I have
chosen a reasonably correct way of analysing these data, considering that I
have repeated measurements on individuals. I hope that specifying the
cluster option accounts for that, but does it do so properly? Somebody
warned me that, because the mean and the variance of a binomial dependent
variable are linked, adjusting only the variance could still leave biased
point estimates, but I have not found a discussion of that topic in the
manual (so far).
I have tried to use xtgee as well, but I cannot make it converge.
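For concreteness, a population-averaged logistic model of the kind meant here
could be specified roughly as in the sketch below. The exchangeable
correlation structure and the robust option are assumptions chosen for
illustration, and the variable names follow the earlier commands; this is
not necessarily the exact specification that failed to converge.

```stata
* Sketch only: corr(exchangeable) and robust are illustrative assumptions
* id identifies the sow; inf is the binary outcome used above
xtgee inf od2 dig od2_dat, i(id) family(binomial) link(logit) corr(exchangeable) robust
```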
Enough questions for now, I hope someone out there can help!