I'm a student looking for help with my MSc dissertation, which examines factors
associated with delivery by caesarean section. It's an analysis of a
database of about half a million records of women who gave birth in
hospital. I am using logistic regression and because my data are naturally
grouped, I'm using a multi-level approach to take account of the correlation
between women in the same hospital. I am therefore using xtlogit (rather
than logit). I find that I cannot run xtlogit with my entire 500,000
records - Stata comes back with an error saying that it needs to be able to
set matsize to approximately 18,000. Unfortunately the matsize limit for
Stata 7.0 is 800.
I then took a 4% sample (approximately 20,000 records), which is the largest
that Stata can cope with at a matsize of 800. But, and here's the weird
thing that I need help with... The parameter estimates are very dependent
on the sample I take. Sometimes I get a p-value of 0.05; for other samples I
get a p-value of 0.7. Here's an example of what I do to test whether
xdelmid is a predictor of emergency caesarean section.
sample 4 /* this gives me the 4% sample */
xi: xtlogit emerg i.gestat i.age i.xdelmid, pa corr(exch) robust i(provid)
testparm _Ixdel* /* this does a Wald test on xdelmid */
Taking 10 different 4% samples, I find that my estimates differ considerably and
my p-values range from 0.04 to 0.71.
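A minimal sketch of how the ten repeated 4% draws could be scripted so that the runs are reproducible. The seed value and the use of forvalues with preserve/restore are assumptions for illustration only; the model and test lines are the ones shown above.

```stata
set seed 12345                    /* arbitrary seed (assumption) so the draws can be reproduced */
forvalues s = 1/10 {
    preserve                      /* keep a copy of the full dataset for the next draw */
    sample 4                      /* draw a fresh 4% sample of records */
    display "Sample `s'"
    xi: xtlogit emerg i.gestat i.age i.xdelmid, pa corr(exch) robust i(provid)
    testparm _Ixdel*              /* Wald test on xdelmid for this draw */
    restore                       /* return to the full dataset */
}
```

Each pass through the loop starts again from the full dataset, so the ten samples are drawn independently of one another.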
Why can't Stata cope with the full dataset, and why are the parameter
estimates so sensitive to the sample taken?
I would be extremely grateful if someone could help me with this.
Bernadette