Bernadette-
Without knowing how you chose your subsamples, it's hard to say why you
obtained such huge variation in your p-values. If your data is arranged in
large clusters of all caesarean or no caesarean cases, or in large clusters
of similarly-valued predictor variables, your subsamples may not be
representative of the whole dataset.
Instead of xtlogit, try xtgee with a binomial distribution and logit link. I
believe it will work on all your data at once. xtlogit is a
maximum-likelihood approach assuming a random-effects model and is
computationally intense. xtgee uses the method of generalized estimating
equations (GEE) with a robust estimator of variance allowing for the
clusters. It only assumes a certain correlation structure for observations
within clusters - the default is an equicorrelated structure.
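As a rough sketch only, the xtgee call might look like the following. The
variable names (emerg, gestat, age, xdelmid, provid) are taken from your
message; whether this actually runs on the full dataset without hitting the
matsize limit is an assumption I have not tested.

```stata
* Sketch, not a tested solution: GEE logistic model on the full data
* (no "sample 4" step), clustering on hospital (provid).
* family(binomial) + link(logit) gives a logistic model;
* corr(exchangeable) is the equicorrelated working structure.
xi: xtgee emerg i.gestat i.age i.xdelmid, family(binomial) link(logit) corr(exchangeable) robust i(provid)

* Wald test on the xdelmid indicator variables created by xi
testparm _Ixdel*
```

If it does run, comparing the testparm result with the range you saw across
your 4% samples should show how much of the variation was due to sampling.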
Hope this helps.
Al Feiveson
-----Original Message-----
From: Alves, Bernadette [mailto:[email protected]]
Sent: Friday, July 19, 2002 7:41 AM
To: '[email protected]'
Subject: st: request for help - multi-level modelling with a big dataset using xtlogit
I'm a student looking for help with my MSc dissertation looking at factors
associated with delivery by caesarean section. It's an analysis of a
database of about half a million records of women who gave birth in
hospital. I am using logistic regression and because my data are naturally
grouped, I'm using a multi-level approach to take account of the correlation
between women in the same hospital. I am therefore using xtlogit (rather
than logit). I find that I cannot run xtlogit with my entire 500,000
records - stata comes back with an error saying that it needs to be able to
set matsize to approximately 18,000. Unfortunately the matsize limit for
stata 7.0 is 800.
I then took a 4% sample (approximately 20,000 records ) which is the largest
that stata can cope with at a matsize of 800. But, and here's the weird
thing that I need help with.... The parameter estimates are very dependent
on the sample I take. Sometimes I get a p-value of 0.05, for other samples I
get a p-value of 0.7. Here's an example of what I do to test whether
xdelmid is a predictor of emergency caesarean section.
sample 4 /* this gives me the 4% sample */
xi: xtlogit emerg i.gestat i.age i.xdelmid, pa corr(exch) robust i(provid)
testparm _Ixdel* /* this does a Wald test on xdelmid */
Taking 10 different 4% samples, I find my estimates differ considerably and
my p-values range from 0.04 to 0.71.
Why can't stata cope with the full dataset and why are the parameter
estimates so sensitive to the sample taken?
I would be extremely grateful if someone could help me with this.
Bernadette
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/