On 9/15/07, Erasmo Giambona <[email protected]> wrote:
> Thanks very much Stas. The problem is that the estimate goes from a
> p-value of less than 0.01% to a p-value of 19%, so I am in the dilemma
> of trying to figure out which one is more reliable. I would truly
> appreciate a little bit more of your time. Below you suggest looking
> at the confidence intervals. Are you suggesting that I compare the
> bootstrap intervals with the sandwich intervals? Would it make sense
> to check what happens if I increase my repetitions from 1000 to, say,
> 5000, given that I have more than 1600 clusters?
> I would appreciate any further comments on this.
I don't think increasing the number of subsamples from 1000 to 5000
would change things much, although of course you could try it.
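If you do want to try it and put the two sets of intervals side by
side, a minimal sketch would be something like the following (the
variable names y, x1, x2 and the cluster identifier clustid are
placeholders for your own):

* sandwich (cluster-robust) intervals
regress y x1 x2, vce(cluster clustid)
* cluster bootstrap with 5000 replications, then normal-based and
* percentile intervals from the replicates
bootstrap _b, reps(5000) seed(20070915) cluster(clustid): ///
    regress y x1 x2
estat bootstrap, normal percentile

If the percentile intervals look very different from the normal-based
ones, that in itself is a sign the bootstrap distribution is badly
behaved.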
Frankly, I'd be at a loss... both methods are justifiable, so if
they diverge that much, I'd say that neither of them is truly reliable.
If anything, I would expect this sort of instability from a variable
that does not change much within a cluster, although the sandwich
estimate should catch that.
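A quick way to check whether that is what is going on (again with
placeholder names, x1 for the suspect regressor and clustid for the
cluster identifier):

bysort clustid: egen double sd_x1 = sd(x1)
* clusters where the regressor never changes: sd is 0, or missing
* for single-observation clusters
count if sd_x1 == 0 | missing(sd_x1)
summarize sd_x1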
As a theoretical possibility, there might be identification issues, so
that some bootstrap samples hit an empirically underidentified
situation -- say, all subsampled clusters have a value of the difficult
variable equal to 1, while the clusters left out have a value of 0, so
everything is OK in the complete sample. Then that variable is
perfectly collinear with the constant in that bootstrap subsample, and
its coefficient is not identified. If that, or something like it, is
plausible, then you can either catch that situation with the -reject-
option, or stratify your sample by that "slowly varying" variable.
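Both routes could look roughly like this, assuming the slowly varying
variable is a 0/1 dummy called, say, slowvar (all names here are
placeholders):

* 1) throw away replications where slowvar got dropped for
*    collinearity (its standard error comes back as 0)
bootstrap _b, reps(1000) seed(12345) cluster(clustid) ///
    reject(_se[slowvar] == 0): regress y slowvar x2 x3
* 2) resample clusters within strata defined by slowvar, so every
*    replication contains both 0s and 1s; this requires slowvar to
*    be constant within clusters
bootstrap _b, reps(1000) seed(12345) cluster(clustid) ///
    strata(slowvar): regress y slowvar x2 x3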
It would also be interesting to see how this stuff behaves if you
subsample a small fraction of your clusters -- say 100 or 200 out of
1600. This would call for rescaling by the sqrt of the effective
sample sizes, and I don't know if Stata does this by default. This
trick is known to rectify a few difficult situations when you
bootstrap a pivotal quantity (t-statistic rather than the coefficient
estimate itself).
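A rough sketch of that, drawing 200 of your roughly 1600 clusters per
replication and collecting the t-statistic (names are placeholders
again, and any rescaling by the square roots of the sample sizes would
have to be done by hand afterwards):

capture program drop subboot_t
program define subboot_t, rclass
    preserve
    * draw 200 clusters with replacement and relabel them
    bsample 200, cluster(clustid) idcluster(newclust)
    regress y x1 x2, vce(cluster newclust)
    * the pivotal quantity: the t-statistic, not the coefficient
    return scalar t = _b[x1]/_se[x1]
    restore
end
simulate t=r(t), reps(1000) seed(54321): subboot_t
summarize t, detail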
BTW what are your clusters? I have not come across a situation with
complex survey designs where that sort of divergence would be a
reasonable outcome. You should cluster at the highest level possible,
which might be regions rather than households if the first stage of
sampling was at the level of the region.
--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: Please do not reply to my Gmail address as I don't check
it regularly.