Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: correct confidence intervals of -mean- ?
From
[email protected] (Jeff Pitblado, StataCorp LP)
To
[email protected]
Subject
Re: st: correct confidence intervals of -mean- ?
Date
Mon, 08 Mar 2010 11:10:40 -0600
Dirk Enzmann <[email protected]> is comparing the confidence
intervals from -mean, over()- with those produced by hand using -collapse-:
> Very carefully I want to ask: Are the confidence intervals given by
> -mean- really correct?
>
> Below I compare the results of -mean- with the results of a different
> procedure:
>
> * --------------------------------------
> sysuse auto, clear
> mean price, over(rep78)
>
> local df = e(df_r)
> display "degrees of freedom = n-1 = `df'"
> * Note that here df is the number of the total sample - 1!
>
> * alternative route:
> collapse (mean) mprice=price (sd) sdprice=price (count) nprice=price if
> (rep78 < .), by(rep78)
> * calculate the confidence intervals the -mean- way:
> gen ci95la = mprice - invttail(`df',.025)*sdprice/sqrt(nprice)
> gen ci95ua = mprice + invttail(`df',.025)*sdprice/sqrt(nprice)
> * calculate the confidence intervals a different way:
> gen ci95lb = mprice - invttail(nprice-1,.025)*sdprice/sqrt(nprice)
> gen ci95ub = mprice + invttail(nprice-1,.025)*sdprice/sqrt(nprice)
> * compare both sets of confidence intervals (..a vs ..b):
> list
> * --------------------------------------
>
> My question: Which procedure is correct?
Tirthankar Chakravarty <[email protected]> replied:
> I assume the behaviour you are looking for is given by:
> mean price if rep78==1
> etc.
>
> The reason for the CIs being the way they are in the output of -mean-
> is because the degrees of freedom takes into account the calculation
> of a covariance matrix with observations in other subpopulations
> (albeit with covariance restricted to zero). The manual makes this
> clear for the estimation of means, but not for the covariance
> estimates. Just a guess though.
-mean, over()- computes multiple subpopulation means and estimates their
standard errors simultaneously, posting the means in -e(b)- and the
variance-covariance matrix of the estimated means in -e(V)-.
The degrees of freedom here are those associated with -e(V)-, which was
computed using the entire estimation sample. With the auto dataset, this
number is 68 = 69-1, since there are 5 missing values in -rep78- 69 = 74-5.
Tirthankar point out that the covariance is restricted to zero. Actually the
covariance are zero by construction in the above example. Non-zero
covariances are only possible with clustered data for this situation, provided
the subpopulations are not nested within the clusters or vice versa. With
clustered data, the degrees of freedom would become C-1, where C is the number
of clusters.
Kit Baum <[email protected]> also replied:
> These are the same standard errors of mean reported by
>
> tabstat price,by(rep78) stat(mean sd n semean)
>
> He wonders whether the DF used in calculating s.e.(mean) should be that of
> the full sample. I think that -mean- and -tabstat- are both using the
> notion that you have a model y = mu + \epsilon, where var(\epsilon} is a
> population parameter. Thus the variance of \epsilon is a constant for all
> subsamples, and when you calculate s.e. mean, you use the sqrt of that
> common variance and divide by the sqrt(sample size) of the subpopulation.
> You can see that is being done by -tabstat- by comparing the sd, n and
> semean columns.
>
> What does surprise me is that the CIs generated by these methods differ so
> widely from those computed by
>
> reg price i.rep78
> margins rep78
>
> The differences are not just a small-sample/large-sample adjustment of the
> Root MSE. If you take apart the VCE of a regression of price on all five
> dummies, no constant term, you find a diagonal matrix containing the
> inverses of the respective sample sizes, so the difference has to lie in the
> computation of \hat{sigma2} which multiplies inv(X'X).
Each subpopulation has its own variance, so multiple -regress- fits are
required.
The only way to get -regress- to reproduce the results from -mean, over()- is
to use -suest-, which requires us to use -[pw=1]- in order to trick -mean-
into producing the robust version of -e(V)-.
***** BEGIN:
. sysuse auto
(1978 Automobile Data)
. svyset _n
pweight: <none>
VCE: linearized
Single unit: missing
Strata 1: <one>
SU 1: <observations>
FPC 1: <zero>
. gen byte touse = !missing(rep78)
. forval i = 1/5 {
2. quietly regress price if rep78 == `i'
3. est store r`i'
4. }
. suest r1 r2 r3 r4 r5
Simultaneous results for r1, r2, r3, r4, r5
Number of obs = 69
------------------------------------------------------------------------------
| Robust
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
r1_mean |
_cons | 4564.5 263.1901 17.34 0.000 4048.657 5080.343
-------------+----------------------------------------------------------------
r1_lnvar |
_cons | 12.51745 .3561436 35.15 0.000 11.81942 13.21548
-------------+----------------------------------------------------------------
r2_mean |
_cons | 5967.625 1192.433 5.00 0.000 3630.499 8304.751
-------------+----------------------------------------------------------------
r2_lnvar |
_cons | 16.36588 .6505966 25.16 0.000 15.09073 17.64102
-------------+----------------------------------------------------------------
r3_mean |
_cons | 6429.233 637.4178 10.09 0.000 5179.917 7678.549
-------------+----------------------------------------------------------------
r3_lnvar |
_cons | 16.33535 .284271 57.46 0.000 15.77819 16.89251
-------------+----------------------------------------------------------------
r4_mean |
_cons | 6071.5 394.4743 15.39 0.000 5298.345 6844.655
-------------+----------------------------------------------------------------
r4_lnvar |
_cons | 14.88804 .2665945 55.85 0.000 14.36552 15.41055
-------------+----------------------------------------------------------------
r5_mean |
_cons | 5913 757.488 7.81 0.000 4428.351 7397.649
-------------+----------------------------------------------------------------
r5_lnvar |
_cons | 15.73862 .4663947 33.75 0.000 14.82451 16.65274
------------------------------------------------------------------------------
. est store suest
. mean price [pw=1], over(rep78)
Mean estimation Number of obs = 69
1: rep78 = 1
2: rep78 = 2
3: rep78 = 3
4: rep78 = 4
5: rep78 = 5
--------------------------------------------------------------
Over | Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
price |
1 | 4564.5 263.1901 4039.312 5089.688
2 | 5967.625 1192.433 3588.161 8347.089
3 | 6429.233 637.4178 5157.286 7701.181
4 | 6071.5 394.4743 5284.339 6858.661
5 | 5913 757.488 4401.456 7424.544
--------------------------------------------------------------
***** END:
When -suest- combines the results from multiple model fits, it resets the
scores for out of sample observations to zero, effectively treating the
estimation sample from each model fit as its own subpopulation. This is
exactly what is required in order to reproduce whan -mean, over()- is doing.
--Jeff
[email protected]
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/