Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: svy subpop option and e(sample)
From: Richard Williams <[email protected]>
To: [email protected], [email protected]
Subject: Re: st: svy subpop option and e(sample)
Date: Wed, 25 May 2011 09:56:30 -0500
Thanks Steven. That is more or less what I expected. It is still a
potential gotcha though. In non-svy analyses, e(sample) tells you which
cases were used to estimate the coefficients (which are the same as
the cases used to estimate the standard errors). With svy, e(sample)
basically just tells you which cases in the population did not have
missing data, whether they were in the subpopulation or not. If you
are not careful, things like predicted probabilities might be
computed for the entire population when you only wanted them done for
a subpopulation. It seems like the behavior of e(sample) when used with
subpop should at least be mentioned in the documentation, along with
a tip or two on how to do post-estimation commands on the subpopulation only.
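A minimal sketch of one such tip, assuming the nhanes2f example from the original post quoted below and that black is a 0/1 indicator with no missing values. The marker name insub is hypothetical, not something Stata creates:

```stata
webuse nhanes2f, clear
svy, subpop(black): ologit health age female

* With subpop(), e(sample) marks every observation that entered the
* svy computation, so these two counts can differ.
count if e(sample)
count if e(sample) & black == 1

* Hypothetical marker for the subpopulation members of interest.
generate byte insub = e(sample) & black == 1

* Restrict post-estimation work to the marker.
predict p1 p2 p3 p4 p5 if insub
summarize p1 p2 p3 p4 p5 if insub
```

Building the marker once and reusing it keeps later commands from silently falling back on the full estimation sample.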
At 06:20 PM 5/24/2011, Steven Samuels wrote:
Just to elaborate: with sub-populations, the ratio estimator of a
mean with every sample member in numerator and denominator is
necessary because the sample size of the subpopulation is random,
not fixed. This extends to the regression estimators, as they are
functions of means. If you had used an -if- qualifier to restrict
the analysis to black==1, e(sample) would work as you expect; the
estimates would be the same; but the standard errors would be different.
Steve
[email protected]
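The contrast between subpop() and an -if- qualifier can be checked with a rough sketch like the following, assuming the same nhanes2f data and black as a 0/1 indicator. The stored-estimate names m_subpop and m_if are hypothetical. Standard errors under the -if- restriction may come back missing if the restriction leaves a stratum with a single PSU:

```stata
webuse nhanes2f, clear

* Subpopulation estimation: the full sample contributes to the variance.
svy, subpop(black): ologit health age female
estimates store m_subpop
count if e(sample)

* -if- restriction: only black == 1 observations are kept.
svy: ologit health age female if black == 1
estimates store m_if
count if e(sample)

* Side-by-side coefficients and standard errors.
estimates table m_subpop m_if, se
```

The two count commands show how e(sample) differs between the approaches, and the final table makes it easy to compare point estimates against standard errors.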
This is expected behavior, Richard. Everyone contributes to the
standard error, whether in the sub-population or not. For example,
with n = 3 and X1, X2 in the subpopulation and X3 not, let Zi = 1 for
those in the subpopulation, 0 if not.
Then the mean of X is estimated by (W1 + W2 + W3)/(Z1 + Z2 + Z3), a
ratio estimate, with Wi = Xi*Zi. Variation in the mean is
measured by variation between the W's, which includes W3 = 0.
Steve
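The three-observation example can be worked through numerically. This is only an illustration: the X values are made up, and survey weights and design information are ignored for simplicity.

```stata
clear
* Hypothetical values: X1 = 4 and X2 = 6 are in the subpopulation (Z = 1);
* X3 = 9 is not (Z = 0).
input x z
4 1
6 1
9 0
end

* W = X*Z, so W3 = 0 but the third observation stays in the data.
generate w = x*z

summarize w, meanonly
scalar sum_w = r(sum)
summarize z, meanonly
scalar sum_z = r(sum)

* Ratio estimate of the subpopulation mean: (4 + 6 + 0)/(1 + 1 + 0)
display sum_w/sum_z
```

The displayed value is 5, the mean of the two subpopulation members, while the zero for the third observation still enters the numerator and so contributes to the variability of W.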
On May 24, 2011, at 4:02 PM, Richard Williams wrote:
I've just noticed that e(sample) does not work the way I
expect it to when using svy and the subpop option. Specifically,
e(sample) codes everyone in the population as 1, whether they were
in the subpopulation specified or not. I guess I can sort of
see a rationale for doing this (the whole population is used to
compute the standard errors) but it has the potential to screw up
your post-estimation analysis if you only wanted to work with
what you thought was the subpopulation.
The following illustrates this. There are only 1086 cases in the
subpopulation selected, but probabilities are computed for all
10,000 cases. That is, coefficients computed using only the black
subpopulation are used to compute probabilities for the entire population:
. webuse nhanes2f, clear
. svy, subpop(black): ologit health age female
(running ologit on estimation sample)
Survey: Ordered logistic regression
Number of strata   =        30                Number of obs      =       10000
Number of PSUs     =        60                Population size    =   113285074
                                              Subpop. no. of obs =        1086
                                              Subpop. size       =    11189236
                                              Design df          =          30
                                              F(   2,     29)    =       29.87
                                              Prob > F           =      0.0000
------------------------------------------------------------------------------
             |             Linearized
      health |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0452349   .0063482    -7.13   0.000    -.0581996   -.0322703
      female |  -.3975887   .1336441    -2.97   0.006    -.6705263   -.1246511
-------------+----------------------------------------------------------------
       /cut1 |  -4.427029   .2976634   -14.87   0.000    -5.034939   -3.819119
       /cut2 |   -2.97326   .2848889   -10.44   0.000    -3.555081   -2.391439
       /cut3 |  -1.347426   .2497407    -5.40   0.000    -1.857465   -.8373876
       /cut4 |   -.214417   .2857434    -0.75   0.459    -.7979829    .3691488
------------------------------------------------------------------------------
Note: 1 stratum omitted because it contains no subpopulation members.
. predict p1 p2 p3 p4 p5 if e(sample)
(option pr assumed; predicted probabilities)
(337 missing values generated)
. sum p1 p2 p3 p4 p5
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          p1 |     10000    .1362895    .0853605   .0286835   .3358028
          p2 |     10000    .2341256    .0876537   .0835067   .3482424
          p3 |     10000    .3367498    .0391877   .2327486   .3854614
          p4 |     10000    .1639109    .0712129   .0549047   .2749383
          p5 |     10000    .1289243    .0859123   .0284552   .3339704
I think one solution is to change the predict command to something like
predict p51 p52 p53 p54 p55 if e(sample) & black
predict p61 p62 p63 p64 p65 if e(sample) & `=e(subpop)'
But, are there others? Preferably simpler ones? And is this good
behavior for e(sample) in the first place?
-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME: (574)289-5227
EMAIL: [email protected]
WWW: http://www.nd.edu/~rwilliam
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/