Re: st: Question about svyset command
--
Michael, I would say that if you intend to present p-values or to
state whether coefficients are "statistically significant", you are
shifting to a super-population point of view. Recall my paraphrase
from Cochran, that in a finite population, null hypotheses are
always false (except by rare chance): the difference in two
proportions is always non-zero; odds ratios are always different from
1. Thus every null hypothesis is rejected a priori. Hypothesis
testing makes sense only in the super-population setting.
I admit there is a gray area: how do you choose a good model, even
for descriptive purposes, without testing hypotheses about what
should and should not be in the model? My own take is that only
large effects should be in a descriptive model, so that what to
include will usually be clear. If a model is used for descriptive
purposes, whether small effects (e.g., factors with odds ratios of
1.1) are in or out shouldn't matter much.
-Steve
On Feb 19, 2009, at 3:43 PM, Michael I. Lichter wrote:
I agree with Stas about the vital importance of defining the target
population.
Steven, however, is making me more confused about the difference
between inferences about finite populations vs. those about
superpopulations. I'll use my own study as an example.
I'm analyzing results from a survey of physicians regarding health
information technology (HIT) adoption. The survey was stratified
and a couple of the strata had large sampling fractions (like 1/3
and 1/8). My target population is all primary care physicians
delivering patient care during a specific interval in time--and the
interval in time is meaningful, because I expect HIT adoption
levels to be different (higher) today than they were back when the
data were collected. The target population and the (list) frame
population are undoubtedly different for a variety of reasons,
including the inherent near-impossibility of maintaining a complete
and accurate list of any large population. Still, I think I'm
interested in a finite population of actual physicians practicing
at a specific point in time, not a theoretical, infinite
superpopulation. Am I right?
I want to know about (a) current adoption rates by stratum
(estimating proportion & variance), (b) differences in adoption
rates across strata at this particular point in time (e.g., using
chi-square), (c) the general relationship between various
predictors (or covariates) and adoption (e.g., using logistic
regression). Are the first two finite population objectives and the
last a superpopulation-related objective, so that variances should
be estimated one way for the first two and a different way for the
third?
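
For a sense of scale, the sampling fractions mentioned above (1/3 and 1/8) can be turned into the factor by which a finite population correction shrinks a design-based standard error. This is only a minimal sketch in Python, assuming simple random sampling without replacement within each stratum; the function name `fpc_multiplier` is hypothetical and not part of any survey package.

```python
import math

def fpc_multiplier(sampling_fraction):
    """Factor sqrt(1 - f) applied to a standard error under SRS without replacement."""
    if not 0.0 <= sampling_fraction <= 1.0:
        raise ValueError("sampling fraction must be between 0 and 1")
    return math.sqrt(1.0 - sampling_fraction)

# Sampling fractions quoted in the message above.
for f in (1 / 3, 1 / 8):
    print(f"f = {f:.3f}: SE multiplier = {fpc_multiplier(f):.3f}")
```

Under that assumption, a stratum sampled at 1/3 has its standard error reduced by roughly 18%, and one sampled at 1/8 by roughly 6%.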
Thanks.
Michael
Steven Samuels wrote:
--
Thomas could generalize to the entire US in 2005. According to
http://www.icpsr.umich.edu/cocoon/NACJD/STUDY/23862.xml he is
omitting from his data 45 strata that covered the rest of the
country.
I actually agree with Stas. I do think that there are uses for
regression and comparisons with fpc's in descriptive studies. I
once analyzed Behavioral Risk Factor Surveillance System (BRFSS)
data in California, and characterized historical changes in smoking
prevalence with a regression line. It fit pretty well.
I also would favor logistic regression and log-linear modeling as
smoothing techniques to economically describe a population.
Confidence intervals (with fpc's) for differences between two
proportions can also be informative; one might want to know "How
different were the proportions in that population at that time?".
In my experience, though, most investigators who do regressions do
not intend their analyses to be descriptive only. Until Thomas
tells us the purpose of his study, we will not really know what to
advise.
-Steve
On Feb 19, 2009, at 1:24 PM, Stas Kolenikov wrote:
Adding to the previous comments:
In all likelihood, your results are only generalizable to those most
populous counties, as they are probably large metropolitan areas. You
would need to think very carefully about what the population is to
which the results are generalizable. Your superpopulation, if you can
think of one, would be all potential trials in these and similar large
counties. I would imagine that in a county of 3,000 people in Idaho,
people won't be suing each other as furiously as somewhere in New
Jersey or California, as there is plenty of land to live on... but
that's something for you to clarify.
Hence, just like Michael, I would disagree with Steven about ignoring
the fpc so happily. The fpc's would affect your standard errors,
correctly showing that you got more than half of your total finite
population. If you had all of your population, you would have a census
logistic regression, which would be just some sort of line saying
where your 0s and 1s are. Now, if you had a census regression, what
would standard errors stand for? On one hand, you've got all possible
observations, so there is no uncertainty left -- the
sampling/randomization/design variance is zero. But if you are
thinking about the social process that has created those observations
(trials), then you can still think about model variances that should
be on the scale of 1/N -- and to get these, you would need to ignore
the fpc.
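
The contrast between the two variances can be sketched numerically. The Python below is an illustration only: it assumes a single-stage simple random sample of n trials out of N (the actual CJSSC design is stratified and two-stage) and an assumed proportion of 0.2. The helper `se_proportion` is hypothetical.

```python
import math

def se_proportion(p, n, N=None):
    """SE of a sample proportion under SRS; pass N to apply the fpc (1 - n/N)."""
    variance = p * (1.0 - p) / (n - 1)
    if N is not None:
        variance *= 1.0 - n / N
    return math.sqrt(variance)

N = 10500   # approximate number of trials in the 75 counties (from the thread)
p = 0.2     # assumed proportion, for illustration only
for n in (2000, 8000, N):
    design_se = se_proportion(p, n, N)   # design-based: shrinks to 0 at a census
    model_se = se_proportion(p, n)       # fpc ignored: stays on the 1/n scale
    print(n, round(design_se, 5), round(model_se, 5))
```

With n = N the design-based standard error is exactly zero, while the version that ignores the fpc remains positive.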
Your design specification thus depends on which variance you want to
estimate. With census regression, you are saying, "There is a line of
best fit, and I am prepared to find out it does not fit the data
perfectly, but if my goal is to get as close to that line of best fit
as possible, then my sample logistic regression is the answer". That
line of best fit is a well-defined population concept; whether it
makes substantive sense or not -- that's certainly open to
interpretation. With a superpopulation model, you are saying, "I know
perfectly well that these and only these factors affect the
probability of observing that post-trial motion, and they enter the
logistic equation linearly, and all that." Your results will only be
as good as your model, and you are putting a lot of trust in correct
specification there.
On Wed, Feb 18, 2009 at 11:04 PM, <[email protected]> wrote:
I'm a beginner Stata user and have a question about the svyset command
in Stata that I hope someone can help me with.
For some background, I'm fitting a logistic regression model that
examines the likelihood of either a plaintiff or defendant filing a
post-trial motion. The database I'm working with is the Civil Justice
Survey of State Courts (CJSSC). The CJSSC provides case-level data for
all tort, contract, and real property trials concluded in a sample of
46 of the nation's 75 most populous counties in 2005. Data are
collected on about 8,000 trials in these 46 counties, which are
weighted to represent about 10,500 trials concluded in the nation's 75
most populous counties. I understand that one of the nice features of
Stata is that it allows you to take into account the sampling
structure of a dataset when doing logistic regression modeling. Here
is the Stata code that I used to take into account the sampling
structure of these civil trial data:
svyset sitecode [pweight=bwgt0], strata(strata) fpc(fpc1) || su2, fpc(fpc2)
Where
sitecode = County where the civil trial took place
bwgt0 = Weights to weight the data from the 46 sampled counties to the
75 most populous counties
strata = Strata where the counties are located. The dataset has 5
strata
fpc1 = The probability of a county appearing in the sample. For
example, a county with a weight of 2 would have a 50% probability of
appearing in the sample
su2 = Unique identifier for the trials that occurred in each of the 46
counties
fpc2 = 1 for all 8,000 trials disposed in the 46 counties. I gave fpc2
a value of 1 because I wanted to tell Stata that the trials had a 100%
probability of showing up in these 46 counties.
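
The relationship between a sampling weight and the first-stage sampling rate described above is a simple reciprocal. Here is a minimal Python sketch, assuming the weight equals the inverse of the county's selection probability. The helper name `sampling_rate_from_weight` is hypothetical, and the weights below are made-up illustrations, not CJSSC values.

```python
def sampling_rate_from_weight(weight):
    """Selection probability implied by an inverse-probability weight."""
    if weight < 1:
        raise ValueError("an inverse-probability weight cannot be below 1")
    return 1.0 / weight

# Illustrative weights only; a weight of 2 matches the example in the text.
for w in (1, 2, 4):
    print(f"weight {w} -> sampling rate {sampling_rate_from_weight(w):.2f}")
```

A weight of 1 corresponds to a unit selected with certainty, which is the situation fpc2 = 1 is meant to describe for trials within the sampled counties.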
I think that I got the part of this programming that deals with the
first level of the sample design correct. It's the second level that
I'm having some problems with. At the second level of the sample
design, I'm trying to correct for the fact that I have data for every
civil trial concluded in the 46 counties. Basically, I want to tell
Stata that part of this sample is actually a census of all trials
concluded in the 46 counties in 2005. I understand Stata has a finite
population correction command that takes into account the census-like
format of these data. The logistic regression results were the same
irrespective of whether I used the 1st or 2nd stages in the sample
design. I think this is telling me that Stata is not correcting for
the census-like aspect of this sample. Can anyone give me some
guidance as to whether I'm correctly taking into account the sampling
structure of these data? In particular, I would like to know whether
I'm using the fpc2 factor correctly. Any assistance you could give on
this matter would be very much appreciated.
Thanks
Thomas Cohen
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
--
Michael I. Lichter, Ph.D.
Research Assistant Professor & NRSA Fellow
UB Department of Family Medicine / Primary Care Research Institute
UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
Office: CC 125 / Phone: 716-898-4751 / E-Mail: [email protected]