
Re: st: Dealing with survey data when the entire population is also in the dataset

From	"Michael I. Lichter" <[email protected]>
To	[email protected]
Subject	Re: st: Dealing with survey data when the entire population is also in the dataset
Date	Sun, 26 Jul 2009 18:25:38 -0400

I guess Margo's real question is: If my null hypothesis is that there is no difference between a *sample* and a *population* with respect to the distributions of two or more *categorical* variables, what is the most appropriate way to test that hypothesis?

Austin proposed Hotelling's T-squared, which is a global test of equality of *means* for independent *samples*. This takes care of the multiple comparisons problem, but doesn't fit Margo's needs, both because of the level of measurement (except, possibly, if the categorical variables are dichotomous or ordinal and can arguably be treated as continuous-ish) and because it is a two-sample test instead of a single-sample test.

Margo's problem is the same (I think) as the problem of comparing the characteristics of a realized survey sample against the known characteristics of the sampling frame to detect bias. This is a common-enough procedure, and frequently done using chi-square tests without adjustment (correctly or not) for multiple comparisons. These are typically done as chi-square tests of independence, but since the characteristics of the sampling frame are *not* sample data, they should really be goodness of fit tests. (Right?)
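To make the goodness-of-fit idea concrete, here is a minimal sketch that computes a one-sample chi-square statistic by hand, treating the population proportions as known constants. It is an illustration only: it uses auto.dta as a stand-in population, draws a 50% simple random sample, and ignores sampling weights, so for a stratified design like Margo's it would presumably have to be run within each stratum. The scalar names (g_Npop, g_n, g_chi2, g_df) and the seed are arbitrary choices, and the loop assumes every population category also shows up in the sample.

```stata
* Population: all 74 cars in auto.dta; tabulate stores the category counts
sysuse auto, clear
tabulate foreign, matcell(popfreq)
scalar g_Npop = r(N)

* Sample: a 50% simple random draw (seed chosen arbitrarily)
set seed 12345
sample 50
tabulate foreign, matcell(obsfreq)
scalar g_n = r(N)

* Pearson statistic: sum over categories of (observed - expected)^2 / expected,
* where expected = sample size x known population proportion
scalar g_chi2 = 0
forvalues i = 1/`=rowsof(obsfreq)' {
    scalar g_exp  = popfreq[`i',1] * g_n / g_Npop
    scalar g_chi2 = g_chi2 + (obsfreq[`i',1] - g_exp)^2 / g_exp
}
scalar g_df = rowsof(obsfreq) - 1
display "chi2(" g_df ") = " %6.3f g_chi2 "   p = " %5.3f chi2tail(g_df, g_chi2)
```

Unlike a test of independence on the stacked data, the expected counts here carry no sampling error of their own, which is the point of treating the frame as fixed.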

I don't claim to be a real statistician, and I don't claim to have a real answer, but I think that results from multiple chi-square tests, interpreted jointly (so that, e.g., a single significant result with a relatively large p-value would not be considered strong evidence of difference), would be convincing enough for most audiences.
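One simple way to formalize "interpreted jointly" is a Bonferroni adjustment, sketched below. This is a swapped-in illustration, not something proposed in the thread, and the three p-values are hypothetical placeholders rather than results from any real data.

```stata
* Hypothetical p-values from three separate chi-square tests
local pvals 0.04 0.30 0.62
local k : word count `pvals'

* Bonferroni: multiply each p-value by the number of tests, capped at 1
foreach p of local pvals {
    display "raw p = " %5.3f `p' "   adjusted p = " %5.3f min(1, `p'*`k')
}
```

Under this adjustment a lone raw p-value just below .05 among several tests no longer counts as significant, which matches the informal reading suggested above.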

By the way, for clarification, here is what I was suggesting with respect to sampling and recombining the sample and population data:


-----
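* Draw a stratified sample from the auto data (domestic 50%, foreign 75%)
sysuse auto, clear
sample 50 if (foreign == 0)
sample 75 if (foreign == 1)
* pweight = 1/(sampling fraction); wt must be created before -replace-
gen wt = 1/.75 if (foreign == 1)
replace wt = 1/.5 if (foreign == 0)
gen sample = 1
gen stratum = foreign
tempfile sample
save `sample'
* Append the sample to the full population, which forms its own stratum
sysuse auto, clear
append using `sample'
replace wt = 1 if missing(sample)
replace stratum = 2 if missing(sample)
replace sample = 0 if missing(sample)
svyset [pw=wt], strata(stratum)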
----
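For completeness, here is a sketch of the comparison step that would follow. So that it runs on its own, it rebuilds the same kind of combined dataset in a slightly more compact way (population rows flagged first, then the sample appended to them) and then tabulates one categorical variable against the sample indicator. The choice of rep78 as the comparison variable and the seed are arbitrary assumptions for illustration.

```stata
* Population rows: weight 1, their own stratum, sample = 0
sysuse auto, clear
gen byte sample = 0
gen wt = 1
gen stratum = 2
tempfile pop
save `pop'

* Sample rows: 50% of domestic and 75% of foreign cars, weighted by 1/fraction
set seed 2009
sample 50 if foreign == 0
sample 75 if foreign == 1
replace sample = 1
replace wt = cond(foreign == 1, 1/.75, 1/.5)
replace stratum = foreign

* Stack sample and population, declare the design, and compare distributions
append using `pop'
svyset [pw=wt], strata(stratum)
svy: tabulate rep78 sample, column
```

The design-based test reported by -svy: tabulate- is still a test of independence, so the caveat about goodness-of-fit tests applies to it as well.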


Austin Nichols wrote:

Margo Schlanger<[email protected]> :
I think Michael I. Lichter means for you to -append- your sample and
population in step 2 below.  Then you can run -hotelling- or the
equivalent linear discriminant model (with robust SEs) to compare
means for a bunch of variables observed in both.  I.e.
.  reg sample x* [pw=wt]
in step 2b, not tabulate, with or without svy: and chi2.
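
A rough sketch of what this regression approach might look like on the auto data follows. The covariates (price, mpg, weight) stand in for x* and are arbitrary choices, as is the seed; with pweights, -regress- reports robust standard errors automatically, and -testparm- gives one joint test across all covariates.

```stata
* Build the stacked dataset: population rows (sample = 0, wt = 1) ...
sysuse auto, clear
gen byte sample = 0
gen wt = 1
tempfile pop
save `pop'

* ... plus a stratified sample (sample = 1, wt = 1/sampling fraction)
set seed 2009
sample 50 if foreign == 0
sample 75 if foreign == 1
replace sample = 1
replace wt = cond(foreign == 1, 1/.75, 1/.5)
append using `pop'

* Linear probability model of sample membership, then a joint test
regress sample price mpg weight [pw=wt]
testparm price mpg weight
```

A large joint p-value would suggest the weighted sample means do not differ detectably from the population means on these variables.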

On Fri, Jul 24, 2009 at 11:24 PM, Michael I.
Lichter<[email protected]> wrote:

Margo,

1. select your sample and save it in a new dataset, and then in the new
dataset:
a. define your stratum variable -stratavar- as you described
b. define your pweight as you described, wt = 1/(sampling fraction) for each
stratum
2. combine the full original dataset with the new one, but with stratavar =
1 for the new dataset and wt = 1 and with a new variable sample = 0 for the
original and =1 for the sample, and then
a. -svyset [pw=wt], strata(stratavar)-
b. do your chi square test or whatever using svy commands, e.g., -svy: tab
var1 sample-

Michael

Margo Schlanger wrote:

Hi --

I have a dataset in which the observation is a "case".  I started with
a complete census of the ~4000 relevant cases; each of them gets a
line in my dataset.  I have data filling a few variables about each of
them.  (When they were filed, where they were filed, the type of
outcome, etc.)

I randomly sampled them using 3 strata (for one stratum, the sampling
probability was 1, for another about .5, and for a third, about .75).
I end up with a sample of about 2000.  I know much more about this
sample.

Ok, my question:

1) How do I use the svyset command to describe this dataset?  It would
be easy if I just dropped all the non-sampled observations, but I
don't want to do that, because of question 2:

2) How do I compare something about the sample to the entire
population, just to demonstrate that my sample isn't very different
from that entire population on any of the few variables I actually
have comprehensive data about. I could do this simply, if I didn't
have to worry about weighting:

tabulate year sample, chi2

But I need the weights.  In addition, I can't simply use weighting
commands, because in the population (when sample == 0), everything
should be weighted the same; the weights apply only to my sample (when
sample == 1).  And I can't (so far) use survey commands, because I
don't know the answer to (1), above.

NOTE: Nearly all the variables I care about are categorical:  year of
filing, type of case.  But it's easy enough to turn them into dummies,
if that's useful.


Thanks for any help with this.

Margo Schlanger


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


--
Michael I. Lichter, Ph.D. <[email protected]>
Research Assistant Professor & NRSA Fellow
UB Department of Family Medicine / Primary Care Research Institute
UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
Office: CC 126 / Phone: 716-898-4751 / FAX: 716-898-3536


Follow-Ups:
- Re: st: Dealing with survey data when the entire population is also in the dataset
  - From: Ángel Rodríguez Laso <[email protected]>

References:
- st: Dealing with survey data when the entire population is also in the dataset
  - From: Margo Schlanger <[email protected]>
- Re: st: Dealing with survey data when the entire population is also in the dataset
  - From: "Michael I. Lichter" <[email protected]>
- Re: st: Dealing with survey data when the entire population is also in the dataset
  - From: Austin Nichols <[email protected]>
