Dear statalisters,
A question that may have an easy answer (but one I don't know yet):
I'm fitting a logistic regression on a pooled data set containing both
census and survey data for many years, in order to analyze change
over time in a few variables.
The survey data come from electronic datasets and are at the household
level (about 10 thousand records/households per year), while the census
data come from old printed volumes and are collapsed by the relevant
variables and their frequency distribution (subject, of course, to
what is available in the printed tables; about 10 million households
per year). For instance:
SOURCE/YEAR    CLASS           REGION    FWEIGHT
-----------    -------------   ------    ---------
survey_1999    middle class    South             1
survey_1999    middle class    North             1
...            ...             ...             ...
census_1951    service class   North           234
census_1951    blue collars    Center    1,145,434
...            ...             ...             ...
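For context, the pooled file is assembled roughly as follows (the file
and variable names here are placeholders, not my actual ones):

  use census_collapsed, clear               // one record per class-by-region cell
  append using survey_households            // one record per surveyed household
  replace fweight = 1 if missing(fweight)   // each survey household counts once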
Using fweight (which is 1 for the household-level survey records) in
this way leads me to N = 28,000,000.
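Concretely, the call looks something like this (the outcome and
covariates are placeholders for my actual variables, with class and
region encoded as numeric):

  logit outcome i.class i.region i.year [fweight=fweight]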
As a result, even parameter estimates on the order of 0.0004 come out
as statistically different from 0. This clearly makes no substantive
sense, and I'd prefer not to see N-inflated significance tests in the
output.
One practical rule that has been suggested to me is to divide the
census records' fweight by a constant (say, 10,000). That way I would
have a smaller N while preserving the distribution of my variables
within each census year.
Unfortunately, following this rule I end up with non-integer fweights,
which Stata does not accept. The other weight types available in
Stata are not related to frequencies.
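To make the difficulty concrete, here is what the suggested rescaling
looks like (again with placeholder names, and 10,000 as the constant
suggested to me):

  gen double fw_scaled = fweight
  replace fw_scaled = fweight/10000 if strpos(source, "census")
  logit outcome i.class i.region i.year [fweight=fw_scaled]
  * -> fails: Stata rejects noninteger frequency weights

  * rounding keeps the weights integer, but any census cell with a
  * frequency below 5,000 rounds to 0 and drops from the estimation
  replace fw_scaled = round(fw_scaled)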
Do you have any idea how to handle this problem, or how to deflate N
for the significance tests in -logit- and -mlogit-?
Any help would be greatly appreciated. Thanks in advance,
Teresio