First of all many thanks for all the replies.
I will check the distribution of the (employment) size of the companies in
my sample. However it does not seem very likely to me that this distribution
will prove disturbingly nonnormal. My sample is a selection, namely all the
'product innovators', of the Dutch version of the Community Innovation
Survey (CIS). The data was gathered by the Dutch central bureau of
statistics and they always (deliberately) create a sample with a large
spread of sizes of companies.
Besides the fact that I don't expect the distribution to be very skewed or
otherwise disturbed, there is also the fact that of the 5500 missing values I
imputed, more than 2000 turned out (strongly) negative.
Thus it seems that either we are completely neglecting some assumptions of
-ice-, the data are not appropriate for -ice-, or something is going wrong
internally in -ice-.
Since the data is only accessible at the bureau of statistics I can apply
your comments this Friday at the earliest.
Many thanks again and greetings,
René
-----Original message-----
From: [email protected]
[mailto:[email protected]] On behalf of Nick Cox
Sent: Wednesday, February 13, 2008 7:00 PM
To: [email protected]
Subject: RE: st: Issue with multiple imputation -ICE-
Control over how percentile ranks (also no doubt known under other
names) are calculated is easily possible, as detailed within
FAQ     Calculating percentile ranks or plotting positions
        7/02    How can I calculate percentile ranks?
                How can I calculate plotting positions?
                http://www.stata.com/support/faqs/stat/pcrank.html
Then you need just one more function to get normal scores.
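A minimal sketch of that recipe, using rep78 from the auto data purely as a stand-in variable. The (i - 0.5)/n form is one common choice of plotting position, and invnormal() is the "one more function". The names n, i, pcrank and nscore_mean are made up for illustration. By default -egen, rank()- gives tied values their mean rank, so ties share a single normal score.

```stata
*--------- begin example ---------
sysuse auto, clear
* n = number of nonmissing values of rep78
egen n = count(rep78)
* i = rank of each value; ties get their mean rank by default
egen i = rank(rep78)
* percentile rank (plotting position), strictly between 0 and 1
generate double pcrank = (i - 0.5)/n
* normal score = standard normal quantile of the percentile rank
generate double nscore_mean = invnormal(pcrank)
twoway scatter nscore_mean rep78
*---------- end example ----------
```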
Maarten Buis
--- Mark Lunt <[email protected]> wrote:
> ICE assumes that continuous variables are normally distributed: if
> that is not the case, impossible values can appear. In particular, if
> you have lots of companies with a few employees and a few companies
> with lots of employees, ICE will impute negative numbers of
> employees. One possible solution is to use the "match" option of ICE.
Good point. An alternative would be to take the logarithm of the number
of employees.
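A rough sketch of the log route, using price from the auto data only as a stand-in for a positive, right-skewed variable such as number of employees (ln_price and price_back are illustrative names). The imputation step itself is left as a comment, since the -ice- call depends on your model. The point is that exp() of any imputed value is positive, so back-transformed imputations cannot come out negative. Note that ln() of zero is missing, so companies with zero employees would need separate handling.

```stata
*--------- begin example ---------
sysuse auto, clear
* work on the log scale, where a right-skewed variable is less skewed
generate double ln_price = ln(price)
* ... impute missing values of ln_price here instead of price ...
* back-transform to the original scale; exp() is always positive
generate double price_back = exp(ln_price)
summarize price price_back
*---------- end example ----------
```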
> Alternatively, I have written some ado-files which convert variables
> to normal-scores and back: you can convert to normal scores (which
> are normally distributed), perform the imputation on these
> variables, then convert back to your original distribution.
I have had a quick look at this command and it would seem that you use
the rank of each observation and transform that as if it came from a
normal distribution. I think that that is too strong a transformation,
as you throw away all information about the distances between values
and only use the rank. This is most clearly visible when two or more
observations have the same value. The way you programmed this
procedure, they are given different ranks, and thus different values on
your new variable:
*--------- begin example ---------
sysuse auto, clear
nscore rep78, gen(gauss)
twoway scatter gauss rep78
*---------- end example ----------
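To see the size of that effect without relying on -nscore- itself, here is a sketch that builds normal scores twice with -egen, rank()-: once with the unique option, which breaks ties arbitrarily (an assumption about what is going on, based on the description above), and once with the default mean ranks. All variable names are made up for illustration.

```stata
*--------- begin example ---------
sysuse auto, clear
egen n = count(rep78)
* ties broken arbitrarily: tied values get different ranks
egen r_unique = rank(rep78), unique
* default: tied values share their mean rank
egen r_mean = rank(rep78)
generate double z_unique = invnormal((r_unique - 0.5)/n)
generate double z_mean   = invnormal((r_mean - 0.5)/n)
* range of the scores within each value of rep78
tabstat z_unique z_mean, by(rep78) statistics(n min max)
*---------- end example ----------
```

With mean ranks the minimum and maximum coincide within each value of rep78. With unique ranks they differ, which is the spread visible in the scatter plot.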
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/