Apparently, I need an infinite sample size to obtain a perfect
theoretic standard normal distribution with zero skew, variance
exactly 1 and kurtosis exactly 3.
Consider the following code:
version 8.2
set memory 400m
* 10,000,000 obs!:
set obs 10000000
generate double perfuniform = _n/(_N+1)
generate double perfnorm = invnorm(perfuniform)
summarize perfnorm, detail
The last command yields:
perfnorm
-------------------------------------------------------------
Percentiles Smallest
1% -2.326346 -5.199338
5% -1.644853 -5.068958
10% -1.281551 -4.991217 Obs 10000000
25% -.6744897 -4.935367 Sum of Wgt. 10000000
50% 6.96e-17 Mean 7.79e-18
Largest Std. Dev. .9999986
75% .6744897 4.935367
90% 1.281551 4.991217 Variance .9999971
95% 1.644853 5.068958 Skewness 3.23e-16
99% 2.326346 5.199338 Kurtosis 2.999924
Now the skew is basically zero, but the kurtosis and variance are not
precisely Gaussian. For 20,000 observations, the above would give
similar skew, kurtosis=2.987949 and variance = .9991686. (n=2000:
small skew, kurt=2.937, Var=.99386)
I see a number of possible approaches to this issue.
1) Include the numbers 1 and 0 in the uniform distribution. This
doesn't work: the invnorm function simply produces missing values for
those observations.
2) Take a harder look at my allegedly perfect uniform distribution.
Is this where I went wrong?
3) Try another statistics package?? (Somehow I doubt that this would help.)
4) We can always rescale the dispersion. Divide the resulting
distribution by the standard deviation. Consider this addition to the
10,000,000 obs example:
gen double perfnorm2=perfnorm/r(sd)
sum perfnorm2, detail
In that case, skew stays close to zero, var=1 and kurtosis is
unaffected - still less than 3.
5) Set up a loop and iteratively modify a couple of observations to
improve the kurtosis. Set the proper variance with 4). I
experimented with that: one problem is that this patch doesn't really
produce a perfect theoretic Gaussian distribution, but rather a
kludge.
6) Ignore the problem and characterize it as an oddity. Note though
that this may imply that normally distributed standard random
variables in Stata will not be perfectly standard normals *on
average*: even their variance will be a little lower without
adjustment. These effects would presumably be swamped by ordinary
sampling randomness.
More generally, I wonder whether this is the proper way to generate a
standard normal random variable in Stata:
gen double perfuniform = _n/(_N+1)
gen double perfnorm = invnorm(perfuniform)
sum perfnorm
scalar adjsd=r(sd)
gen double randnorm = invnorm(uniform())/adjsd
This procedure would not correct the kurtosis though.
Might this effect matter in small samples? Sure: for n=20, the
unadjusted variance of the theoretic standard normal would be .794,
which seems low.
But recall point 2): I'm probably missing something. For example, the
uniform random number generator would not produce the sort of evenly
spread-out numbers in my perfuniform variable. So this adjustment may
be inappropriate. Monte Carlo work might clarify matters, as would
additional conceptual insight. In short, I would avoid using this
adjustment without further analysis.
7) Or perhaps I *am* generating perfectly normal distributions, but
that theoretic normals only have kurtosis -> 3 as n -> infinity. I
don't know.
Thanks for your attention.
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/