Peter Graven <[email protected]> asks,
> How do you create a random variable given just the percentile
> statistics for a distribution? For example the 1%, 5%, 10%, 25%, 50%,
> 75,%, 90%, 95%, 99% percentiles are, respectively, 795, 1834, 2172,
> 2641, 3174, 3822, 4900, 5981, 8522. That is, I would like to create a
> variable with this distribution in my dataset.
What a great question! Stata really should have a function that will
do this for you, but it doesn't. So we'll have to write our own.
First, let's figure out how we are going to proceed. Forgive me for
being so didactic;I'm talking to myself. We have,
q .01 .05 .10 .25 .50 .75 .90 .95 .99
------------------------------------------------------
y 795 1834 2172 2641 3174 3822 4900 4981 8522
Let's imagining pulling a value u at random from uniform(). If that
value were .01, or .05, or ..., we know what our funtion should
return: 795, or 1834, or ...
What if we get a value between two of the predefined quantiles?
Well, we could assume the values that y can take on are fixed and
those values are 695, 1834, ..., but looking at those values, I
bet y is continuous, so let's linearly interpolate.
That leaves on more problem. What if we u=.002 or u=.995? Peter
didn't tell us the range of y, but we'll need to know it. I'm going
to pretend the table reads
q 0 .01 .05 .10 .25 .50 .75 .90 .95 .99 1
----------------------------------------------------------------
y 0 795 1834 2172 2641 3174 3822 4900 4981 8522 10000
So I'm asserting that y is bounded by [0, 10000). Peter can substitute
more reasonable end points.
Okay, I know what I need to code. We could write this as a Stata program --
an ado-file -- but the result would run slowly and be difficult to read. I'm
going to use Mata. Here's my program:
------------------------------------------------ myprog.do -------------------
capture mata mata drop u_empirical()
mata:
real colvector u_empirical(real scalar n, real vector q, real vector y)
{
real scalar i, k
real scalar u, a
real colvector result
if (length(q)!=length(y)) _error(3200)
if (length(q)<2) _error(3300)
result = J(n, 1, .)
for (i=1; i<=n; i++) {
u = runiform(1,1)
if (u==0) result[i] = y[1]
else {
for (k=2; 1; k++) {
if (q[k] > u) break
}
a = (u-q[k-1])/(q[k]-q[k-1])
result[i] = (1-a)*y[k-1] + a*y[k]
}
}
return(result)
}
end
------------------------------------------------------------------------------
Let's try it,
. do myprog
(output omitted)
. mata
: q = ( 0, .1, .5, .1, .25, .5, .75, .90, .95, .99, 1)
: y = ( 0, 795,1834,2172,2641,3174,3822,4900,5981,8522,10000)
: random = u_empirical(5, q, y)
1
+---------------+
1 | 891.0661438 |
2 | 3545.227971 |
3 | 3323.821994 |
4 | 3445.628493 |
5 | 3651.384132 |
+---------------+
: end
I think that's right. Actually, I'm pretty sure it's right because,
behind the scenes, I included the line -u, result[i]- at the end of the
outside for loop so that I could see the value of u chosen and verify that
that the program calculated the correct result.
Doubtlessly Peter wants these values a Stata dataset:
. drop _all
. set obs 50
obs was 0, now 50
. gen x = .
(50 missing values generated)
. set seed 20399
. mata
: st_view(myx=., ., "x")
: myx[.] = u_empirical(50, q, y)
: end
. list in 1/5
+----------+
| x |
|----------|
1. | 3954.497 |
2. | 1015.753 |
3. | 7914.299 |
4. | 1564.397 |
5. | 995.6785 |
+----------+
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/