I wrote:
local quit = 1
while (`quit') {
    generate double randu`quit' = uniform()
    sort randu`quit', stable
    capture assert randu`quit' > randu`quit'[_n-1] in 2/l
    if _rc local ++quit
    else continue, break
}
--------------------------------------------------------------------------------
Now I remember: the sort on random numbers needed to be hierarchical to
ensure that the iterations would eventually end, especially with the
large dataset. What I ended up with was something more like this:
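clear
set memory 100M
set obs `=2e6'
set seed `=date("2007-01-24", "ymd")'
generate long surrogate_id = _n
* duplicates flags observations that still need a fresh random number
generate byte duplicates = 1
local pass 1
while (`pass') {
    * draw a new random number only for the observations still tied
    generate double randu`pass' = uniform() if duplicates
    * hierarchical sort: randu1, then randu2, and so on
    sort randu*, stable
    * flag the later member(s) of each tie in this pass's draw
    replace duplicates = 0
    replace duplicates = (randu`pass' == randu`pass'[_n-1]) ///
        if !mi(randu`pass') & _n > 1
    capture assert duplicates == 0
    if _rc {
        * also flag the first member of each tied group
        replace duplicates = 1 if (duplicates[_n + 1] == 1)
        local pass = `pass' + 1
    }
    else continue, break
}
drop randu* duplicates
display in smcl as text "Number of passes: " as result `pass'
exit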
This example (two million rows) takes two passes even with double-precision
random-number variables.
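A rough birthday-problem estimate suggests why a second pass is needed.
The sketch below assumes that uniform() returns values on a grid of
about 2^32 equally likely points; that resolution is an assumption for
illustration, not a figure stated in this thread. If it holds, storing
the draws as doubles does not add any distinct values.

```stata
* Approximate expected number of tied pairs among n draws from m
* equally likely values: n*(n-1)/2 / m.
* m = 2^32 is an ASSUMED resolution for uniform().
local n = 2e6
local m = 2^32
display "Expected tied pairs: " %9.1f `n'*(`n'-1)/2/`m'
```

Under that assumption a single draw over two million rows is all but
certain to contain a few hundred ties, whereas the second pass only has
to separate a few hundred small groups, where another tie is very
unlikely.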
All this effort to explicitly rerandomize duplicate random numbers arose
when it seemed that "randomized" in Stata's documentation for -sort,
stable- meant "haphazard" more than "in a reproducible pseudorandom
sequence." (See the example below typed from the keyboard.) It might be
that -sort-'s randomization of ties runs off a different seed. In any
event, a result like the one below, in which the same seed and the same
data left the two tied observations in a different order, threw me, and
I resorted to hierarchical randomization to assure myself of unambiguous
reproducibility of the sequence.
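If the order among tied random numbers only needs to be reproducible,
and need not itself be random, a simpler alternative to the loop above
might be to break ties on a unique key. What follows is a minimal sketch
of that alternative, not the approach used above. It assumes a variable
such as surrogate_id that uniquely identifies observations and is
created before any sorting takes place.

```stata
* Sketch: reproducible ordering by breaking ties on a unique key.
* Assumes surrogate_id is unique and fixed before any sorting.
clear
set obs 20
set seed 1234567890
generate long surrogate_id = _n
generate double randu = uniform()
* force a tie between observations 1 and 2
replace randu = randu[1] in 2
* stop with an error if the key is not unique
isid surrogate_id
* ties in randu are resolved by surrogate_id, so no random tie-breaking
sort randu surrogate_id
* tied observations appear in surrogate_id order
assert surrogate_id > surrogate_id[_n-1] if randu == randu[_n-1] & _n > 1
```

The trade-off is that tied rows are ordered deterministically rather
than at random, which may matter if the random order is itself the
point.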
Joseph Coveney
. clear
.
. set more off
.
. set seed 1234567890
.
. set obs 20
obs was 0, now 20
.
. generate byte id = _n
.
. generate double randu = uniform()
.
. replace randu = randu[1] in 2
(1 real change made)
.
. sort randu
.
. list id if inrange(id, 1, 2)
     +----+
     | id |
     |----|
  5. |  2 |
  6. |  1 |
     +----+
.
. sort id
.
. set seed 1234567890
.
. replace randu = uniform()
(1 real change made)
.
. replace randu = randu[1] in 2
(1 real change made)
.
. sort randu
.
. list id if inrange(id, 1, 2)
     +----+
     | id |
     |----|
  5. |  1 |
  6. |  2 |
     +----+
.