Since I responded to the question from Hendri Adriaens <[email protected]>,
who wrote of his dataset,

> I want to encrypt only a single variable, to anonimize data.

there has been a flurry of other responses, most focusing on cryptography. I
worry that someone will think that the "cryptographic" solution is better, so
I want to address that. In addition, Nick Cox <[email protected]> wrote,
"There is a minute but non-zero chance of ties on numbers drawn using
-uniform()-", which is true, and he went on to worry that this would somehow
undermine what I suggested.

1. Cryptographic solutions
---------------------------
My solution, also independently suggested by Maarten Buis
<[email protected]>, IS IN FACT a cryptographic solution; it goes under the
name "one-time pad". In our solution, the pad is applied to ids as a whole,
rather than to the digits and letters that make them up, but that is
irrelevant. The one-time pad is the strongest cryptographic solution known to
man. In fact, it can be proven that no stronger solution exists because
one-time pads CANNOT BE BROKEN! The only attack available is to steal
the mapping dataset.
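
As an illustration only, the id-level pad described above can be sketched as
follows. The sketch is in Python rather than Stata; the function name
build_mapping and the sample ids are made up for the example, and it draws
its randomness from the operating system (secrets.SystemRandom) rather than
from a seeded generator. That choice is an assumption of the sketch, not
what was posted.

```python
import secrets

def build_mapping(ids):
    """Assign each distinct id a new id 1..n in a random order.

    The returned dictionary plays the role of the mapping dataset:
    whoever holds it can undo the anonymization.
    """
    distinct = sorted(set(ids))
    rng = secrets.SystemRandom()  # OS entropy source; takes no seed
    rng.shuffle(distinct)
    return {old: new for new, old in enumerate(distinct, start=1)}

# Hypothetical ids; repeated ids must receive the same new id.
ids = ["A17", "B02", "A17", "C33", "B02"]
mapping = build_mapping(ids)
anonymized = [mapping[i] for i in ids]
```

The new ids carry no information about the old ones; the only link between
the two is the mapping itself, which is why it has to be kept apart from the
anonymized data.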

The method Maarten and I suggested is not a pure one-time pad, however.
Both of us used Stata's random-number generator, and assumed a seed
provided by the user. A real one-time pad would get the random numbers
from a real random process, not a pseudo-random one. The pseudo-random
process is open to attack.

The fact that Maarten and I chose to map entire ids rather than the digits
and letters in them reduces the chances of success of this kind of attack.
When using pseudo-random numbers, the rule is the fewer, the better.

The biggest weakness in our solution is in the selection of the seed
by a human. Humans do not choose randomly among all the integers
available; they choose among the subset that looks more random to
them, and they choose short ones.
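
To make the seed weakness concrete, here is a hedged sketch, again in Python
rather than Stata, of how a short seed could be searched for. It assumes the
attacker knows the method, knows the list of original ids, and has learned a
handful of (old id, new id) pairs. The functions seeded_mapping and
guess_seed, the ids, and the seed 1234 are all hypothetical.

```python
import random

def seeded_mapping(ids, seed):
    """Build the id mapping from a seeded pseudo-random generator."""
    distinct = sorted(set(ids))
    rng = random.Random(seed)
    rng.shuffle(distinct)
    return {old: new for new, old in enumerate(distinct, start=1)}

def guess_seed(ids, known_pairs, max_seed):
    """Try every seed from 0 to max_seed; return the first one whose
    mapping agrees with all of the leaked pairs, or None."""
    for seed in range(max_seed + 1):
        candidate = seeded_mapping(ids, seed)
        if all(candidate[old] == new for old, new in known_pairs.items()):
            return seed
    return None

ids = [f"P{n:03d}" for n in range(200)]     # hypothetical ids
secret = seeded_mapping(ids, seed=1234)     # a short, human-chosen seed
leak = {old: secret[old] for old in ids[:5]}  # five pairs the attacker knows
found = guess_seed(ids, leak, max_seed=9999)
```

A seed chosen from a few thousand plausible values is exhausted almost
immediately. Any seed that reproduces the five leaked pairs will, with
overwhelming probability, reproduce the entire mapping.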

2. Effects of ties from uniform()
----------------------------------
Nick Cox is absolutely right that -uniform()- can produce equal values,
although it is unlikely to do so. Note that I stored the -uniform()-
result as a double. Anyway, Nick Cox is wrong in assuming that those
equal values cause any cryptographic problem. It is not a problem because
Stata's -sort- algorithm breaks ties randomly unless you specify the -stable-
option, and randomness is exactly what we require.

Now in fact, -sort- breaks ties pseudo-randomly, so the caveats in (1) apply.

It is true that, if you ran the code Maarten and I suggested twice in a row,
you might get a different mapping, but that doesn't matter. In fact,
reproducibility is not only not required in most cryptographic situations, it
is not even desirable.

Hendri Adriaens <[email protected]> wrote,

> [...] as Nick Cox mentioned, there is a tiny probability that you generate
> the same number twice. So, one might need a check afterwards on duplicates
> and redo the process with a different seed if there are.

There is no additional security to be gained by doing that. Ties do not
matter in this case.
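
The point about ties can be checked with a small sketch, in Python rather
than Stata. Python's sort is stable, so the sketch shuffles the rows first
to imitate a sort that breaks ties at random; the function
mapping_from_keys and the key values are hypothetical, with a tie forced
between two ids.

```python
import random

def mapping_from_keys(ids, keys, rng):
    """Order ids by their random key, breaking ties at random,
    and number them 1..n in that order."""
    pairs = list(zip(keys, ids))
    rng.shuffle(pairs)              # randomize the order of tied rows
    pairs.sort(key=lambda p: p[0])  # stable sort keeps that order within ties
    return {old: new for new, (_, old) in enumerate(pairs, start=1)}

rng = random.Random()               # unseeded: reproducibility is not wanted
ids = ["A", "B", "C", "D"]
keys = [0.42, 0.17, 0.42, 0.93]     # A and C are deliberately tied
m = mapping_from_keys(ids, keys, rng)
```

Every id still receives a distinct new id. The tie only means that which of
A and C gets 2 and which gets 3 is decided by the tie-break rather than by
the keys, and that decision is itself random.
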
-- Bill
[email protected]