Statalist The Stata Listserver


Re: st: RE: RE: Encryption of data


From   "Rodrigo A. Alfaro" <[email protected]>
To   <[email protected]>
Subject   Re: st: RE: RE: Encryption of data
Date   Wed, 13 Jun 2007 14:46:37 -0400

Very interesting discussion; I really like Maarten's solution.
Over lunch I was thinking about how to deal with the chance
of ties in the uniform sequence (this could matter
for very large datasets). This is my small contribution.

// Maarten's example.
*------------ begin example -----------
sysuse auto, clear
set seed 12345 // customized choice
gen double aux1 = uniform()
set seed 23456 // customized choice
gen double aux2 = uniform()
sort aux1 aux2 price mpg // dealing with ties
gen long key = _n // Bill's idea; long keeps keys exact in very large datasets

preserve
sort key
drop make
save newauto // new dataset
restore

keep make key
sort key
save secret // codes

use newauto, clear
sort key
merge key using secret
*----------- end example ------------
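
For anyone who wants to see whether ties actually occurred before relying
on the tie-breaking variables, here is a minimal sketch. It is not part of
Maarten's or Bill's code. It reuses the seeds from the example (arbitrary
choices) and the thread's -uniform()- function; newer Stata releases call
it -runiform()-.

```stata
sysuse auto, clear
set seed 12345                  // arbitrary seed, as in the example
gen double aux1 = uniform()
set seed 23456                  // arbitrary seed, as in the example
gen double aux2 = uniform()

duplicates report aux1          // tabulates how many copies each aux1 value has
capture isid aux1 aux2          // rc 0 if (aux1, aux2) is unique across observations
display "isid return code: " _rc

sort aux1 aux2 price mpg
gen long key = _n
isid key                        // stops with an error if key is not unique
```

Because key is built from _n, it is unique by construction; the -isid-
call is only a cheap guard against later edits to the do-file.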


Clearly, Hendri can impose the same keys on another dataset (car),
assuming that car is sorted by make:

use secret
sort make
merge make using car
drop make
save newcar
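
A self-contained sketch of that second step, with a check that every make
found a key. The file car is hypothetical here, so a stand-in is built from
the auto data; the tempfile names and the seed are assumptions for
illustration only. The old -merge- syntax is kept to match the thread;
recent Stata versions would write -merge 1:1 make using ...- instead.

```stata
* build the secret key file (make <-> key)
sysuse auto, clear
set seed 12345                  // arbitrary seed
gen double aux1 = uniform()
sort aux1
gen long key = _n
keep make key
sort make
tempfile secret car
save `secret'

* stand-in for the second dataset "car" (hypothetical)
sysuse auto, clear
keep make weight length
sort make
save `car'

* attach the same keys, then drop the identifying variable
use `secret', clear
merge make using `car'
assert _merge == 3              // every make matched in both files
drop make _merge
isid key                        // key still identifies each observation
```

If the -assert- fails, some make in car has no key (or vice versa), and the
mapping file needs to be extended before the data are released.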

Rodrigo.




----- Original Message -----
From: "William Gould, StataCorp LP" <[email protected]>
To: <[email protected]>
Sent: Wednesday, June 13, 2007 1:25 PM
Subject: Re: st: RE: RE: Encryption of data



Since I responded to Hendri Adriaens, who wrote that he has a dataset and

> I want to encrypt only a single variable, to anonymize data.

there has been a flurry of other responses, most focusing on cryptography. I
worry that someone will think that the "cryptographic" solution is better, so
I want to address that. In addition, Nick Cox wrote,
"There is a minute but non-zero chance of ties on numbers drawn using
-uniform()-", which is true, and he went on to worry that this would somehow
undermine what I suggested.


1. Cryptographic solutions
--------------------------

My solution, also independently suggested by Maarten Buis, IS IN FACT a
cryptographic solution; it goes under the name "one-time pad". In our
solution, the pad is applied to ids as a whole, rather than to the digits
and letters that make them up, but that is irrelevant. The one-time pad is
the strongest cryptographic solution known to man. In fact, it can be proven
that no stronger solution exists because one-time pads CANNOT BE BROKEN!
The only attack available is to steal the mapping dataset.

The method Maarten and I suggested is not a pure one-time pad, however.
Both of us used Stata's random-number generator, and assumed a seed
provided by the user. A real one-time pad would get the random numbers
from a truly random process, not a pseudo-random one. The pseudo-random
process is open to attack.

The fact that Maarten and I chose to map entire ids rather than the digits
and letters in them reduces the chances of success of this kind of attack.
When using pseudo-random numbers, the rule is the fewer, the better.

The biggest weakness in our solution is the selection of the seed
by a human. Humans do not choose randomly among all the integers
available; they choose among the subset that looks more random to
them, and they choose short ones.
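
To make the seed weakness concrete, here is a hedged sketch of what
guessing a short seed could look like. Everything specific in it is an
assumption for illustration: the "secret" seed 4242, the narrow search
range, and an attacker who knows the original row order (true for the
public auto data, not necessarily for real data). The sketch keeps the
thread's -uniform()- and old -merge- syntax.

```stata
* the data holder builds a mapping with a short, human-chosen seed
sysuse auto, clear
gen long orig = _n              // row position in the public file
set seed 4242                   // hypothetical "secret" seed
gen double u = uniform()
sort u
gen long key = _n
keep orig key
sort orig
tempfile target
save `target'

* the guesser replays candidate seeds until the mapping is reproduced
local found = 0
forvalues s = 4000/4500 {       // narrow range keeps the sketch fast
    quietly {
        sysuse auto, clear
        gen long orig = _n
        set seed `s'
        gen double u = uniform()
        sort u
        gen long guess = _n
        sort orig
        merge orig using `target'
        count if guess != key   // 0 means this seed rebuilds the mapping
    }
    if r(N) == 0 {
        local found = `s'
        continue, break
    }
}
display "recovered seed: `found'"
```

A seed drawn from a much larger range makes the same loop correspondingly
more expensive, which is the practical meaning of "do not choose short
ones".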



2. Effects of ties from uniform()
----------------------------------

Nick Cox is absolutely right that -uniform()- can produce equal values,
although it is unlikely to do so. Note that I stored the -uniform()-
result as a double. Anyway, Nick Cox is wrong in assuming that those
equal values cause any cryptographic problem. It is not a problem because
Stata's -sort- algorithm breaks ties randomly unless you specify the -stable-
option, and randomness is exactly what we require.

Now in fact, -sort- breaks ties pseudo-randomly, so the caveats of (1) apply.

It is true that, if you ran the code Maarten and I suggested twice in a row,
you might get a different mapping, but that doesn't matter. In fact,
reproducibility is not only not required in most cryptographic situations, it
is not even desirable.
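
The tie-breaking behavior can be seen with a small sketch in which every
observation ties. The variable names (orig, tie, pos1, pos2) are made up
for illustration. The count of moved observations will usually be nonzero
but is not guaranteed to be, so no particular value should be expected.

```stata
sysuse auto, clear
gen long orig = _n              // remember the starting order
gen byte tie = 1                // every observation ties on this variable

sort tie                        // ties broken (pseudo-)randomly
gen long pos1 = _n
sort orig                       // back to the starting order
sort tie                        // ties broken again, independently
gen long pos2 = _n
count if pos1 != pos2
display "observations placed differently: " r(N)

sort orig
sort tie, stable                // stable keeps the existing order within ties
assert orig == _n               // so the starting order is preserved
```

With -stable-, the result is reproducible but no longer shuffled, which is
why the option is left off when the sort is meant to scramble the ids.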

Hendri Adriaens wrote,

> [...] as Nick Cox mentioned, there is a tiny probability that you generate
> the same number twice. So, one might need a check afterwards on duplicates
> and redo the process with a different seed if there are.

There is no additional security to be gained by doing that. Ties do not
matter in this case.


-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*