Hendri Adriaens has a dataset and writes,
> I want to encrypt only a single variable, to anonymize data.
Here is what I recommend.
Let's call the data actual.dta and assume it has variable uid, which is
the official user identification number that we want to encrypt.
uid can be a string or numeric, I don't care. uid might contain
136980408 recorded as a double or long, or
"136-98-408" recorded as a string, or even
"James Smith" recorded as a string.
In what follows, we will allow repeated values of uid in the
dataset. What we are going to do is come up with new id numbers, use those,
and lock up the mapping from uid to newid.
Here's step 1:
. use actual, clear
. keep uid
. sort uid
. by uid: keep if _n==1
. set seed _______ <- fill this in with a random number
. gen double random = uniform()
. sort random
. gen long newid = _n
. drop random
. sort uid
. save mapping, replace
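Before going on, it may be worth confirming that the mapping really is one-to-one. The following is only a sketch of such a check, not part of the recipe itself; it assumes mapping.dta was just saved by the commands above and uses the standard isid and assert commands.

```stata
* Sketch: sanity checks on mapping.dta (assumes it was created as above)
use mapping, clear

* isid stops with an error if the variable does not uniquely identify
* the observations, so each uid and each newid must appear exactly once
isid uid
isid newid

* newid was generated as _n, so it should run from 1 to the number of
* distinct uid values
assert newid >= 1 & newid <= _N
```

If any of these commands stops with an error, the mapping should be rebuilt before it is used.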
New dataset mapping.dta contains two variables: uid and the corresponding
newid. Next, we fix actual.dta for public consumption:
. use actual
. sort uid
. merge uid using mapping
. assert _merge==3
. drop _merge uid
. save actual, replace
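The merge command above uses the older syntax, which requires both datasets to be sorted by uid. For readers on newer releases, the same step might be written as follows. This is a sketch offered as an alternative to the commands above, not in addition to them, and it assumes Stata 11 or later, where the merge m:1 syntax is available.

```stata
* Sketch: the same step with the newer merge syntax
* (assumption: Stata 11 or later; run this INSTEAD of the block above)
use actual, clear

* m:1 because uid may repeat in actual.dta but is unique in mapping.dta
* keepusing(newid) brings over only newid from mapping.dta
* assert(match) stops with an error unless every observation matches
* nogenerate suppresses the _merge variable
merge m:1 uid using mapping, keepusing(newid) assert(match) nogenerate

drop uid
save actual, replace
```

With this syntax no explicit sort is needed, and assert(match) plays the role of the assert _merge==3 line.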
Finally, we put mapping.dta in a safe place. I would write multiple copies
of mapping.dta on multiple CDs and put the CDs in multiple safes. Dataset
mapping contains all the secret information.
Dataset actual.dta no longer contains uid; it contains newid.
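Should someone with access to the locked-up mapping ever need the original identifiers back, the process can be reversed. A minimal sketch, assuming Stata 11 or later for the merge m:1 syntax, and assuming mapping.dta has been retrieved into the current directory and actual.dta still contains every newid:

```stata
* Sketch: re-attach uid for authorized use only
* (assumptions: Stata 11 or later; mapping.dta is in the current directory)
use actual, clear

* newid may repeat in actual.dta but is unique in mapping.dta
merge m:1 newid using mapping, keepusing(uid) assert(match) nogenerate
```

If the public file has been subset since it was created, assert(match) will fail because some newid values in mapping.dta will have no partner; in that case the option would need to be relaxed.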
-- Bill