Following today's exchange with N. Cox, I provide below some code based
99.9% on his hex and base functions for egen (but of course if it fails
it's due to my 0.1%). Please note that it has not been fully tested.
I called this "serial", since I use it to create a "serial number"
string, but what it actually does is convert a decimal integer into a
base 36 number (digits 0 to 9, then a to z). I already explained my reasons for
preferring this type of identifier. The relative cost in terms of
efficiency is not huge, at least for the type of data I use: with 2.3
million observations, I need a long to store a numerical id (4 bytes),
and only 25% more (a str5, 5 bytes) to store my base 36 id (with five
characters you can get up to 36^5 = 60,466,176 unique identifiers, if I'm not wrong). For
datasets with many more observations the cost in terms of memory may be
much higher.
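A quick way to check that arithmetic, as a sketch typed at the Stata prompt:

```stata
* number of distinct ids that fit in k base 36 characters is 36^k
display 36^4
display 36^5
```

The first line gives 1679616, which is below 2.3 million, so four characters are not enough; the second gives 60466176, so a str5 is sufficient.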
An advantage is that by slightly changing the code below, you can
produce almost "untraceable" ids: for instance, if id numbers are
relevant but the data producer wants to mask their real values (say, if
they have any meaning in the source dataset), modifying the order or the
content of the string "abcdefghijklmnopqrstuvwxyz" below will produce
such ids (though the same is true, of course, for N. Cox's hex
function). Only the person holding the original key should be able to
recover the original number (provided the data are no longer sorted by
id, of course).
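To illustrate what holding the key means, here is a minimal sketch of going back from the string to the number. It is not part of the function below. The variable names id, id36, id_back and len and the local macro key are assumptions made for illustration only. It assumes nonnegative ids (no sign prefix) and uses forvalues, so it needs Stata 7 or later. If the letter string has been reordered, key must hold the ten digits followed by the same reordered letters.

```stata
* hypothetical key: the ten digits, then the letters in the order used to encode
local key "0123456789abcdefghijklmnopqrstuvwxyz"

* id36 is assumed to hold the base 36 string, id the original number
gen double id_back = 0
gen int len = length(id36)
su len, meanonly
local maxlen = r(max)

* read the string left to right: value = value * 36 + (position in key - 1)
forvalues i = 1/`maxlen' {
    replace id_back = id_back * 36 + index("`key'", substr(id36, `i', 1)) - 1 if `i' <= len
}

* check the round trip against the original id
assert id_back == id
```

Without the right key the positions returned by index() are wrong, so the decoded numbers do not match the originals.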
This is working for me - I hope there are no mistakes in the code below
and that it might be useful to someone else.
best,
g.
*! 1.0.0 NJC 20 July 2003
*Heavily based on _ghex by N. Cox
*MODIFIED BY G CRUCES 20 July 2003
program define _gserial
    version 6.0
    * -egen- passes the type, the new variable name and "=" before the argument
    gettoken type 0 : 0
    gettoken g 0 : 0
    gettoken eqs 0 : 0
    syntax varname(numeric) [if] [in]
    marksample touse
    * ignores type passed from -egen-
    local type "str1"
    local base = 36
    capture assert `varlist' == int(`varlist') if `touse'
    if _rc {
        di in r "`varlist' invalid: not integer"
        exit 459
    }
    * a sign prefix is added only if some value is negative
    capture assert `varlist' >= 0 if `touse'
    local sign = _rc != 0
    quietly {
        tempvar work digit
        gen `type' `g' = ""
        gen long `work' = `varlist' if `touse'
        gen int `digit' = .
        su `work', meanonly
        local max = max(`r(max)',-`r(min)')
        * highest power of the base needed for the largest absolute value
        local power = 0
        while `max' >= (`base'^(`power' + 1)) {
            local power = `power' + 1
        }
        if `sign' {
            replace `g' = `g' + cond(`work' < 0, "-","+") if `touse'
            replace `work' = abs(`work')
        }
        * peel off one digit per pass, most significant first
        while `power' >= 0 {
            replace `digit' = int(`work' / `base'^`power')
            replace `work' = mod(`work', `base'^`power')
            replace `g' = `g' + /*
            */ string(`digit') if `touse' & `digit' <= 9
            * CHANGE or REORDER THE "abcd...vwxyz" below
            * to produce "untraceable" IDs
            replace `g' = `g' + /*
            */ substr("abcdefghijklmnopqrstuvwxyz", `digit' - 9, 1) /*
            */ if `touse' & `digit' >= 10
            local power = `power' - 1
        }
        * drops a single leading zero, if there is one
        replace `g' = substr(`g',2,.) if substr(`g',1,1) == "0"
    }
end
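
A minimal usage sketch, assuming the program above has been saved as _gserial.ado somewhere on the adopath. The toy data and the variable names id and id36 are made up for illustration:

```stata
* toy data: ten consecutive integer ids, 30 to 39
clear
set obs 10
gen long id = _n + 29

* -egen- finds _gserial.ado and calls it as the function serial()
egen id36 = serial(id)
list id id36
```

With the default letter string, 35 should come out as z and 36 as 10.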