Cruces,GA
>
> Working with large datasets, I've found a problem regarding
> observations id: my originals are way too long (say, strings 20 of
> the form "PROVINCE-CITY-HOUSEHOLD..."). The id variable only is
> sometimes half of my file. Generating numerical ids (as explained in
> a very useful FAQ by N. Cox) is useful, but then I sometimes have
> problems with the rounding of numbers (since I have ids from 1 to,
> say, 16 millions).
>
> I thought about a solution which uses strings but is more compact
> than my original, which is storing numerical ids as strings in
> hexadecimal notation. I've found a discussion by W. Gould on this
> list, but this referred basically as hex as a form of displaying
> numbers (from a FAQ: "Stata also provides a special %21x format that
> shows the exact value in a special hexadecimal format").
>
> I was wondering how I can go from a float (numerical id) to a compact
> string showing the hexadecimal value (perhaps even more compact than
> the %21x format since I only have positive integers). There might
> also be the problem of loss of precision in the conversion, and of
> course I need to avoid that.
>
> I guess my question boils down to converting a variable from its
> value to a string with its displayed value.
The FAQ to which Guillermo refers is presumably
How do I create individual identifiers numbered from 1 upwards?
http://www.stata.com/support/faqs/data/group.html
which is by William Gould and myself. One key point not
stressed in that FAQ is that it is often useful -- indeed
sometimes essential -- to specify -long- for a numeric id.
A -float- holds integers exactly only up to 16,777,216, whereas
a -long- holds them exactly up to just over 2 billion. There
should be absolutely no problem in holding distinct
ids for this number of observations, so long as they
are held in an integer type. I guess that this is the main
answer to the underlying problem here.
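
A minimal sketch of the -long- point, using a toy dataset. The
variable names origid and id, and the identifier strings themselves,
are made up for illustration; only -egen, group()- and the -long-
storage type are taken as given.

```stata
* toy data: origid is a hypothetical name for the long string identifier
clear
input str24 origid
"BUENOSAIRES-LAPLATA-0001"
"BUENOSAIRES-LAPLATA-0001"
"CORDOBA-CORDOBA-0002"
end

* ask for -long- explicitly rather than relying on the default type
egen long id = group(origid)

* each distinct origid should map to exactly one numeric id
bysort origid (id): assert id[1] == id[_N]

* why the type matters: 16777217 cannot be held exactly as a float,
* so it is rounded to the neighbouring value 16777216
display %12.0f float(16777217)
```

With real data the same -egen- call applies unchanged; the toy
-input- block is only there so that the sketch runs on its own.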
I don't follow precisely what would be gained by what Guillermo
is suggesting, as it is difficult to improve on the efficiency
of mapping to integers. But you can use %21x as an argument to
-string()-, just like any other legal numeric display format.
However, it is special and is likely to produce _longer_ strings.
-inbase- (Stata 8, undocumented) works on individual numbers
only.
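
To make the point about length concrete, here is a small sketch.
The variable names id, hexstr and decstr are made up for
illustration, and the value 16000000 simply stands in for a large id.
It compares the string produced with %21x against the plain decimal
digits of the same number.

```stata
clear
set obs 1
gen long id = 16000000

* the %21x display format passed as the second argument of string()
gen str21 hexstr = string(id, "%21x")

* the same id written out as ordinary decimal digits
gen str8 decstr = string(id, "%8.0f")

list id hexstr decstr

* compare the lengths of the two strings
display length(hexstr[1])
display length(decstr[1])
```

The %21x result carries a sign, a full set of hexadecimal digits and
an exponent, which is why it is not a compact way to store an id.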
Alternatively, the program below is a variant on -base()- in
-egenmore- on SSC. Once installed, it is called as, e.g.,
-egen hexid = hex(id)-, where id contains integers _only_.
*! 1.0.0 NJC 20 July 2003
program define _ghex
        version 6.0

        * arguments as passed by -egen-: type, new variable name, "="
        gettoken type 0 : 0
        gettoken g 0 : 0
        gettoken eqs 0 : 0

        syntax varname(numeric) [if] [in]
        marksample touse

        * ignores type passed from -egen-
        local type "str1"
        local base = 16

        * only integers are allowed
        capture assert `varlist' == int(`varlist') if `touse'
        if _rc {
                di in r "`varlist' invalid: not integer"
                exit 459
        }

        * sign is 1 if any value used is negative
        capture assert `varlist' >= 0 if `touse'
        local sign = _rc != 0

        quietly {
                tempvar work digit
                gen `type' `g' = ""
                gen long `work' = `varlist' if `touse'
                gen int `digit' = .

                * highest power of the base needed for the largest magnitude
                su `work', meanonly
                local max = max(`r(max)',-`r(min)')
                local power = 0
                while `max' >= (`base'^(`power' + 1)) {
                        local power = `power' + 1
                }

                * prefix a sign only if some values are negative
                if `sign' {
                        replace `g' = `g' + cond(`work' < 0, "-","+") if `touse'
                        replace `work' = abs(`work')
                }

                * peel off one hex digit per power, highest power first
                while `power' >= 0 {
                        replace `digit' = int(`work' / `base'^`power')
                        replace `work' = mod(`work', `base'^`power')
                        replace `g' = `g' + /*
                        */ string(`digit') if `touse' & `digit' <= 9
                        replace `g' = `g' + /*
                        */ substr("abcdef", `digit' - 9, 1) if `touse' & `digit' >= 10
                        local power = `power' - 1
                }

                * drop one leading zero, if present
                replace `g' = substr(`g',2,.) if substr(`g',1,1) == "0"
        }
end
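
A short usage sketch, assuming the program above has been saved as
_ghex.ado somewhere on the adopath. The variable names and the three
id values are made up for illustration.

```stata
* assumes _ghex.ado (the program above) is installed on the adopath
clear
set obs 3
gen long id = 255 in 1
replace id = 4096 in 2
replace id = 16000000 in 3

* build the hexadecimal string id
egen hexid = hex(id)
list id hexid

* the new string variable should still identify observations uniquely
isid hexid
```

Note that the strings are built to the number of digits needed by the
largest id, so smaller ids carry leading zeros, of which only the
first is dropped by the final -replace-.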
Nick