Stephen Mennemeyer
>
> The Stata 7 manual, U 16.10, briefly mentions that some problems can
> arise because Stata stores numbers in single precision by default and
> carries out calculations in double precision.
>
> I have found another situation where double precision is required in
> Stata for some numbers that would seem to be safely manipulated in
> single precision.
>
> Consider a typical -- hypothetical -- Social Security number stored
> as a string variable:
> 123-45-6789
>
> One might wish to convert this to a numeric variable for easier
> manipulation, such as doing sorts in a more appealing manner.
>
> If one parses the SSN into its numeric components, multiplies them up
> to the appropriate scale and then adds them back together, the result
> is a bit surprising. This process is better done in double precision
> to get the expected result.
>
> . list ssnchar /* from an external file of hypothetical SSNs */
>
> ssnchar
> 1. 123-45-6789
> 2. 987-65-4321
> 3. 078-94-5612
> 4. 321-65-7894
> 5. 978-54-6231
>
> /* parse the string variable ssnchar into its component parts,
> multiply them up to the appropriate position in the future number
> and then add the parts */
> . gen double p1=real(substr(ssnchar,1,3))
> . gen double p2=real(substr(ssnchar,5,2))
> . gen double p3=real(substr(ssnchar,8,4))
> . gen double ssndbl=p1*1000000+p2*10000+p3
> . format ssndbl %9.0f
>
> /* with double precision the results are as expected */
> . list
>         ssnchar   p1  p2    p3     ssndbl
> 1.  123-45-6789  123  45  6789  123456789
> 2.  987-65-4321  987  65  4321  987654321
> 3.  078-94-5612   78  94  5612   78945612
> 4.  321-65-7894  321  65  7894  321657894
> 5.  978-54-6231  978  54  6231  978546231
>
> /* if we use the float form of the number, the resulting variable
> ssnf is not what might be anticipated */
> . gen p1f=real(substr(ssnchar,1,3))
> . gen p2f=real(substr(ssnchar,5,2))
> . gen p3f=real(substr(ssnchar,8,4))
> . gen ssnf=p1f*1000000+p2f*10000+p3f
>
> . format ssnf %9.0f
> . list
> . list ssn*
>         ssnchar     ssndbl       ssnf
> 1.  123-45-6789  123456789  123456792
> 2.  987-65-4321  987654321  987654336
> 3.  078-94-5612   78945612   78945616
> 4.  321-65-7894  321657894  321657888
> 5.  978-54-6231  978546231  978546240
>
The question of precision when handling large integers
has been raised many times on this list over the years.
This example is another useful warning of how things
can go wrong. I add a few comments and a question.
1. In addition to the manual, there is a fairly detailed
discussion of holding numbers as compared with holding
strings in

On numbers and strings. Stata Journal 2(3):314--329 (2002)

which on the whole sings the praises of holding
identifiers like (United States) Social Security
Numbers as strings.
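2. What drives this is the need to hold every
digit exactly. A -float- holds every integer exactly
only up to 16,777,216 (2^24), which is why the 9-digit
results above are rounded. For large integers with
9 digits, the -long- data type, which holds integers
up to 2,147,483,620, should be fine.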
3. Splitting and recombining is not the easiest
solution here, even if you want the nearest
numeric equivalent of strings like "123-45-6789".
One better way to do it is

. destring ssnchar, gen(ssn) ignore("-")

and another is

. gen long SSN = real(subinstr(ssnchar,"-","",.))

The second has the advantage that the format is
automatically sensible. Both take one line
and avoid the creation of separate variables
for the parts.
4. In this particular case, I don't understand what
the problem is with sort order when the variable
is held as string. Please elaborate.
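
As a rough check on points 2 to 4, here is a minimal sketch that
re-enters the five hypothetical SSNs from the example and compares
storage types. The names ssn, SSN and ssnf follow the commands above;
typing the data in with -input- is only an assumption standing in for
the external file, and the sort check says nothing beyond these five
fixed-width values.

```stata
clear
input str11 ssnchar
"123-45-6789"
"987-65-4321"
"078-94-5612"
"321-65-7894"
"978-54-6231"
end

* the two one-line conversions from point 3
destring ssnchar, generate(ssn) ignore("-")
generate long SSN = real(subinstr(ssnchar, "-", "", .))

* the same values forced into a float, for comparison only
generate float ssnf = SSN

format ssn SSN ssnf %9.0f
list ssnchar ssn SSN ssnf

* the two one-line conversions agree with each other
assert ssn == SSN

* how many observations lose digits when held as float
count if ssnf != SSN

* float() rounds its argument to float precision
display %12.0f float(123456789)

* sorting on the string gives the same order as sorting on the number,
* at least for these fixed-width strings with leading zeros kept
sort ssnchar
assert SSN > SSN[_n-1] if _n > 1
```

With these data the float copy differs from the -long- in every
observation, matching the ssnf column in the listing above, and
neither -assert- complains.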
Nick