This FAQ is an expanded version of an earlier reply to a question on Statalist. This question, or a some variation of it, is one of the most common ICD-related topics that we see in Technical Support.
Title | Converting string ICD-9 codes to numeric | |
Author | Rebecca Pope, StataCorp |
The short answer is
you can use the destring command or the real() function. But, in most cases, it is a bad idea to convert codes from string to numeric because you can lose information. For details, see the long answer below.
If you are asking this question because you want to check for a range of codes, see
How do I check a variable for a range of diagnosis or procedure codes?
Or, if you want to label the codes with their descriptions, see
How do I label my diagnosis or procedure codes with their descriptions?
Note that you will not save space by storing full ICD-9 codes as numeric variables. Because you will likely use a dot with a numeric variable, Stata will store these values as doubles to retain the correct numeric precision, requiring 8 bytes of storage space. By contrast, storing a string variable with an implied dot needs only 5 characters, or 5 bytes.
Mechanically, the answer to this question is simple. To convert from string to numeric data in Stata, you have your choice between
. destring stringvar, generate(numvar)
or
. generate numvar = real(stringvar)
Which you choose is mostly a matter of personal preference. destring returns an error if it encounters nonnumeric values. generate with the real() function bypasses the error, returning missing values instead.
However, ICD-9 and ICD-9-CM codes are not ordinary numerical strings. ICD-9 codes are diagnosis codes that are of the form ###, ###.#, E###, or E###.#, where # indicates a digit from 0 to 9. In the U.S., ICD-9 is commonly used as a shorthand for referring to ICD-9-CM codes. These are diagnosis codes of the form ###, ###.#, ###.##, E###, E###.#, V##, V##.#, or V##.## and procedure codes of the form ##, ##.#, or ##.##.
Obviously, the codes that start with E and V have no numeric representation. We can see this interactively by typing
. display real("E800.1") .
You have to confront the issue of leading and trailing zeros even if you do not have any of these E-codes (external cause of injury) and V-codes (nondisease encounter with the system) in your data.
Consider the following data:
+-----------------------------------+ dx descr +-----------------------------------+ 1. 25013 dmi ketoacd uncontrold 2. 0389 septicemia nos 3. 3890 conductive hearing loss* 4. 0030 salmonella enteritis 5. 030 leprosy* ------------------------------------- 6. 410 acute myocardial infarct* 7. 4100 ami anterolateral wall* 8. 41000 ami anterolateral,unspec +-----------------------------------+
Many administrative systems store codes with what is called an "implied dot", meaning that the code 250.13, for example, is stored as "25013". Whether a dot is present matters not at all when you convert between string and numeric for a code like 250.13. However, it matters quite a bit for a code like 038.9, because if you convert your string codes to a numeric storage type, they will not retain the leading zeros.
To see this, create two new variables called num_dx and num_dot. num_dx is a straight numeric conversion of the string in dx.
. destring dx, generate(num_dx)
Now, create a version of dx that includes the dot, and then destring that variable.
. clonevar dx_dot = dx . icd9 clean dx_dot, dots . destring dx_dot, generate(num_dot) +-----------------------------------+ dx num_dx dx_dot num_dot ------------------------------------- 1. 25013 25013 250.13 250.13 2. 0389 389 038.9 38.9 3. 3890 3890 389.0 389 4. 0030 30 003.0 3 5. 030 30 030 30 +-----------------------------------+
Looking at rows 4 and 5, we see that the leading zeros are stripped from the string during the string-to-numeric conversion of dx, meaning that we lost the distinction between a 3-digit and a 4-digit code. Thus, when the dot is omitted, 003.0 and 030 both map to the numeric value 30. However, if we destring dx_dot instead, 003.0 maps to 3 and 030 maps to 30, not an altogether bad approximation of the values. It is tempting to conclude that as long as we only destring codes with dots, we are protected against errors. But this is not the case.
Observe the results for the last three rows of the data:
+-----------------------------------+ dx num_dx dx_dot num_dot ------------------------------------- 6. 410 410 410 410 7. 4100 4100 410.0 410 8. 41000 41000 410.00 410 +-----------------------------------+
In this case, num_dx, the variable created by converting our original string diagnosis, provides the best results. num_dot, the variable created by converting the string diagnosis code that includes a dot, results in all values being equal to 410. 410.00 is a billable code, the other two codes are not.
If you work with prevalidated data, you may never encounter this situation. However, if you are working with data that have not been cleaned by a third party to verify that all codes are complete (known as "terminal codes" in coding jargon), you are likely to encounter the problem of category codes and subcategory codes mapping to the same numeric value.
What about combining the two approaches? We could convert some codes from dot form and some from implied-dot form. Here is one way to do the prep work:
. icd9 clean dx . clonevar dx_dot = dx . icd9 clean dx_dot, dots . generate toconvert = dx if substr(dx,-2,2)=="00" . replace toconvert = dx_dot if missing(toconvert) . destring toconvert, generate(hybrid) +----------------+ dx hybrid ------------------ 1. 25013 250.13 2. 0389 38.9 3. 3890 389 4. 0030 3 5. 030 30 ------------------ 6. 410 410 7. 4100 4100 8. 41000 41000 +----------------+
All appears well until two patients show up, one with a neurotic disorder (perhaps the result of data management for ICD-9 codes) and one with lepromatus leprosy.
+---------------+ dx hybrid ----------------- 9. 300 300 10. 0300 300 +---------------+
The above examples are clearly contrived to highlight problems you might encounter. But how does this conversion affect real datasets? To get a sense of the scope of the problem, we can use the nhds2010.dta sample dataset. This dataset is an extract of all same-day discharges from U.S. hospitals for adults in 2010 taken from the National Hospital Discharge Survey (NHDS).
The diagnosis codes in the NHDS are prevalidated, so we do not have to worry about category codes appearing alongside terminal codes. In a way, this is an ideal case.
In our example dataset, we have 779 unique string diagnosis codes. We destring the variable containing the primary diagnosis code, formatted with a dot.
. icd9 clean dx1, dots (2210 changes made) . destring dx1, generate(num_dx1) force dx1 contains nonnumeric characters; num_dx1 generated as double (66 missing values generated)
We are left with 749 distinct codes, but this is because we lost 30 distinct V-codes in 66 records. So in this prevalidated dataset, we do not have the problem of multiple string codes mapping to the same numeric value.
What happens when we try to recover our string codes using tostring? Obviously, we lost the V-code information, but let's assume for now that it has been stored in another variable and focus on the string codes that have number values only. We can convert the numeric codes back to string using tostring.
. tostring num_dx1, generate(str_dx1) format(%06.2f) str_dx1 generated as str6 . replace str_dx1="" if missing(num_dx1) (66 real changes made) Now, we can check how often these codes match the original. . count if dx1 != str_dx1 & !missing(str_dx1) 1,048
We get back accurate information for less than half of our 2,210 observations.
In the original primary diagnosis variable (dx1), all codes are defined and valid. However, even though we maintained a unique mapping for our numeric values, we do not recover accurate codes.
. icd9 check str_dx1, generate(prob) (str_dx1 contains 66 missing values) str_dx1 contains undefined codes: 1. Invalid placement of period 0 2. Too many periods 0 3. Code too short 0 4. Code too long 0 5. Invalid 1st char (not 0-9, E, or V) 0 6. Invalid 2nd char (not 0-9) 0 7. Invalid 3rd char (not 0-9) 0 8. Invalid 4th char (not 0-9) 0 9. Invalid 5th char (not 0-9) 0 10. Code not defined 1,046 ----------- Total 1,046
The string-numeric-string conversion resulted in missing information for 66 observations. The conversion also resulted in 1,048 incorrect codes, of which 1,046 are undefined codes. However, two records have incorrect diagnoses that appear to be legitimate.
We can generate descriptions for the original and recovered ICD-9 code and then compare them. For our data, they are
+---------------------------------------------------------------------+ dx1 right_descr str_dx1 wrong_descr ----------------------------------------------------------------------- 174. 173.1 malig neo skin eyelid# 173.10 mal neo eyelid/canth nos 175. 173.5 malig neo skin trunk# 173.50 malig neo skin trunk nos +---------------------------------------------------------------------+
In this case, we've replaced the code that was correct and current in 2010 with one that did not go into effect until 2011 (you can find this information by typing icd9 query).
In conclusion, there is no "lossless" way to convert string ICD-9 codes to a numeric storage type.