Sometimes, categorical data are stored as strings. For example, the variable race may be stored with the words "Black", "Other", and "White". We will need to convert these variables to numeric data before we can use them with Stata's statistical features.
Let's begin by opening and describing an example dataset from the Stata website.
. use https://www.stata.com/users/youtube/rawdata.dta, clear
(Fictitious data based on the National Health and Nutrition Examination Survey)

. describe

Contains data from https://www.stata.com/users/youtube/rawdata.dta
 Observations:         1,268                  Fictitious data based on the National Health and Nutrition Examination Survey
    Variables:            10                  6 Jul 2016 11:17
                                              (_dta has notes)
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
id              str6    %9s                   Identification Number
age             byte    %9.0g
sex             byte    %9.0g                 Sex
race            str5    %9s                   Race
height          float   %9.0g                 height (cm)
weight          float   %9.0g                 weight (kg)
sbp             int     %9.0g                 Systolic blood pressure (mm/Hg)
dbp             int     %9.0g                 Diastolic blood pressure (mm/Hg)
chol            str3    %9s                   serum cholesterol (mg/dL)
dob             str18   %18s
--------------------------------------------------------------------------------
The storage type for the variable race is a 5-character string. Let's tabulate race to view the categories.
. tabulate race

       Race |      Freq.     Percent        Cum.
------------+-----------------------------------
      Black |        176       13.88       13.88
      Other |         22        1.74       15.62
      White |      1,070       84.38      100.00
------------+-----------------------------------
      Total |      1,268      100.00

There are three categories, stored as the strings Black, Other, and White. We can use Stata's encode command to generate a new labeled numeric variable named racen.
. encode race, gen(racen)
Let's type tabulate race racen to view a cross-tabulation of the two variables and list race racen in 1/5 to view some raw data.
. tabulate race racen
. list race racen in 1/5

     +---------------+
     |  race   racen |
     |---------------|
  1. | White   White |
  2. | White   White |
  3. | White   White |
  4. | Black   Black |
  5. | White   White |
     +---------------+

The two variables appear to be identical. Next let's describe both variables.
. describe race racen

Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
race            str5    %9s                   Race
racen           long    %8.0g      racen      Race

The storage type for race is "str5", and the storage type for racen is "long", which is a type of numeric variable. You can type help data_types to learn more about different types of numeric data. Notice that the Value label for racen is "racen". Let's type label list racen to view the labels.
. label list racen
racen:
           1 Black
           2 Other
           3 White

The variable racen is a numeric variable where 1 represents Black, 2 represents Other, and 3 represents White. This will allow us to use racen with Stata's statistical features such as regression modeling.
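The numeric codes behind the labels can be inspected directly, and the coding order can be chosen in advance. The following is a minimal sketch rather than part of the original example: the names raceorder and racen2 are hypothetical, and the regression of sbp on race is only an illustration of using the encoded variable. By default, encode numbers the categories in alphabetical order; if the value label named in label() already exists, encode uses its mappings instead.

```stata
* Sketch only: raceorder and racen2 are hypothetical names chosen for illustration
use https://www.stata.com/users/youtube/rawdata.dta, clear
encode race, generate(racen)

* Show the underlying numeric codes instead of the value labels
list race racen in 1/5, nolabel
tabulate racen, nolabel

* Define the value label first to control which code each category receives
label define raceorder 1 "White" 2 "Black" 3 "Other"
encode race, generate(racen2) label(raceorder)
label list raceorder

* Use the encoded variable as a factor variable in a model (illustrative only)
regress sbp i.racen2
```

With this ordering, White is coded 1 and therefore becomes the base category when i.racen2 is used as a factor variable.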
Note that there is a decode command that will do the reverse of encode: it will convert labeled numeric categorical variables to string variables.
. decode racen, gen(races)
We can use describe and list to verify that it worked.
. describe race racen races
. list race racen races in 1/5

     +-----------------------+
     |  race   racen   races |
     |-----------------------|
  1. | White   White   White |
  2. | White   White   White |
  3. | White   White   White |
  4. | Black   Black   Black |
  5. | White   White   White |
     +-----------------------+

The raw data look the same for all three variables, but, as we have learned, the storage type is important. And now we know how to convert between types when necessary.
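Rather than comparing listings by eye, the round trip can also be checked programmatically. This is a small sketch, assuming the same dataset and variable names used above; assert exits with an error if the condition is false for any observation.

```stata
* Sketch only: repeats the steps above so that the block runs on its own
use https://www.stata.com/users/youtube/rawdata.dta, clear
encode race, generate(racen)
decode racen, generate(races)

* The decoded string should match the original string in every observation
assert race == races

* Confirm that racen is numeric (long) while race and races are strings
describe race racen races
```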
You can watch a demonstration of these commands by clicking on the link to the YouTube video below. You can read more about these commands by clicking on the links to the Stata manual entries below.
Watch Data management: How to convert categorical string variables to labeled numeric variables.
Read more in the Stata Data Management Reference Manual; see [D] describe, [D] encode, and [D] save. In the Stata Base Reference Manual, see [R] summarize.