Long strings
You can now use Stata’s string variables to hold exceedingly long strings, even the entire contents of text files or binary files.
Say we have data on 500 patients stored in our Stata dataset patients.dta.
We have doctors’ notes stored in 500 other files with names like notes_17213.xyz, notes_18417.xyz, and so on. The number in the filename is the patient’s ID.
We have the variable patid containing the patient IDs.
We can read all 500 files into our dataset:
. generate strL notes = fileread("notes_" + string(patid) + ".xyz")
Just as easily, we could re-create all 500 files.
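As a rough sketch of what re-creating the files could look like, the function filewrite() writes a string to a file and returns the number of bytes in the resulting file. The example below is only an illustration: the two-patient dataset and its note text are made up to stand in for patients.dta, and the files are written to the current directory.

```stata
* Hypothetical stand-in for patients.dta: two patients with short notes
clear
input long patid str40 txt
17213 "Follow-up visit. T1DM, stable."
18417 "Annual check-up. No concerns."
end
generate strL notes = txt

* Write each note to its own file in the current directory.
* The third argument, 1, asks filewrite() to replace an existing file.
generate long nbytes = filewrite("notes_" + string(patid) + ".xyz", notes, 1)

* Read the files back in and confirm that the round trip preserved the notes
generate strL check = fileread("notes_" + string(patid) + ".xyz")
assert check == notes

list patid nbytes
```

Without the third argument, filewrite() does not overwrite a file that already exists, so the replace flag is worth a second look before running this against real files.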
We want to know whether the doctor’s notes mention Diabetes Mellitus Type 1, which the doctor would have written as T1DM. We could type
. generate t1dm = ( strpos(notes, "T1DM") != 0 )
to create variable t1dm, which equals 1 if T1DM appears in the patient’s notes and 0 otherwise.
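One point to keep in mind is that strpos() is case sensitive. The sketch below uses three made-up notes (not real patient data) to show the flag, along with a hypothetical variant, t1dm_any, that uppercases the notes before searching.

```stata
* Three made-up notes for illustration only
clear
input long patid str40 txt
1 "Dx: T1DM, on insulin."
2 "Dx: hypertension."
3 "t1dm suspected."
end
generate strL notes = txt

* strpos() returns the position of the first match, or 0 if there is none
generate byte t1dm = strpos(notes, "T1DM") != 0

* Case-insensitive variant: uppercase the notes before searching
generate byte t1dm_any = strpos(upper(notes), "T1DM") != 0

list patid t1dm t1dm_any
```

Here the exact-case flag is set only for patient 1, while the case-insensitive flag is set for patients 1 and 3.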
We could also type
. list glucose age weight if strpos(notes, "T1DM")
to list glucose level, age, and weight for the patients whose notes record Diabetes Mellitus Type 1.
We could even type
. regress glucose age weight if strpos(notes, "T1DM")
to run a regression of glucose level on age and weight using only those patients.
Now for some details ...
A string is a sequence of characters:
Samuel Smith
California
U.K.
Strings can be stored in Stata datasets as string variables.
. webuse auto
(1978 Automobile Data)

. describe make

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------
make            str18   %-18s                 Make and Model
The variable make is a str18 variable. It can contain strings up to 18 characters long. The strings are not all 18 characters long:
. list make in 1/2

     +-------------+
     | make        |
     |-------------|
  1. | AMC Concord |
  2. | AMC Pacer   |
     +-------------+
All str18 means is that the variable cannot hold a string longer than 18 characters. Even that is unimportant because Stata automatically promotes str# variables to be longer when required:
. replace make = "Mercedes Benz Gullwing" in 1
make was str18 now str22
(1 real change made)
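To see the promotion for ourselves, we can ask Stata for the storage type before and after the change. This is a small sketch, assuming the copy of the auto dataset that ships with Stata (loaded with sysuse rather than webuse so that no internet connection is needed); the macro names before and after are our own.

```stata
sysuse auto, clear

* Storage type before the change
local before : type make
display "`before'"

replace make = "Mercedes Benz Gullwing" in 1

* Storage type after Stata has promoted the variable
local after : type make
display "`after'"

* The new value is 22 characters long, hence the promotion to str22
display strlen("Mercedes Benz Gullwing")
```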
The string-variable storage types are str1, str2, ..., str2045, and strL.
Think of it like this: after 2,045 comes L. The L stands for long. strL is pronounced sturl.
strL variables work just like str# variables:
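. webuse auto, clear
(1978 Automobile Data)

. generate strL mymake = make

. describe mymake

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------
mymake          strL    %9s

. list mymake in 1/2

     +-------------+
     | mymake      |
     |-------------|
  1. | AMC Concord |
  2. | AMC Pacer   |
     +-------------+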
strL variables can be exceedingly long, but that is not required.
We can replace strL values just as we can replace str# values:
. replace mymake = "Mercedes Benz Gullwing" in 1
(1 real change made)
We can use string functions with strL variables just as we can with str# variables:
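. generate len = strlen(mymake)

. generate strL first5 = substr(mymake, 1, 5)

. list mymake len first5 in 1/2

     +---------------------------------------+
     | mymake                   len   first5 |
     |---------------------------------------|
  1. | Mercedes Benz Gullwing    22    Merce |
  2. | AMC Pacer                  9    AMC P |
     +---------------------------------------+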
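The same functions keep working once a value grows past the 2,045-character limit of str# variables. The sketch below builds an artificial 5,120-character value by repeated doubling (the variable names s, len, and tail are ours, chosen for illustration) and then applies strlen() and substr() to it.

```stata
clear
set obs 1

* Start with 10 characters and double the string 9 times: 10 * 2^9 = 5,120
generate strL s = "0123456789"
forvalues i = 1/9 {
    replace s = s + s
}

* strlen() reports the full length, well beyond what a str# could hold
generate long len = strlen(s)

* A negative start position in substr() counts from the end of the string
generate str10 tail = substr(s, -10, 10)

list len tail
```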
We can even make tabulations:
. generate strL brand = word(mymake, 1)

. tabulate brand

      brand |      Freq.     Percent        Cum.
------------+-----------------------------------
        AMC |          2        2.70        2.70
       Audi |          2        2.70        5.41
        BMW |          1        1.35        6.76
  (output omitted)
         VW |          4        5.41       98.65
      Volvo |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00
strLs can hold binary strings. A binary string is, technically speaking, any string that contains binary 0. Here is a silly example:
. webuse auto, clear
(1978 Automobile Data)

. replace make = "a" + char(0) + "b" in 1
(make was str18 now strL)
(1 real change made)

. list make in 1

     +------+
     | make |
     |------|
  1. | a\0b |
     +------+
list displays binary zeros as \0.
str# variables cannot contain binary 0. strL variables can.
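A quick way to check this is to store a binary string and then inspect the variable's storage type. The sketch below assumes the copy of the auto dataset that ships with Stata (loaded with sysuse); the names t and nbytes are ours. We would expect strlen() to count the binary 0 as one byte, giving a length of 3, but treat that as something to verify rather than a guarantee.

```stata
sysuse auto, clear

* Storing a value that contains binary 0 forces make to become a strL
replace make = "a" + char(0) + "b" in 1

* Confirm the storage type after the change
local t : type make
display "`t'"

* Length in bytes of each value, including any binary 0
generate long nbytes = strlen(make)
list make nbytes in 1
```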
Read all about long strings and BLOBs in the manual entry.
See New in Stata 18 to learn about what was added in Stata 18.