The Stata News

In the spotlight: Storing long strings and entire files in Stata datasets

Did you know that you can store files such as PDFs, images, video files, and audio files as observations in a Stata dataset? Well, you can, using strL variables! A strL is a type of variable that can store strings up to two billion characters long. Those strings can hold ASCII text, Unicode text, or binary large objects (BLOBs) such as files. And all of Stata's string functions work with strL variables.

Let's see what we can do with strLs.

Importing files into strL variables

The BrainTumor.dta dataset below contains fictitious survival data for patients with a brain tumor who received surgery or chemotherapy.

. use BrainTumor.dta
(Brain tumor data)

. describe id stime died age surgery chemotherapy

              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------
id              int     %8.0g                 Patient Identifier
stime           float   %8.0g                 Survival Time (Days)
died            byte    %8.0g                 Survival Status (1=dead)
age             byte    %8.0g                 Age
surgery         byte    %8.0g                 Surgery
chemotherapy    byte    %8.0g                 Chemotherapy

. list id stime died age surgery chemotherapy in 1/5

       id   stime   died   age   surgery   chemot~y
  1.    1      50      1    30         0          0
  2.    2       6      1    51         0          0
  3.    3      16      0    54         0          1
  4.    3      16      1    54         0          1
  5.    4      39      0    40         0          1

For each patient, I also have a .jpg file that contains a magnetic resonance imaging (MRI) scan of the patient's brain, a text file that contains his or her physician's notes, and a text file that contains genetic sequence data from the tumor.

. ls Patient1_*
 724.4k   6/06/13  1:07  Patient1_MRI.jpg  
   1.0k   6/06/13 11:27  Patient1_PhysicianNotes.txt
   9.8k   6/06/13  0:59  Patient1_sequence.txt

I can import each file into a strL variable and incorporate information from some of them into my data analysis.

Let's begin with the physician's notes, which contain information such as the patient's ICD9 diagnostic code.

We can use generate and specify strL as the storage type to create a new variable called notes.

generate strL notes = fileread("Patient" + string(id) + "_PhysicianNotes.txt")

All the note files follow the same naming pattern: the word "Patient", followed by the patient's id, followed by the string "_PhysicianNotes.txt". The fileread() function therefore reads each patient's file into the strL variable notes. Because the filename is built from string(id), each file is stored in the observations that have the corresponding value of id.
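If a notes file might be missing for some patients, the import can be guarded. The following is only a minimal sketch and not part of the original analysis: the variable names notesfile and notes_checked are made up for illustration, and it assumes Stata's fileexists() and filereaderror() functions are available alongside fileread().

```stata
* Hypothetical variable names: notesfile, notes_checked

* Build each patient's filename once so it can be reused
generate notesfile = "Patient" + string(id) + "_PhysicianNotes.txt"

* Read the file only where it exists; other observations get an empty string
generate strL notes_checked = fileread(notesfile) if fileexists(notesfile)

* filereaderror() returns 0 unless its argument is a fileread() error message
assert filereaderror(notes_checked) == 0

* Count the observations for which no notes file was found
count if notes_checked == ""
```

The same pattern should apply to the sequence and MRI files by changing the filename suffix.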

Each genetic sequence file contains long strings of the letters a, t, g, and c, which represent the nucleotides of the tumor DNA.

Again, we can import the sequence files by using a combination of the fileread() and string() functions.

generate strL sequence = fileread("Patient" + string(id) + "_sequence.txt")

The MRI images are stored as .jpg files, and we can import them the same way:

generate strL mri = fileread("Patient" + string(id) + "_MRI.jpg") 

When we describe the data, we see that the variables notes, sequence, and mri all have storage type strL.

. describe id stime died age surgery chemotherapy notes sequence mri

              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------
id              int     %8.0g                 Patient Identifier
stime           float   %8.0g                 Survival Time (Days)
died            byte    %8.0g                 Survival Status (1=dead)
age             byte    %8.0g                 Age
surgery         byte    %8.0g                 Surgery
chemotherapy    byte    %8.0g                 Chemotherapy
notes           strL    %9s
sequence        strL    %9s
mri             strL    %9s

We can also list the first part of each strL by using the string() option of list, which truncates string variables to the given number of characters.

. list id stime died age surgery chemotherapy notes sequence mri in 1, string(40)

  1.   id   stime   died   age   surgery   chemot~y
        1      50      1    30         0          0

       notes
       Physician: Maxwell Edison, MD Patient..

       sequence
       gtgcaccaactgcgatagcggtacgggttcacggacagca..

       mri
       ��\0JFIF\0\0H\0H\0\0�\0�Exif\0\0I..

What can we do with strL variables?

We can use the variables notes and sequence as arguments in string functions because they contain ASCII data. For example, I could fit a Cox regression model restricted to patients whose notes contain the ICD9 diagnostic code 239.6.

. stcox surgery chemotherapy if strpos(notes, "239.6"), nolog

         failure _d:  died
   analysis time _t:  t1
                 id:  id

Cox regression -- Breslow method for ties

No. of subjects =           60                  Number of obs    =          98
No. of failures =           51
Time at risk    =      22081.1
                                                LR chi2(2)       =       19.95
Log likelihood  =   -164.91915                  Prob > chi2      =      0.0000

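------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     surgery |   .3204121   .1964526    -1.86   0.063     .0963421    1.065619
chemotherapy |   .3484253   .1045638    -3.51   0.000      .193491    .6274202
------------------------------------------------------------------------------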
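The same restriction can be stored as an indicator variable, which makes it easy to check how many observations match before fitting the model. This is a sketch under assumptions: the variable name icd2396 is hypothetical, and the match is a plain substring search, so it would also flag notes in which "239.6" appears as part of a longer number.

```stata
* Hypothetical variable name: icd2396

* 1 if the notes contain the text "239.6", 0 otherwise
generate byte icd2396 = strpos(notes, "239.6") > 0

* Check how many observations carry the code
tabulate icd2396

* Same restriction as the strpos() condition in the stcox command above
stcox surgery chemotherapy if icd2396, nolog
```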

Or I could generate an indicator variable for the presence of a specific genetic mutation in the patient's DNA sequence data and include it as a covariate in my Cox model.

. generate mutation = strpos(sequence, "atttatg") != 0 

. stcox surgery chemotherapy mutation, nolog

         failure _d:  died
   analysis time _t:  t1
                 id:  id

Cox regression -- Breslow method for ties

No. of subjects =          103                  Number of obs    =         172
No. of failures =           75
Time at risk    =      31938.1
                                                LR chi2(3)       =       32.35
Log likelihood  =   -282.13864                  Prob > chi2      =      0.0000

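------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     surgery |    .463197   .2046572    -1.74   0.082     .1948382    1.101178
chemotherapy |   .2972952   .0744543    -4.84   0.000     .1819759    .4856932
    mutation |   1.781056   .5909914     1.74   0.082     .9294613    3.412903
------------------------------------------------------------------------------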
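Because strL variables work with the ordinary string functions, the sequence data could also be summarized in other ways, for example by counting how often the motif occurs rather than only flagging its presence. The following is a sketch, not part of the original analysis; the variable name n_motif is hypothetical. It removes every copy of the motif with subinstr() and divides the drop in length by the motif's length, which counts non-overlapping occurrences.

```stata
* Hypothetical variable name: n_motif
local motif "atttatg"

* Occurrences = (original length - length with motif removed) / motif length
generate long n_motif = (strlen(sequence) -                        ///
    strlen(subinstr(sequence, "`motif'", "", .))) / strlen("`motif'")

* Inspect the distribution of counts across observations
summarize n_motif
```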

I can't view the MRI images stored in mri within Stata, but I can export the files and view them with other software. Being able to view the original MRI image could be useful if I needed to double-check the raw data.

generate len = filewrite("MRI_for_Patient53.jpg", mri) if id==53
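To export every patient's image rather than a single one, the filename can be built from id in the same way as for the import. This is a minimal sketch under assumptions: the names firstrec and nbytes and the MRI_export_ filename prefix are invented for illustration. Because the dataset can hold more than one record per patient, the sketch tags one observation per id so that each file is written once.

```stata
* Hypothetical names: firstrec, nbytes, and the MRI_export_ filename prefix

* Tag one observation per patient so each image is written only once
egen byte firstrec = tag(id)

* A third argument of 1 tells filewrite() to overwrite an existing file
generate long nbytes = filewrite("MRI_export_" + string(id) + ".jpg", mri, 1) if firstrec

* filewrite() returns the number of bytes written, or a negative error code
assert nbytes >= 0 if firstrec
```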

What have we learned?

strL variables allow us to store long strings—including entire files—in Stata. Here I used a biomedical example, but you could also store scanned copies of original survey data, audio recordings of interviews that have been transcribed to text, and many other kinds of data that are stored in separate files.

— Chuck Huber
Senior Statistician
