Statalist The Stata Listserver

Re: st: binary format type str question

From   [email protected] (William Gould, Stata)
To   [email protected]
Subject   Re: st: binary format type str question
Date   Tue, 13 Mar 2007 09:42:55 -0500

Mark Fisher <[email protected]> writes, 

> I'm writing a Mathematica program to read stata "dta" files. [...]  I have
> Everything seems to work fine [...]  But I can't figure out how to properly
> read the data when the data types are in the range 1 to 244 (str1, str2, ...
> str244).  [...]

David Kantor <[email protected]> speculated "that the string types are stored
such that...  they have a 0-byte terminator if they are shorter than the
maximal length of the type; they have no terminator otherwise".

That would have been my guess as to Mark's problem, too, but Mark says No.

I want to suggest Mark become familiar with Stata's -hexdump- command.

Here's an example I just did:

. describe

Contains data from example.dta
  obs:             2                          
 vars:             4                          13 Mar 2007 08:37
 size:            22 (99.9% of memory free)
              storage  display     value
variable name   type   format      label      variable label
a               byte   %8.0g                  
b               str2   %9s                    
c               str3   %9s                    
d               byte   %8.0g                  
Sorted by:  

. list 

     | a    b   c   d |
  1. | 1    x       2 |
  2. | 3   yz   a   4 |

. hexdump example.dta 
                 |                                         |    character
                 |           hex representation            |  representation
         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
               0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q............... 
              10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E....... 
              20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú 
              30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,....... 
                 |                                         |
              40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................ 
              50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma 
              60 | 7220 3230 3037 2030 383a 3337 00fb 0203 | r 2007 08:37.û.. 
              70 | fb61 0000 0000 0000 0000 0000 0000 0000 | ûa.............. 
                 |                                         |
              80 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
              90 | 0000 6200 0000 0000 0000 0000 0000 0000 | ..b............. 
              a0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
              b0 | 0000 0063 0000 0000 0000 0000 0000 0000 | ...c............ 
                 |                                         |
              c0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
              d0 | 0000 0000 6400 0000 0000 0000 0000 0000 | ....d........... 
              e0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
              f0 | 0000 0000 0000 0000 0000 0000 0000 0025 | ...............% 
                 |                                         |
             100 | 382e 3067 0000 0000 0000 0025 3973 0000 | 8.0g.......%9s.. 
             110 | 0000 0000 0000 0025 3973 0000 0000 0000 | .......%9s...... 
             120 | 0000 0025 382e 3067 0000 0000 0000 0000 | ...%8.0g........ 
             130 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
               * |                                         |
             2f0 | 0000 0000 0000 0000 0000 0000 0178 0000 | .............x.. 
             300 | 0065 0203 797a 6100 0004                | .e..yza...       


Let's work our way through this while looking at -help dta-

1.  Header

The first 109 bytes are header.  109 base 10 = 6d base 16.  Here are 
bytes 0 through 6c from the dump:

         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
               0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q............... 
              10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E....... 
              20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú 
              30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,....... 
                 |                                         |
              40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................ 
              50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma 
              60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û.. 

Mark can read this.  Note that the data label and the time stamp are binary-0
terminated.  For example, the time stamp is:

              50 |                            31 3320 4d61 | ..F........13 Ma 
              60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û.. 
                                                binary 0
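
Here is a minimal Python sketch of reading those 109 bytes.  It is an
illustration only, not part of Stata.  The field order (format byte,
byteorder, filetype, unused, nvar, nobs, 81-byte data label, 18-byte time
stamp) and the meaning byteorder 2 = low-byte-first are my reading of
-help dta- for this format and should be checked against it; parse_header()
and cstring() are names made up for the example.

```python
import struct

# Bytes 0x00 through 0x6c of example.dta, copied from the dump above.
HEADER = bytes.fromhex(
    "7102 0100 0400 0200 0000 0000 0000 0300"
    "0000 0000 0000 cc00 4500 0000 0000 0000"
    "0000 0000 0000 ac4b 6600 0000 0000 d0fa"
    "b200 0000 0000 a0b1 2c0b ff7f 0000 0000"
    "0000 0000 0000 0300 0000 0000 0000 0500"
    "0000 4600 0000 0800 0000 0031 3320 4d61"
    "7220 3230 3037 2030 383a 3337 00"
)


def cstring(field):
    """Return the text before the first binary 0; bytes after it are junk."""
    return field.split(b"\0", 1)[0].decode("latin-1")


def parse_header(buf):
    ds_format, byteorder, filetype, _unused = struct.unpack_from("4B", buf, 0)
    # Assumption: byteorder 2 means low byte first (little-endian).
    endian = "<" if byteorder == 2 else ">"
    nvar, nobs = struct.unpack_from(endian + "HI", buf, 4)
    return {
        "ds_format": ds_format,
        "byteorder": byteorder,
        "nvar": nvar,
        "nobs": nobs,
        "data_label": cstring(buf[10:91]),   # 81 bytes
        "time_stamp": cstring(buf[91:109]),  # 18 bytes
    }


header = parse_header(HEADER)
print(header)
```

Note that the data label field is full of junk after its leading binary 0,
which is why cstring() cuts at the first 0 rather than stripping trailing 0s.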

2.  Descriptors

The descriptor has 5 components:

	component      length
	typelist       nvar
	varlist        nvar*33
	srtlist        nvar*2 + 2
	fmtlist        nvar*12
	lbllist        nvar*33

nvar = 4 in our case.  The descriptor starts at byte 109, so let's fill in the table:
                                                        -- in hex --
	component      length         begin    end      begin    end
	typelist            4           109    112         6d     70
	varlist           132           113    244         71     f4
	srtlist            10           245    254         f5     fe
	fmtlist            48           255    302         ff    12e
	lbllist           132           303    434        12f    1b2
	(by the way, I type in Stata -inbase 16 #- to convert from 
	 base 10 to base 16.  E.g., -inbase 16 109-.)
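
The same bookkeeping can be done mechanically.  The following Python sketch
(an illustration; descriptor_layout() is a made-up helper) takes nvar and the
component lengths listed above and returns inclusive begin/end offsets:

```python
HEADER_LEN = 109  # the descriptor starts right after the 109-byte header


def descriptor_layout(nvar, start=HEADER_LEN):
    """Return {component: (begin, end)} as inclusive byte offsets."""
    lengths = [
        ("typelist", nvar),
        ("varlist", nvar * 33),
        ("srtlist", nvar * 2 + 2),
        ("fmtlist", nvar * 12),
        ("lbllist", nvar * 33),
    ]
    layout = {}
    pos = start
    for name, length in lengths:
        layout[name] = (pos, pos + length - 1)
        pos += length
    return layout


# Print decimal and hex offsets side by side, as in the table above.
for name, (begin, end) in descriptor_layout(4).items():
    print(f"{name:<10}{begin:>6}{end:>6}{begin:>6x}{end:>6x}")
```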

So here is the typelist:

         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
              60 |                                 fb 0203 | r 2007 08:37.û.. 
              70 | fb                                      | ûa.............. 

The types are

	var. 1      fb = 251  -> byte
	var. 2       2 =   2  -> str2
	var. 3       3 =   3  -> str3
	var. 4      fb = 251  -> byte
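
A small Python sketch of that lookup follows.  Codes 1 to 244 (strings) and
251 (byte) come from the example above; the remaining numeric codes and their
widths (252 int, 253 long, 254 float, 255 double) are my recollection of
-help dta- for this format and should be verified there.  decode_type() is a
hypothetical helper name.

```python
# Assumption: numeric type codes and storage widths as I recall them from -help dta-.
NUMERIC_TYPES = {
    251: ("byte", 1),
    252: ("int", 2),
    253: ("long", 4),
    254: ("float", 4),
    255: ("double", 8),
}


def decode_type(code):
    """Map a typelist byte to (type name, width in bytes)."""
    if 1 <= code <= 244:
        return f"str{code}", code  # strN occupies exactly N bytes
    if code in NUMERIC_TYPES:
        return NUMERIC_TYPES[code]
    raise ValueError(f"unknown type code {code}")


typelist = bytes.fromhex("fb 02 03 fb")  # bytes 0x6d through 0x70 of the dump
types = [decode_type(code) for code in typelist]
record_len = sum(width for _name, width in types)
print(types, record_len)
```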

3.  Variable labels

Each variable label is 81 bytes long.  Variable labels start at byte 435:

					       -- in hex --
                    length      begin   end    begin    end
	var. 1      81            435   515      1b3    203
	var. 2      81            516   596      204    254
	var. 3      81            597   677      255    2a5
	var. 4      81            678   758      2a6    2f6

4.  Expansion fields

The expansion field starts at byte 759 (2f7 base 16).  Each expansion field begins with
							 -- in hex --
                                length    begin  end     begin    end 
		datatype byte        1    759    759       2f7    2f7
                len                  4    760    763       2f8    2fb
		(and repeats)

Our dataset contains:

         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
             2f0 |                  00 0000 0000           | .............x.. 

meaning datatype=0 and len=0, meaning there are no expansion fields.
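
A reader should not assume the expansion fields are empty, because the data
offset moves if they are not.  Here is a hedged Python sketch of skipping
them, assuming each field is a 1-byte datatype, a 4-byte len, and then len
bytes of contents, with datatype=0 and len=0 marking the end.
skip_expansion_fields() is a made-up name, and the little-endian default
matches this example file only.

```python
import struct


def skip_expansion_fields(buf, pos, endian="<"):
    """Return the offset of the first data byte after the expansion fields."""
    while True:
        datatype = buf[pos]
        (length,) = struct.unpack_from(endian + "I", buf, pos + 1)
        pos += 5  # 1-byte datatype + 4-byte len
        if datatype == 0 and length == 0:
            return pos  # terminator reached; the data starts here
        pos += length  # skip this field's contents


# In example.dta the five bytes at 759..763 are all 0, so the data starts at 764.
fake_file = bytes(759) + bytes(5) + b"data"
print(skip_expansion_fields(fake_file, 759))
```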

5.  The data (at last!)

The data starts at byte 764 (hex 2fc).  Each record is an observation, which
in our case is 1+2+3+1 = 7 bytes long (see 2. Descriptors, above).
Thus, we have 
					      -- in hex --
                    length     begin   end    begin    end
	obs 1.           7       764   770      2fc    302
	obs 2.           7       771   777      303    309
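
The arithmetic behind that table, as a short Python sketch
(observation_offsets() is a hypothetical helper; the widths are those of
byte, str2, str3, byte):

```python
def observation_offsets(data_start, widths, nobs):
    """Return inclusive (begin, end) byte offsets for each observation."""
    record_len = sum(widths)
    return [
        (data_start + i * record_len, data_start + (i + 1) * record_len - 1)
        for i in range(nobs)
    ]


offsets = observation_offsets(764, [1, 2, 3, 1], 2)
for number, (begin, end) in enumerate(offsets, start=1):
    print(f"obs {number}. {begin:>6}{end:>6}{begin:>6x}{end:>6x}")
```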

Observation 1 is

         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
             2f0 |                               0178 0000 | .............x.. 
             300 | 0065 02                                 | .e..yza...       

and observation 2 is

         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
             300 |        03 797a 6100 0004                | .e..yza...       

Let's break apart observation 1:

                 type    hex value    meaning
	var 1.   byte    01           numeric 1
	var 2.   str2    7800         string 7800 = "x"  (0 terminated)
	var 3.   str3    000065       string 000065 = "" (0 terminated)
	var 4.   byte    02           numeric 2

Note that var 3 is 000065.  The binary 0 is right up front, so the string 
is "".  The 0065 that follows is junk and is ignored.

Let's break apart observation 2:

                 type    hex value    meaning
	var 1.   byte    03           numeric 3
	var 2.   str2    797a         string 797a = "yz"  (not 0 terminated)
	var 3.   str3    610000       string 610000 = "a" (0 terminated)
	var 4.   byte    04           numeric 4

Note that var 2 is not zero terminated.  If we were storing the string 
in a language that required 0 termination (say C), we would code 

		memcpy(dest, bufpos, 2) ; dest[2] = '\0' ;
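
In a language with counted strings the same rule is "take the fixed-width
field and cut it at the first binary 0, if there is one".  Here is a minimal
Python sketch that decodes both observations from the 14 data bytes in the
dump.  read_str() and decode_record() are names made up for the example, the
field positions are hard-coded for this dataset (byte, str2, str3, byte), and
treating byte as a signed 1-byte integer is my reading of -help dta-.

```python
import struct

# Bytes 0x2fc through 0x309 of example.dta: two 7-byte observations.
DATA = bytes.fromhex("0178 0000 0065 02" "03 797a 6100 0004")


def read_str(field):
    """Decode a strN field: stop at the first 0 byte, else use all N bytes."""
    return field.split(b"\0", 1)[0].decode("latin-1")


def decode_record(record):
    """Decode one observation laid out as byte, str2, str3, byte."""
    a = struct.unpack_from("b", record, 0)[0]  # assumption: byte is signed
    b = read_str(record[1:3])
    c = read_str(record[3:6])
    d = struct.unpack_from("b", record, 6)[0]
    return a, b, c, d


rows = [decode_record(DATA[start:start + 7]) for start in (0, 7)]
for row in rows:
    print(row)
```

The result agrees with the -list- output at the top: the junk after the
leading 0 in var 3 of observation 1 is dropped, and the unterminated "yz"
in observation 2 is read in full.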


I hope this helps.  

Mark was worried that there was something strange about how strings appear 
in the .dta dataset.  There is nothing strange except for the lack of 0 
termination when the string is full length, and 0 termination when less 
than full length.

Mark needs to -hexdump- his dataset and then include debug code in his 
program.

-- Bill
[email protected]