Re: st: binary format type str question
Wow, thanks so much for your help. Let me say first that I don't have
access to Stata, so I can't do a -hexdump-. For reference, I'm using
http://www.stata.com/help.cgi?dta
Pretty much everything in this document (plus everything in your email)
makes sense to me. But I can't make the mapping between the typelist
that I get and what's in the data part of the file.
I've learned a bit more about the structure of the file in question.
I read the file (correctly, I think) right up to the point where the
data start. Then, in order to do some deconstruction, I simply read
*all* the remaining bytes in the file; there are only 1071 of them.
Since there are 6 variables (with types 98, 136, 102, 105, 102, and 98)
and 51 observations, I don't see how I can possibly account for all of
them since this only allows for 21 bytes per observation.
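The mismatch can be checked with a few lines of Python. This is only a sketch: it assumes the type codes are read the way the format page linked above describes them (codes 1 to 244 mean str1 to str244, so each code is also the field width in bytes). The variable names are made up for illustration.

```python
# Numbers taken from the message above.
typelist = [98, 136, 102, 105, 102, 98]
nobs = 51
data_bytes = 1071

# Bytes actually available per observation.
observed_width = data_bytes // nobs

# Assumption: every code in 1..244 is a strN field occupying N bytes.
assert all(1 <= code <= 244 for code in typelist)
implied_width = sum(typelist)

print("observed bytes per observation:", observed_width)
print("implied bytes per observation: ", implied_width)
```

Under that reading the two widths disagree, which is the puzzle described here.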
But a clear pattern emerges if I partition the list of bytes into a
matrix of 51 rows and 21 columns. The first column contains byte values
running consecutively from 1 to 51 --- apparently an index encoded as
the byte value itself. (How do I make a correspondence between type 98
and this variable?) The next two columns contain two characters: state
abbreviations (such as AL, AK, AZ, ...). (Again, how do I make a
correspondence between type 136 and this variable?) The next 7 columns
(that is, columns 3 to 9, counting the first column as 0) are identical
row by row: {0, 1, 12, 0, 0, 0, 64}. None of the remaining columns has
identical rows. (Some of the remaining columns have zeros in them.)
Anyway, that's where I stand. Is it possible this dta file was created
in a nonstandard way? (All the dta files I have are from Andrew Gelman's
web site for his new "Data Analysis" book. The one I can actually read
says "Written by R." in the data_label.) Are there other dta files
available on the web that I can experiment with?
--Mark.
William Gould, Stata wrote:
Mark Fisher <[email protected]> writes,
I'm writing a Mathematica program to read stata "dta" files. [...] I have
Everything seems to work fine [...] But I can't figure out how to properly
read the data when the data types are in the range 1 to 244 (str1, str2, ...
str244). [...]
David Kantor <[email protected]> speculated "that the string types are stored
such that... they have a 0-byte terminator if they are shorter than the
maximal length of the type; they have no terminator otherwise".
That would have been my guess as to Mark's problem, too, but Mark says No.
I want to suggest Mark become familiar with Stata's -hexdump- command.
Here's an example I just did:
============================================================================
. describe
Contains data from example.dta
  obs:             2
 vars:             4                          13 Mar 2007 08:37
 size:            22 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
a               byte   %8.0g
b               str2   %9s
c               str3   %9s
d               byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:
. list
     +----------------+
     | a    b   c   d |
     |----------------|
  1. | 1    x       2 |
  2. | 3   yz   a   4 |
     +----------------+
. hexdump example.dta
| | character
| hex representation | representation
address | 0 1 2 3 4 5 6 7 8 9 a b c d e f | 0123456789abcdef
-----------------+-----------------------------------------+-----------------
0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
| |
40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
60 | 7220 3230 3037 2030 383a 3337 00fb 0203 | r 2007 08:37.û..
70 | fb61 0000 0000 0000 0000 0000 0000 0000 | ûa..............
| |
80 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
90 | 0000 6200 0000 0000 0000 0000 0000 0000 | ..b.............
a0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
b0 | 0000 0063 0000 0000 0000 0000 0000 0000 | ...c............
| |
c0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
d0 | 0000 0000 6400 0000 0000 0000 0000 0000 | ....d...........
e0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
f0 | 0000 0000 0000 0000 0000 0000 0000 0025 | ...............%
| |
100 | 382e 3067 0000 0000 0000 0025 3973 0000 | 8.0g.......%9s..
110 | 0000 0000 0000 0025 3973 0000 0000 0000 | .......%9s......
120 | 0000 0025 382e 3067 0000 0000 0000 0000 | ...%8.0g........
130 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
* | |
2f0 | 0000 0000 0000 0000 0000 0000 0178 0000 | .............x..
300 | 0065 0203 797a 6100 0004 | .e..yza...
============================================================================
Let's work our way through this while looking at -help dta-
1. Header
----------
The first 109 bytes are header. 109 base 10 = 6d base 16. Here are
bytes 0 through 6c from the dump:
address | 0 1 2 3 4 5 6 7 8 9 a b c d e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
| |
40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
60 | 7220 3230 3037 2030 383a 3337 00 | r 2007 08:37.û..
Mark can read this. Note that the data label and the time stamp are
binary-0 terminated. For example, the time stamp is:
50 | 31 3320 4d61 | ..F........13 Ma
60 | 7220 3230 3037 2030 383a 3337 00 | r 2007 08:37.û..
\
binary 0
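For anyone following along without Stata, here is a minimal Python sketch that reads those same 109 bytes. The field layout is my reading of -help dta- for this format (1-byte format, byte order, filetype and unused fields, a 2-byte nvar, a 4-byte nobs, an 81-byte data label, an 18-byte time stamp). The function names and the latin-1 decoding are assumptions for illustration.

```python
import struct

# Bytes 0 through 6c, copied from the dump above.
HEADER = bytes.fromhex(
    "7102 0100 0400 0200 0000 0000 0000 0300 "
    "0000 0000 0000 cc00 4500 0000 0000 0000 "
    "0000 0000 0000 ac4b 6600 0000 0000 d0fa "
    "b200 0000 0000 a0b1 2c0b ff7f 0000 0000 "
    "0000 0000 0000 0300 0000 0000 0000 0500 "
    "0000 4600 0000 0800 0000 0031 3320 4d61 "
    "7220 3230 3037 2030 383a 3337 00"
)

def cstring(raw):
    """Keep everything before the first binary 0; the rest is junk."""
    return raw.split(b"\0", 1)[0].decode("latin-1")

def parse_header(buf):
    ds_format, byteorder = buf[0], buf[1]
    # byteorder 1 = HILO (big-endian), 2 = LOHI (little-endian)
    endian = ">" if byteorder == 1 else "<"
    # bytes 2 and 3 are filetype and an unused byte
    nvar, nobs = struct.unpack_from(endian + "hi", buf, 4)
    return {
        "ds_format": ds_format,
        "byteorder": byteorder,
        "nvar": nvar,
        "nobs": nobs,
        "data_label": cstring(buf[10:91]),
        "time_stamp": cstring(buf[91:109]),
    }

header = parse_header(HEADER)
print(header)
```

Note that the data label field starts with a binary 0 and is followed by leftover bytes, so it has to be cut at the first 0 rather than read as 81 characters.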
2. Descriptors
---------------
The descriptor has 5 components:
component       length
------------------------
typelist        nvar
varlist         nvar*33
srtlist         nvar*2 + 2
fmtlist         nvar*12
lbllist         nvar*33
------------------------
nvar = 4 in our case. The descriptor starts at byte 109, so let's fill in the
table:
                                          -- in hex --
component    length    begin      end    begin      end
-------------------------------------------------------------
typelist          4      109      112       6d       70
varlist         132      113      244       71       f4
srtlist          10      245      254       f5       fe
fmtlist          48      255      302       ff      12e
lbllist         132      303      434      12f      1b2
-------------------------------------------------------------
(by the way, I type in Stata -inbase 16 #- to convert from
base 10 to base 16. E.g., -inbase 16 109-.)
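The same table can be produced outside Stata. This is a small Python sketch of the arithmetic, using the component lengths listed above. The function name and the default start of 109 are just for illustration.

```python
def descriptor_layout(nvar, start=109):
    """Return {component: (length, begin, end)} with inclusive byte offsets."""
    sizes = [
        ("typelist", nvar),
        ("varlist", nvar * 33),
        ("srtlist", nvar * 2 + 2),
        ("fmtlist", nvar * 12),
        ("lbllist", nvar * 33),
    ]
    layout = {}
    begin = start
    for name, length in sizes:
        end = begin + length - 1
        layout[name] = (length, begin, end)
        begin = end + 1  # the next component starts right after this one
    return layout

layout = descriptor_layout(4)
for name, (length, begin, end) in layout.items():
    print(f"{name:<9}{length:>6}{begin:>6}{end:>6}{begin:>6x}{end:>6x}")
```

The byte after the last component (end of lbllist plus 1) is where the variable labels begin.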
So here is the typelist:
address | 0 1 2 3 4 5 6 7 8 9 a b c d e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
60 | fb 0203 | r 2007 08:37.û..
70 | fb | ûa..............
The types are
               type
------------------------------
var. 1    fb = 251  ->  byte
var. 2     2 =   2  ->  str2
var. 3     3 =   3  ->  str3
var. 4    fb = 251  ->  byte
------------------------------
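In a reader program, that lookup might look like the Python sketch below. The numeric codes 252 to 255 do not occur in this example. I am taking them from -help dta-, and the names `NUMERIC_TYPES` and `type_name` are invented here.

```python
# Numeric type codes as I read them in -help dta- (only 251 appears above).
NUMERIC_TYPES = {251: "byte", 252: "int", 253: "long", 254: "float", 255: "double"}

def type_name(code):
    """Map one typelist byte to a storage type name."""
    if 1 <= code <= 244:
        return f"str{code}"  # the code is the string length
    if code in NUMERIC_TYPES:
        return NUMERIC_TYPES[code]
    raise ValueError(f"unexpected type code {code}")

typelist = bytes.fromhex("fb 02 03 fb")  # bytes 6d through 70 of the dump
names = [type_name(code) for code in typelist]
print(names)
```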
3. Variable labels
-------------------
Each variable label is 81 bytes long. Variable labels start at byte 435:
                                    -- in hex --
          length    begin     end    begin     end
--------------------------------------------------
var. 1        81      435     515      1b3     203
var. 2        81      516     596      204     254
var. 3        81      597     677      255     2a5
var. 4        81      678     758      2a6     2f6
--------------------------------------------------
4. Expansion fields
--------------------
The expansion field starts at byte 759 (2f7 base 16). The expansion field
contains
                                           -- in hex --
                 length    begin     end    begin     end
-----------------------------------------------------
datatype byte         1      759     759      2f7     2f7
len                   4      760     763      2f8     2fb
(and repeats)
-----------------------------------------------------
Our dataset contains:
address | 0 1 2 3 4 5 6 7 8 9 a b c d e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
2f0 | 00 0000 0000 | .............x..
meaning datatype=0 and len=0, meaning there are no expansion fields.
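A reader has to loop here, because the (datatype, len) pair repeats until both are 0. Below is a hedged Python sketch. It assumes each non-final pair is followed by len bytes of contents, which is how I read -help dta-. The function name and the little-endian default are illustrative.

```python
import struct

def skip_expansion_fields(buf, pos, endian="<"):
    """Walk the expansion fields; return (fields, offset of first data byte)."""
    fields = []
    while True:
        data_type = buf[pos]
        (length,) = struct.unpack_from(endian + "i", buf, pos + 1)
        pos += 5  # 1 byte of datatype plus 4 bytes of len
        if data_type == 0 and length == 0:
            return fields, pos
        # Assumption: the contents follow immediately and are `length` bytes long.
        fields.append((data_type, buf[pos:pos + length]))
        pos += length

# 759 placeholder bytes stand in for everything before the expansion field,
# followed by the five bytes shown above (2f7 through 2fb).
example = bytes(759) + bytes.fromhex("00 0000 0000")
fields, data_start = skip_expansion_fields(example, 759)
print(fields, data_start)
```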
5. The data (at last!)
-----------------------
The data start at byte 764 (hex 2fc). Each record is an observation, which,
in our case, is 1+2+3+1 = 7 bytes long (see 2. Descriptors, above).
Thus, we have
                                    -- in hex --
          length    begin     end    begin     end
--------------------------------------------------
obs 1.         7      764     770      2fc     302
obs 2.         7      771     777      303     309
--------------------------------------------------
Observation 1 is
address | 0 1 2 3 4 5 6 7 8 9 a b c d e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
2f0 | 0178 0000 | .............x..
300 | 0065 02 | .e..yza...
and observation 2 is
address | 0 1 2 3 4 5 6 7 8 9 a b c d e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
300 | 03 797a 6100 0004 | .e..yza...
Let's break apart observation 1:
          type   hex value   meaning
------------------------------------------------------
var 1.    byte   01          numeric 1
var 2.    str2   7800        string 7800 = "x" (0 terminated)
var 3.    str3   000065      string 000065 = "" (0 terminated)
var 4.    byte   02          numeric 2
------------------------------------------------------
Note that var 3 is 000065. The binary 0 is right up front, so the string
is "". The 0065 that follows is junk and is ignored.
Let's break apart observation 2:
          type   hex value   meaning
--------------------------------------------------------------------
var 1.    byte   03          numeric 3
var 2.    str2   797a        string 797a = "yz" (not 0 terminated)
var 3.    str3   610000      string 610000 = "a" (0 terminated)
var 4.    byte   04          numeric 4
--------------------------------------------------------------------
Note that var 2 is not zero terminated. If we were storing the string
in a language that required 0 termination (say C), we would code
memcpy(dest, bufpos, 2) ; dest[2] = '\0' ;
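Putting the string rule together with the record layout, here is a minimal Python sketch that decodes the 14 data bytes of example.dta. It handles only the two kinds of type that occur in this file (strN and byte), does not deal with missing-value codes, and assumes latin-1 text. The helper names are invented for illustration.

```python
# Bytes 2fc through 309 of the dump: two 7-byte observations.
DATA = bytes.fromhex("01 7800 000065 02" "03 797a 610000 04")

def read_str(raw):
    """Cut at the first binary 0 if there is one; a full-length string has none."""
    return raw.split(b"\0", 1)[0].decode("latin-1")

def read_observations(buf, typelist, nobs):
    rows, pos = [], 0
    for _ in range(nobs):
        row = []
        for code in typelist:
            if 1 <= code <= 244:  # strN: N bytes, 0 terminated only if shorter
                row.append(read_str(buf[pos:pos + code]))
                pos += code
            elif code == 251:  # byte: one signed byte
                row.append(int.from_bytes(buf[pos:pos + 1], "little", signed=True))
                pos += 1
            else:
                raise NotImplementedError(f"type {code} is not handled in this sketch")
        rows.append(tuple(row))
    return rows

rows = read_observations(DATA, [251, 2, 3, 251], 2)
print(rows)
```

The rows agree with the -list- output shown earlier, including the empty string for c in observation 1 despite the junk byte 65.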
Conclusion
----------
I hope this helps.
Mark was worried that there was something strange about how strings appear
in the .dta dataset. There is nothing strange except for the lack of 0
termination when the string is full length, and 0 termination when less
than full length.
Mark needs to -hexdump- his dataset and then include debug code in his
program.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/