Re: st: binary format type str question
Hello Mark,
if you haven't solved this problem yet, I would suggest that you use another
dataset to see if the problem is file-specific or code-specific.
E.g. try a trivial case -- a dataset with only one string variable and see
if your code can get it right. Alternatively try it with a publicly
available dataset, so that the statalisters can also have a look at it.
You mentioned that you read all the data as one chunk. I would suggest
reading the data observation by observation instead, defining a record
structure based on the file header.
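A minimal Python sketch of that record-by-record approach. It assumes the type codes used in the example.dta walkthrough quoted later in this message (251 = byte, 1..244 = strN) and little-endian data; the helper names record_struct and iter_records are hypothetical, and other numeric types are deliberately left out.

```python
import struct

# Assumed type codes, taken from the example quoted later in this message:
# 1..244 -> strN stored in N bytes; 251 -> 1-byte integer.
def record_struct(typelist, byteorder="<"):
    """Build a struct.Struct describing the layout of one observation."""
    fmt = byteorder
    for code in typelist:
        if 1 <= code <= 244:
            fmt += f"{code}s"   # fixed-width string field, returned as bytes
        elif code == 251:
            fmt += "b"          # signed 1-byte integer
        else:
            raise ValueError(f"type code {code} is not handled in this sketch")
    return struct.Struct(fmt)

def iter_records(data, typelist, nobs, byteorder="<"):
    """Yield one tuple of raw fields per observation."""
    rec = record_struct(typelist, byteorder)
    expected = rec.size * nobs
    if len(data) != expected:
        raise ValueError(f"expected {expected} data bytes, got {len(data)}")
    yield from rec.iter_unpack(data)

# The 14 data bytes of example.dta (2 observations of 7 bytes each).
data = bytes.fromhex("01780000006502" "03797a61000004")
rows = list(iter_records(data, [251, 2, 3, 251], nobs=2))
print(rows)
```

The explicit size check is the useful part: a mismatch between the record width implied by the typelist and the bytes actually present shows up immediately instead of as shifted columns.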
If the size of the data area is different from what you expect, check if you
handle the Hi/Lo byte order correctly when you read the header.
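To illustrate the byte-order check, here is a hedged Python sketch that reads the variable and observation counts from the first ten header bytes of the example.dta dump quoted later in this message. The field layout (format byte, byte-order byte, filetype, padding, 2-byte nvar, 4-byte nobs) and the codes 0x01 = HILO, 0x02 = LOHI are my reading of -help dta- for this format, so treat them as assumptions; read_counts is a hypothetical name.

```python
import struct

def read_counts(header):
    """Return (ds_format, byteorder, nvar, nobs) from the first 10 header bytes."""
    ds_format, order_code = header[0], header[1]
    if order_code == 0x01:
        byteorder = ">"   # HILO: most significant byte first
    elif order_code == 0x02:
        byteorder = "<"   # LOHI: least significant byte first
    else:
        raise ValueError(f"unexpected byteorder code {order_code}")
    # Bytes 2-3 are assumed to be filetype and padding;
    # nvar is a 2-byte integer and nobs a 4-byte integer.
    nvar, nobs = struct.unpack_from(byteorder + "hi", header, 4)
    return ds_format, byteorder, nvar, nobs

header = bytes.fromhex("71020100040002000000")
print(read_counts(header))

# Reading the same bytes with the wrong byte order gives nonsense counts,
# which is one way the data area ends up a different size than expected.
wrong_nvar, wrong_nobs = struct.unpack_from(">hi", header, 4)
print(wrong_nvar, wrong_nobs)
```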
Another hint is the C code for reading Stata files (2002) by Thomas Lumley.
It is part of the foreign package for R. You can download it here:
(choose package source in gz format even if you work in windows. Windows
binary archive does not contain the source code).
Below is a UUEncoded trivial file:
begin 644 test.dta
sum -r/size 45278/314
Which looks in Stata as:
  obs:             6
 vars:             1                          14 Mar 2007 10:34
 size:            60 (99.9% of memory free)
              storage  display     value
variable name   type   format      label      variable label
var1            str6   %9s
Sorted by:
And contains the following data:
. l,noo
If your code can correctly parse this file, then the dataset that you have
might be written in a different format.
----- Original Message -----
From: "Mark Fisher" <[email protected]>
To: <[email protected]>
Sent: Tuesday, March 13, 2007 7:12 PM
Subject: Re: st: binary format type str question
Wow, thanks so much for your help. Let me say first that I don't have
access to Stata, so I can't do a -hexdump-. For reference, I'm using
Pretty much everything in this document (plus everything in your email)
makes sense to me. But I can't make the mapping between the typelist that
I get and what's in the data part of the file.
I've learned a bit more about the structure of the file in question.
I read the file (correctly, I think) right up to the point where the data
start. Then, in order to do some deconstruction, I simply read *all* the
remaining bytes in the file; there are only 1071 of them. Since there are
6 variables (with types 98, 136, 102, 105, 102, and 98) and 51
observations, I don't see how I can possibly account for all of them since
this only allows for 21 bytes per observation.
But a clear pattern emerges if I partition the list of bytes into a
matrix of 51 rows and 21 columns. The first column contains byte values
running consecutively from 1 to 51 --- apparently an index encoded as the
byte value itself. (How do I make a correspondence between type 98 and
this variable?) The next two columns contain two characters: state
abbreviations (such as AL, AK, AZ, ...). (Again, how do I make a
correspondence between type 136 and this variable?) The next 7 columns
(that is columns 3 to 9) are identical row by row: {0, 1, 12, 0, 0, 0,
64}. None of the remaining columns has identical rows. (Some of the
remaining columns have zeros in them.)
Anyway, that's where I stand. Is it possible this dta file was created in
a nonstandard way? (All the dta files I have are from Andrew Gelman's web
site for his new "Data Analysis" book. The one I can actually read says
"Written by R." in the data_label.) Are there other dta files available on
the web that I can experiment with?
William Gould, Stata wrote:
Mark Fisher <[email protected]> writes,
I'm writing a Mathematica program to read stata "dta" files. [...]
Everything seems to work fine [...] But I can't figure out how to
read the data when the data types are in the range 1 to 244 (str1, str2,
..., str244). [...]
David Kantor <[email protected]> speculated "that the string types are
such that... they have a 0-byte terminator if they are shorter than the
maximal length of the type; they have no terminator otherwise".
That would have been my guess as to Mark's problem, too, but Mark says No.
I want to suggest Mark become familiar with Stata's -hexdump- command.
Here's an example I just did:
. describe
Contains data from example.dta
  obs:             2
 vars:             4                          13 Mar 2007 08:37
 size:            22 (99.9% of memory free)
              storage  display     value
variable name   type   format      label      variable label
a               byte   %8.0g
b               str2   %9s
c               str3   %9s
d               byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:

. list
     +----------------+
     | a    b   c   d |
  1. | 1    x       2 |
  2. | 3   yz   a   4 |
     +----------------+
. hexdump example.dta
        |           hex representation            |
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f |
      0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
     10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
     20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
     30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
        |                                         |
     40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
     50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
     60 | 7220 3230 3037 2030 383a 3337 00fb 0203 | r 2007 08:37.û..
     70 | fb61 0000 0000 0000 0000 0000 0000 0000 | ûa..............
        |                                         |
     80 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
     90 | 0000 6200 0000 0000 0000 0000 0000 0000 | ..b.............
     a0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
     b0 | 0000 0063 0000 0000 0000 0000 0000 0000 | ...c............
        |                                         |
     c0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
     d0 | 0000 0000 6400 0000 0000 0000 0000 0000 | ....d...........
     e0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
     f0 | 0000 0000 0000 0000 0000 0000 0000 0025 | ...............%
        |                                         |
    100 | 382e 3067 0000 0000 0000 0025 3973 0000 | 8.0g.......%9s..
    110 | 0000 0000 0000 0025 3973 0000 0000 0000 | .......%9s......
    120 | 0000 0025 382e 3067 0000 0000 0000 0000 | ...%8.0g........
    130 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
      * |                                         |
    2f0 | 0000 0000 0000 0000 0000 0000 0178 0000 | .............x..
    300 | 0065 0203 797a 6100 0004                | .e..yza...
Let's work our way through this while looking at -help dta-
1. Header
The first 109 bytes are header. 109 base 10 = 6d base 16. Here are
bytes 0 through 6c from the dump:
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f |
      0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
     10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
     20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
     30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
        |                                         |
     40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
     50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
     60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û..
Mark can read this. Note that the data label and the time stamp are
0 terminated. For example, the time stamp is:
     50 |                            31 3320 4d61 | ..F........13 Ma
     60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û..
                                          \
                                           binary 0
2. Descriptors
The descriptor has 5 components:
component   length
typelist    nvar
varlist     nvar*33
srtlist     nvar*2 + 2
fmtlist     nvar*12
lbllist     nvar*33
nvar = 4 in our case. The descriptor starts at byte 109, so let's fill
in the table:
                                  -- in hex --
component   length  begin   end   begin   end
typelist         4    109   112      6d    70
varlist        132    113   244      71    f4
srtlist         10    245   254      f5    fe
fmtlist         48    255   302      ff   12e
lbllist        132    303   434     12f   1b2
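The offsets in this table follow mechanically from the component lengths. A small Python sketch, using only the lengths listed above and the stated starting byte of 109 (the function name descriptor_layout is hypothetical), recomputes them:

```python
def descriptor_layout(nvar, start=109):
    """Return {component: (length, begin, end)} with inclusive byte offsets."""
    lengths = [
        ("typelist", nvar),
        ("varlist", nvar * 33),
        ("srtlist", nvar * 2 + 2),
        ("fmtlist", nvar * 12),
        ("lbllist", nvar * 33),
    ]
    layout = {}
    pos = start
    for name, length in lengths:
        layout[name] = (length, pos, pos + length - 1)
        pos += length
    return layout

layout = descriptor_layout(4)
for name, (length, begin, end) in layout.items():
    # decimal begin/end followed by the same offsets in hex
    print(f"{name:8s} {length:6d} {begin:6d} {end:5d} {begin:7x} {end:5x}")
```

Running the same function with a different nvar gives the layout for another dataset, which makes it easy to compare against a hex dump.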
(by the way, I type in Stata -inbase 16 #- to convert from base 10 to
base 16. E.g., -inbase 16 109-.)
So here is the typelist:
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f |
     60 |                                 fb 0203 | r 2007 08:37.û..
     70 | fb                                      | ûa..............
The types are
var. 1    fb = 251  ->  byte
var. 2     2 =   2  ->  str2
var. 3     3 =   3  ->  str3
var. 4    fb = 251  ->  byte
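The mapping from type codes to storage types and widths can be sketched in Python as follows. Only 251 = byte and 1..244 = strN appear in this message; the codes 252 to 255 are added from my reading of -help dta- for this format and should be treated as an assumption, as should the helper name describe_type.

```python
# Assumed numeric type codes and widths in bytes; only 251 is confirmed
# by the example in this message.
NUMERIC_TYPES = {
    251: ("byte", 1),
    252: ("int", 2),
    253: ("long", 4),
    254: ("float", 4),
    255: ("double", 8),
}

def describe_type(code):
    """Return (type name, width in bytes) for one typelist entry."""
    if 1 <= code <= 244:
        return (f"str{code}", code)   # strN occupies exactly N bytes
    if code in NUMERIC_TYPES:
        return NUMERIC_TYPES[code]
    raise ValueError(f"unknown type code {code}")

typelist = bytes.fromhex("fb0203fb")          # bytes 6d through 70 of the dump
types = [describe_type(code) for code in typelist]
record_width = sum(width for _, width in types)
print(types, record_width)
```

The sum of the widths is the record length, which is the number to compare against the size of the data area.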
3. Variable labels
Each variable label is 81 bytes long. Variable labels start at byte 435:
                           -- in hex --
         length  begin   end   begin   end
var. 1       81    435   515     1b3   203
var. 2       81    516   596     204   254
var. 3       81    597   677     255   2a5
var. 4       81    678   758     2a6   2f6
4. Expansion fields
The expansion field starts at byte 759 (2f7 base 16). The expansion
field contains
                                    -- in hex --
                  length  begin   end   begin   end
-----------------------------------------------------
datatype byte          1    759   759     2f7   2f7
len                    4    760   763     2f8   2fb
(and repeats)
Our dataset contains:
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f |
    2f0 |                  00 0000 0000           | .............x..
meaning datatype=0 and len=0, meaning there are no expansion fields.
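A hedged Python sketch of stepping over the expansion fields. The 1-byte datatype, the 4-byte len, and the datatype=0, len=0 terminator are as described above; that each non-terminating entry is followed by len bytes of contents is my reading of -help dta- and is an assumption here. The name skip_expansion_fields is hypothetical.

```python
import struct

def skip_expansion_fields(buf, pos, byteorder="<"):
    """Return the offset of the first data byte after the expansion fields."""
    while True:
        datatype = buf[pos]
        (length,) = struct.unpack_from(byteorder + "i", buf, pos + 1)
        pos += 5                      # 1 byte datatype + 4 bytes len
        if datatype == 0 and length == 0:
            return pos                # terminator reached: data start here
        pos += length                 # assumed: skip the field's contents

# Stand-in for example.dta: 759 bytes of header/descriptors/labels (zeros
# here), the 5-byte terminator, then the 14 data bytes.
fake_file = bytes(759) + bytes(5) + bytes.fromhex("0178000000650203797a61000004")
data_start = skip_expansion_fields(fake_file, 759)
print(data_start, hex(data_start))
```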
5. The data (at last!)
The data start at byte 764 (hex 2fc). Each record is an observation and,
in our case, is 1+2+3+1 = 7 bytes long (see 2. Descriptors, above).
Thus, we have
                           -- in hex --
         length  begin   end   begin   end
obs 1.        7    764   770     2fc   302
obs 2.        7    771   777     303   309
Observation 1 is
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f |
    2f0 |                               0178 0000 | .............x..
    300 | 0065 02                                 | .e..yza...
and observation 2 is
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f |
    300 |        03 797a 6100 0004                | .e..yza...
Let's break apart observation 1:
         type   hex value   meaning
var 1.   byte   01          numeric 1
var 2.   str2   7800        string 7800 = "x" (0 terminated)
var 3.   str3   000065      string 000065 = "" (0 terminated)
var 4.   byte   02          numeric 2
Note that var 3 is 000065. The binary 0 is right up front, so the string
is "". The 0065 that follows is junk and is ignored.
Let's break apart observation 2:
         type   hex value   meaning
var 1.   byte   03          numeric 3
var 2.   str2   797a        string 797a = "yz" (not 0 terminated)
var 3.   str3   610000      string 610000 = "a" (0 terminated)
var 4.   byte   04          numeric 4
Note that var 2 is not zero terminated. If we were storing the string in
a language that required 0 termination (say C), we would code
memcpy(dest, bufpos, 2) ; dest[2] = '\0' ;
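The same rule in a language without 0-terminated strings, as a minimal Python sketch: cut the fixed-width field at the first 0 byte if there is one, otherwise take the full width. The function name decode_str and the latin-1 encoding are assumptions for illustration.

```python
def decode_str(field, encoding="latin-1"):
    """Decode a fixed-width strN field, stopping at the first 0 byte if any."""
    end = field.find(b"\x00")
    if end != -1:
        field = field[:end]   # anything after the first 0 is junk
    return field.decode(encoding)

# The four string fields from the two observations above.
fields = ["7800", "000065", "797a", "610000"]
decoded = [decode_str(bytes.fromhex(h)) for h in fields]
print(decoded)
```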
I hope this helps. Mark was worried that there was something strange about
how strings appear in the .dta dataset. There is nothing strange except
for the lack of 0 termination when the string is full length, and 0
termination when less than full length.
Mark needs to -hexdump- his dataset and then include debug code in his
-- Bill
[email protected]
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/