Re: st: binary format type str question
Hello Mark,
if you haven't solved this problem yet, I would suggest that you use another
dataset to see if the problem is file-specific or code-specific.
E.g. try a trivial case -- a dataset with only one string variable and see
if your code can get it right. Alternatively try it with a publicly
available dataset, so that the statalisters can also have a look at it.
You mentioned that you read all the data as one chunk. I would suggest
reading the data observation by observation instead, defining a record
structure based on the file header.
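A minimal Python sketch of that record-by-record approach. It assumes the type codes documented in -help dta- for this format (1 to 244 for str1 to str244, 251 byte, 252 int, 253 long, 254 float, 255 double); the names field_width and iter_records are made up for illustration.

```python
# Widths of the numeric types, assuming the type codes listed in -help dta-.
NUMERIC_WIDTH = {251: 1, 252: 2, 253: 4, 254: 4, 255: 8}

def field_width(type_code):
    """Bytes occupied by one variable inside an observation."""
    if 1 <= type_code <= 244:          # strN occupies exactly N bytes
        return type_code
    return NUMERIC_WIDTH[type_code]

def iter_records(data, typelist):
    """Yield each observation as a list of raw byte fields."""
    widths = [field_width(t) for t in typelist]
    record_size = sum(widths)
    if len(data) % record_size != 0:
        raise ValueError("data area is not a whole number of records")
    for start in range(0, len(data), record_size):
        pos = start
        fields = []
        for w in widths:
            fields.append(data[pos:pos + w])
            pos += w
        yield fields

# Data area of a one-variable str6 file with two observations.
data = b"a\x00\x00\x00\x00\x00" + b"abcdef"
for fields in iter_records(data, [6]):
    print(fields)
```

The length check is the useful part: if the data area is not a multiple of the record size, the typelist or the data offset has been misread.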
If the size of the data area is different from what you expect, check if you
handle the Hi/Lo byte order correctly when you read the header.
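As a rough illustration of the byte-order check, the Python sketch below reads the second header byte and uses it to decode nvar and nobs. The mapping 1 = HILO, 2 = LOHI is taken from -help dta-; the function name read_counts is hypothetical.

```python
import struct

def read_counts(header):
    """Return (struct prefix, nvar, nobs) from the first 10 header bytes."""
    byteorder = header[1]
    if byteorder == 1:        # HILO: most significant byte first
        prefix = ">"
    elif byteorder == 2:      # LOHI: least significant byte first
        prefix = "<"
    else:
        raise ValueError("unexpected byteorder byte: %d" % byteorder)
    # nvar is a 2-byte integer at offset 4, nobs a 4-byte integer at offset 6.
    nvar, nobs = struct.unpack_from(prefix + "hi", header, 4)
    return prefix, nvar, nobs

print(read_counts(bytes.fromhex("71020100040002000000")))
```

Decoding the same ten bytes with the wrong order would give 1024 variables instead of 4, which is the kind of mismatch that makes the data area look the wrong size.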
Another hint is this C code to read Stata files (2002) by Thomas Lumley. It
is a part of the Foreign package for R. You can download it here:
http://cran.r-project.org/src/contrib/Descriptions/foreign.html
(choose package source in gz format even if you work in windows. Windows
binary archive does not contain the source code).
Below is a UUEncoded trivial file:
begin 644 test.dta
M<0(!``$`!@````!/FP`!````+/L$`3RY4P!@^+T`D/L$`9#[!`$@/%T`````
M`/S[!`%`GX,`9/L$`07IT7>=`___B`,#`(L!``````````````$````!````
MU3$T($UA<B`R,#`W(#$P.C,T``9V87(Q`'1E````````````````````````
M````````````````)3ES````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
L``````````!A``````!A8@````!A8F,```!A8F-D``!A8F-D90!A8F-D968`
`
end
sum -r/size 45278/314
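If no uudecode tool is at hand, the block above can be turned back into bytes with Python's binascii module. This is only a sketch; the function name uudecode is made up, and it ignores the file mode and name on the begin line.

```python
import binascii

def uudecode(text):
    """Decode the lines between 'begin' and 'end' of a uuencoded block."""
    out = bytearray()
    in_body = False
    for line in text.splitlines():
        if line.startswith("begin "):
            in_body = True
            continue
        if not in_body:
            continue
        if line == "end":
            break
        if line.strip() in ("", "`"):   # zero-length line closes the body
            continue
        out += binascii.a2b_uu(line)
    return bytes(out)
```

Applied to the block above, the result should be 314 bytes long, the size reported on the sum line.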
Which looks in Stata as:
  obs:             6
 vars:             1                          14 Mar 2007 10:34
 size:            60 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            str6   %9s
-------------------------------------------------------------------------------
Sorted by:
And contains the following data:
. l,noo
  +--------+
  |   var1 |
  |--------|
  |      a |
  |     ab |
  |    abc |
  |   abcd |
  |  abcde |
  |--------|
  | abcdef |
  +--------+
If your code can correctly parse this file, then the dataset that you have
might be written in a different format.
Regards,
Sergiy
----- Original Message -----
From: "Mark Fisher" <[email protected]>
To: <[email protected]>
Sent: Tuesday, March 13, 2007 7:12 PM
Subject: Re: st: binary format type str question
Wow, thanks so much for your help. Let me say first that I don't have
access to Stata, so I can't do a -hexdump-. For reference, I'm using
http://www.stata.com/help.cgi?dta
Pretty much everything in this document (plus everything in your email)
makes sense to me. But I can't make the mapping between the typelist that
I get and what's in the data part of the file.
I've learned a bit more about the structure of the file in question.
I read the file (correctly, I think) right up to the point where the data
start. Then, in order to do some deconstruction, I simply read *all* the
remaining bytes in the file; there are only 1071 of them. Since there are
6 variables (with types 98, 136, 102, 105, 102, and 98) and 51
observations, I don't see how I can possibly account for all of them since
this only allows for 21 bytes per observation.
But a clear pattern emerges if I partition the list of bytes into a
matrix of 51 rows and 21 columns. The first column contains byte values
running consecutively from 1 to 51 --- apparently an index encoded as the
byte value itself. (How do I make a correspondence between type 98 and
this variable?) The next two columns contain two characters: state
abbreviations (such as AL, AK, AZ, ...). (Again, how do I make a
correspondence between type 136 and this variable?) The next 7 columns
(that is, columns 3 to 9) are identical row by row: {0, 1, 12, 0, 0, 0,
64}. None of the remaining columns has identical rows. (Some of the
remaining columns have zeros in them.)
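The arithmetic can be checked with a short Python sketch. The type table in it is an assumption and is not stated anywhere in this thread: it supposes the file uses an older style of type codes in which numeric types are the ASCII letters b, i, l, f, d and a string of length n is stored as 127 + n. The name old_width is hypothetical.

```python
# ASSUMPTION (not from the thread): older-style type codes, where numeric
# types are ASCII letters and a string of length n is stored as 127 + n.
OLD_NUMERIC_WIDTH = {ord("b"): 1, ord("i"): 2, ord("l"): 4,
                     ord("f"): 4, ord("d"): 8}

def old_width(type_code):
    """Width in bytes of one variable under the assumed older coding."""
    if type_code in OLD_NUMERIC_WIDTH:
        return OLD_NUMERIC_WIDTH[type_code]
    if type_code > 127:
        return type_code - 127
    raise ValueError("unrecognised type code: %d" % type_code)

typelist = [98, 136, 102, 105, 102, 98]
widths = [old_width(t) for t in typelist]
print(widths, sum(widths))

# 1071 remaining bytes spread over 51 observations.
print(1071 // 51, 1071 % 51)
```

Under that assumption the record size agrees with 1071 / 51 = 21, and the second variable would be a 9-byte string field covering columns 1 to 9: two letters, a binary 0, and then bytes that are ignored. This is one reading consistent with the numbers, not a confirmed diagnosis.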
Anyway, that's where I stand. Is it possible this dta file was created in
a nonstandard way? (All the dta files I have are from Andrew Gelman's web
site for his new "Data Analysis" book. The one I can actually read says
"Written by R." in the data_label.) Are there other dta files available on
the web that I can experiment with?
--Mark.
William Gould, Stata wrote:
Mark Fisher <[email protected]> writes,
I'm writing a Mathematica program to read stata "dta" files. [...] I have
Everything seems to work fine [...] But I can't figure out how to properly
read the data when the data types are in the range 1 to 244 (str1, str2, ...
str244). [...]
David Kantor <[email protected]> speculated "that the string types are stored
such that... they have a 0-byte terminator if they are shorter than the
maximal length of the type; they have no terminator otherwise".
That would have been my guess as to Mark's problem, too, but Mark says No.
I want to suggest Mark become familiar with Stata's -hexdump- command.
Here's an example I just did:
============================================================================
. describe

Contains data from example.dta
  obs:             2
 vars:             4                          13 Mar 2007 08:37
 size:            22 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
a               byte   %8.0g
b               str2   %9s
c               str3   %9s
d               byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:

. list

     +----------------+
     | a    b   c   d |
     |----------------|
  1. | 1    x       2 |
  2. | 3   yz   a   4 |
     +----------------+

. hexdump example.dta

                                                             character
                 |         hex representation              |  representation
         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
-----------------+-----------------------------------------+-----------------
               0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
              10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
              20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
              30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
                 |                                         |
              40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
              50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
              60 | 7220 3230 3037 2030 383a 3337 00fb 0203 | r 2007 08:37.û..
              70 | fb61 0000 0000 0000 0000 0000 0000 0000 | ûa..............
                 |                                         |
              80 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              90 | 0000 6200 0000 0000 0000 0000 0000 0000 | ..b.............
              a0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              b0 | 0000 0063 0000 0000 0000 0000 0000 0000 | ...c............
                 |                                         |
              c0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              d0 | 0000 0000 6400 0000 0000 0000 0000 0000 | ....d...........
              e0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              f0 | 0000 0000 0000 0000 0000 0000 0000 0025 | ...............%
                 |                                         |
             100 | 382e 3067 0000 0000 0000 0025 3973 0000 | 8.0g.......%9s..
             110 | 0000 0000 0000 0025 3973 0000 0000 0000 | .......%9s......
             120 | 0000 0025 382e 3067 0000 0000 0000 0000 | ...%8.0g........
             130 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
               * |                                         |
             2f0 | 0000 0000 0000 0000 0000 0000 0178 0000 | .............x..
             300 | 0065 0203 797a 6100 0004                | .e..yza...
============================================================================
Let's work our way through this while looking at -help dta-
1. Header
----------
The first 109 bytes are header. 109 base 10 = 6d base 16. Here are
bytes 0 through 6c from the dump:
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
      0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
     10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
     20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
     30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
        |                                         |
     40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
     50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
     60 | 7220 3230 3037 2030 383a 3337 00         | r 2007 08:37.û..
Mark can read this. Note that the data label and the time stamp are binary-0
terminated. For example, the time stamp is:
     50 |                            31 3320 4d61 | ..F........13 Ma
     60 | 7220 3230 3037 2030 383a 3337 00         | r 2007 08:37.û..
                                          \
                                           binary 0
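To make the binary-0 termination concrete, here is a small Python sketch that pulls the data label and the time stamp out of the 109-byte header. The field positions (81 bytes at offset 10, then 18 bytes) follow -help dta-; the function names and the latin-1 decoding are assumptions made for illustration.

```python
import struct

def cstring(raw):
    """Text up to, but not including, the first binary 0."""
    return raw.split(b"\x00", 1)[0].decode("latin-1")

def parse_labels(header):
    """Return (data_label, time_stamp) from a 109-byte header."""
    # data_label: 81 bytes at offset 10; time_stamp: the 18 bytes after it.
    label, stamp = struct.unpack_from("<81s18s", header, 10)
    return cstring(label), cstring(stamp)

# A header shaped like the one in the dump: the data label starts with a
# binary 0 and is followed by leftover bytes that must be ignored.
header = (bytes.fromhex("71020100040002000000")
          + b"\x00" + b"J" * 80
          + b"13 Mar 2007 08:37\x00")
print(parse_labels(header))
```

Everything after the first binary 0 in a field is leftover memory, which is why the dump shows non-zero bytes inside an empty data label.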
2. Descriptors
---------------
The descriptor has 5 components:
    component     length
    ------------------------
    typelist      nvar
    varlist       nvar*33
    srtlist       nvar*2 + 2
    fmtlist       nvar*12
    lbllist       nvar*33
    ------------------------
nvar = 4 in our case. The descriptor starts at byte 109, so let's fill in
the table:
                                            -- in hex --
    component     length   begin   end      begin   end
    -------------------------------------------------------------
    typelist           4     109   112         6d    70
    varlist          132     113   244         71    f4
    srtlist           10     245   254         f5    fe
    fmtlist           48     255   302         ff   12e
    lbllist          132     303   434        12f   1b2
    -------------------------------------------------------------
(by the way, I type in Stata -inbase 16 #- to convert from base 10 to
base 16. E.g., -inbase 16 109-.)
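For readers without Stata, the same table can be produced with a few lines of Python. This is a sketch built only from the lengths listed above; the function name descriptor_layout is made up.

```python
def descriptor_layout(nvar, start=109):
    """Return (name, length, begin, end) for each descriptor component."""
    sizes = [
        ("typelist", nvar),
        ("varlist", nvar * 33),
        ("srtlist", nvar * 2 + 2),
        ("fmtlist", nvar * 12),
        ("lbllist", nvar * 33),
    ]
    rows = []
    begin = start
    for name, length in sizes:
        rows.append((name, length, begin, begin + length - 1))
        begin += length
    return rows

# Decimal and hexadecimal offsets side by side, as in the table above.
for name, length, begin, end in descriptor_layout(4):
    print(f"{name:<10}{length:>6}{begin:>6}{end:>6}{begin:>6x}{end:>6x}")
```

The next section of the file starts one byte after the last end offset, which for nvar = 4 is byte 435.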
So here is the typelist:
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
     60 |                                 fb 0203 | r 2007 08:37.û..
     70 | fb                                      | ûa..............
The types are
                type
    ------------------------------
    var. 1      fb = 251 -> byte
    var. 2       2 =   2 -> str2
    var. 3       3 =   3 -> str3
    var. 4      fb = 251 -> byte
    ------------------------------
3. Variable labels
-------------------
Each variable label is 81 bytes long. Variable labels start at byte 435:
                                       -- in hex --
               length   begin   end    begin   end
    --------------------------------------------------
    var. 1         81     435   515      1b3   203
    var. 2         81     516   596      204   254
    var. 3         81     597   677      255   2a5
    var. 4         81     678   758      2a6   2f6
    --------------------------------------------------
4. Expansion fields
--------------------
The expansion field starts at byte 759 (2f7 base 16). The expansion
field contains
                                            -- in hex --
                     length   begin   end   begin   end
    -----------------------------------------------------
    datatype byte         1     759   759     2f7   2f7
    len                   4     760   763     2f8   2fb
    (and repeats)
    -----------------------------------------------------
Our dataset contains:
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
    2f0 |                  00 0000 0000           | .............x..
meaning datatype=0 and len=0, meaning there are no expansion fields.
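A reader has to step over any expansion fields to find where the data begin. The Python sketch below does that under the layout just described (a 1-byte datatype, a 4-byte len, then len bytes of contents, repeated until both are 0); the function name is hypothetical and the default prefix assumes a LOHI file.

```python
import struct

def skip_expansion_fields(buf, pos, prefix="<"):
    """Return the offset of the first data byte after the expansion fields."""
    while True:
        data_type, length = struct.unpack_from(prefix + "bi", buf, pos)
        pos += 5                      # 1-byte datatype + 4-byte len
        if data_type == 0 and length == 0:
            return pos                # terminator reached, data start here
        pos += length                 # skip this field's contents

# Same shape as the example file: terminator at byte 759, data at 764.
buf = b"\x00" * 759 + b"\x00\x00\x00\x00\x00" + b"\x01x"
print(skip_expansion_fields(buf, 759))
```

Hard-coding the data offset as 759 + 5 works for this file only because it has no expansion fields; walking the list is safer.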
5. The data (at last!)
-----------------------
The data starts at byte 764 (hex 2fc). Each record is an observation,
which in our case is 1+2+3+1 = 7 bytes long (see 2. Descriptors, above).
Thus, we have
                                       -- in hex --
               length   begin   end    begin   end
    --------------------------------------------------
    obs 1.          7     764   770      2fc   302
    obs 2.          7     771   777      303   309
    --------------------------------------------------
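The observation offsets, and the file size they imply, can be computed rather than counted by hand. A small Python sketch follows; the function names are made up, and the size check assumes nothing (such as value labels) follows the data, as in this example file.

```python
def observation_ranges(data_start, record_size, nobs):
    """Return the inclusive (begin, end) byte offsets of every observation."""
    ranges = []
    for i in range(nobs):
        begin = data_start + i * record_size
        ranges.append((begin, begin + record_size - 1))
    return ranges

def expected_file_size(data_start, record_size, nobs):
    """File size implied by the header when nothing follows the data."""
    return data_start + record_size * nobs

for begin, end in observation_ranges(764, 7, 2):
    print(begin, end, format(begin, "x"), format(end, "x"))
print(expected_file_size(764, 7, 2))
```

Comparing the expected size with the real file size is a quick way to tell whether the header was read correctly before any values are decoded.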
Observation 1 is
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
    2f0 |                               0178 0000 | .............x..
    300 | 0065 02                                 | .e..yza...
and observation 2 is
address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
    300 |        03 797a 6100 0004                | .e..yza...
Let's break apart observation 1:
            type    hex value    meaning
    ------------------------------------------------------
    var 1.  byte    01           numeric 1
    var 2.  str2    7800         string 7800 = "x" (0 terminated)
    var 3.  str3    000065       string 000065 = "" (0 terminated)
    var 4.  byte    02           numeric 2
    ------------------------------------------------------
Note that var 3 is 000065. The binary 0 is right up front, so the string
is "". The 0065 that follows is junk and ignored.
Let's break apart observation 2:
            type    hex value    meaning
    --------------------------------------------------------------------
    var 1.  byte    03           numeric 3
    var 2.  str2    797a         string 797a = "yz" (not 0 terminated)
    var 3.  str3    610000       string 610000 = "a" (0 terminated)
    var 4.  byte    04           numeric 4
    --------------------------------------------------------------------
Note that var 2 is not zero terminated. If we were storing the string in
a language that required 0 termination (say C), we would code
    memcpy(dest, bufpos, 2) ; dest[2] = '\0' ;
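The same decoding can be sketched in Python, where no terminator has to be added: a strN field is cut at the first binary 0 if there is one and used whole otherwise. The function names and the latin-1 decoding are assumptions for illustration, and only the two types that occur in this example are handled.

```python
import struct

def decode_str(field):
    """A strN field ends at the first binary 0, or uses all N bytes."""
    return field.split(b"\x00", 1)[0].decode("latin-1")

def decode_observation(record, typelist):
    """Decode one observation made of byte (251) and strN (1..244) fields."""
    values = []
    pos = 0
    for t in typelist:
        if 1 <= t <= 244:
            values.append(decode_str(record[pos:pos + t]))
            pos += t
        elif t == 251:
            values.append(struct.unpack_from("b", record, pos)[0])
            pos += 1
        else:
            raise NotImplementedError("type %d not handled in this sketch" % t)
    return values

typelist = [251, 2, 3, 251]
print(decode_observation(bytes.fromhex("01780000006502"), typelist))
print(decode_observation(bytes.fromhex("03797a61000004"), typelist))
```

Splitting on the first binary 0 covers both cases at once: the junk after an early terminator is dropped, and a full-length string with no terminator comes back unchanged.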
Conclusion
----------
I hope this helps. Mark was worried that there was something strange about
how strings appear in the .dta dataset. There is nothing strange except
for the lack of 0 termination when the string is full length, and 0
termination when less than full length.
Mark needs to -hexdump- his dataset and then include debug code in his
program.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/