Re: st: Error 612 on .dta in Stata 13.1
From
Sergiy Radyakin <[email protected]>
To
"[email protected]" <[email protected]>
Subject
Re: st: Error 612 on .dta in Stata 13.1
Date
Mon, 9 Dec 2013 19:09:18 -0500
Dear James and Bill,
thank you very much for your advice! The problem appears to be a
combination of all of the following:
1) the data file being truncated, and
2) the data file being corrupt within the remaining length, and
3) the tolerance of earlier Statas for truncated files and their
non-transparent handling of such corrupt files.
I've put a more verbose report here:
http://www.radyakin.org/statalist/statabugs/incomplete_f.htm
Ironically, I requested this behavior back in 2011. But that didn't
occur to me until a couple of hours after I posted the email.
James, I am afraid deploying Python on the server would not be an
immediate possibility for me, but I will one day try it with your
command. If you are unsure of how it is going to react, perhaps try it
with the replication script from here:
http://www.radyakin.org/statalist/statabugs/incomplete_file.htm
Best, Sergiy Radyakin
On Mon, Dec 9, 2013 at 3:07 PM, William Gould, StataCorp LP
<[email protected]> wrote:
> Sergiy Radyakin <[email protected]> reports having two old .dta
> files that Stata 11 and 12 can -use- without problem, but that StataMP
> 13.1 refuses to read, instead saying
>
> . use "datafile.dta", clear
> .dta file corrupt
> The file unexpectedly ended before it should have.
> r(612);
>
> Sergiy is looking for advice and cannot share the data files.
>
> Sergiy used -hexdump- or something on the file and reports that they
> are specification 114, meaning they are from Stata 10.
>
>
> Why can Stata 11 and 12 read the data, but not Stata 13?
> --------------------------------------------------------
>
> Stata 13 is far more demanding that .dta files match the expected
> format than any previous version of Stata. We changed the code and we
> changed the file format so that Stata could better determine when a
> problem arose.
>
> These are old files and so Stata 13 is more limited on the kinds of
> problems it can detect, but the code is still being more demanding.
>
> That is why Stata 13 cannot read the files but Stata 11 and 12 can.
>
>
> An assumption I am making
> -------------------------
>
> Sergiy can read the data using a previous version of Stata, he says. I
> am assuming that, using the OLD Stata, if Sergiy types
>
> . use <originaldataset>
>
> . save copy
>
> and then if Sergiy switches to Stata 13 and types
>
> . use copy
>
> the dataset loads without error. If that is not true, then either
> there is a bug in Stata 13 or the original dataset is corrupt, and
> just reading the corrupted dataset corrupted the OLD Stata session.
>
> At that point, Sergiy needs to talk to us, because we will want to
> determine which is the case. We can sign nondisclosure forms.
>
>
> How to determine how serious the error is
> -----------------------------------------
>
> Let's assume that using and saving the original data with the OLD Stata
> results in a dataset Stata 13 can read.
>
> Let me outline the process we would follow if Sergiy could send us the
> dataset:
>
> 1. In Stata 13, type -help dta-. Click on "114".
> Unfortunately, when I did that, I discovered a minor error in
> our help file. Further down, the file talks about "115"
> datasets even though I had clicked on 114.
>
> Do not panic. Stata 114 and 115 formats are identical. They
> differ only in that Stata 115 might contain %tb formats for
> date variables, whereas Stata 114 datasets cannot.
>
> 2. First, I want Sergiy to use -hexdump- to obtain the header.
> In Stata 13, type
>
> . set more on
> . log using <whatever>
> . hexdump <filename>.dta
> (Press -break- when screen fills up)
> . log close
>
> 3. Here is how you read the 114 and 115 formats:
>
> Byte 1: A byte contains two hexadecimal (base 16) digits.
> Thus, byte one contains two digits.
>
> Those two digits will be 0x72 or 0x73. When I write 0x in
> front of a number, I mean that the number is recorded in
> hexadecimal. What the byte actually contains -- and what
> the dump actually shows -- is "72" or "73".
>
> FYI, 0x72 = 114 and 0x73 = 115. That's how Sergiy knew the
> dataset format.
>
> Byte 2: Contains 0x01 or 0x02, meaning HILO or LOHI byte
> ordering, respectively. We are going to need the byte order
> to interpret bytes 5-6 and 7-10 later. If the byte order is
> HILO, we can read the numbers just as they are
> written. If the byte order is LOHI, we will have to
> reverse the order of pairs of digits. I will explain when
> the problem arises.
>
> Byte 3: Contains 0x01. It always contains this when the dataset
> format is 114 or 115.
>
> Byte 4: Contains 0x00. It always contains this when the dataset
> format is 114 or 115.
>
> Bytes 5-6: contain a four-digit hexadecimal number. That
> four-digit number says how many variables are in the
> dataset.
>
> Let's pretend our file contains 0x0a0b.
>
> If the byte order (byte 2) is HILO, we can translate
> directly from base 16 to base 10: We have hex number
> a0b, we type -inten 16 a0b-, and learn the dataset
> contains 2,571 variables.
>
> If the byte order is LOHI, however, we must first reverse
> the bytes. Remember, each byte contains 2 digits. Thus,
> (LOHI) 0x0a0b = (HILO) 0x0b0a. So we type -inten 16 b0a-
> and learn the dataset contains 2,826 variables.
>
> Bytes 7-10: contain an eight-digit hexadecimal number
> corresponding to the number of observations.
>
> Let's pretend our dataset contains 0x0002fa03.
>
> Just as before, we can read it from left to right if
> the byteorder is HILO. We type -inten 16 2fa03- and learn
> we have 195,075 observations.
>
> If numbers are stored in LOHI format, we must reverse
> the bytes (pairs of digits); (LOHI) 0002fa03 = (HILO) 03fa0200.
> We type -inten 16 3fa0200- and learn our dataset contains
> 66,716,160 observations.
>
> Okay, now we know the number of variables and number of observations the
> dataset SHOULD contain.
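
For anyone who would rather not decode the hexdump by hand, the header layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `read_dta_header` is hypothetical, the two counts are read as unsigned integers, and only the first 10 bytes of a format 114/115 file are interpreted. The example bytes are the pretend values from the walkthrough, not real data.

```python
import struct

# Byte 2 of the header: 0x01 = HILO (big-endian), 0x02 = LOHI (little-endian).
BYTE_ORDERS = {0x01: (">", "HILO"), 0x02: ("<", "LOHI")}


def read_dta_header(first_bytes):
    """Decode bytes 1-10 of a format 114/115 .dta file (hypothetical helper)."""
    if len(first_bytes) < 10:
        raise ValueError("need at least the first 10 bytes of the file")
    version = first_bytes[0]  # 0x72 = 114, 0x73 = 115
    if version not in (0x72, 0x73):
        raise ValueError("first byte is 0x%02x, not format 114/115" % version)
    if first_bytes[1] not in BYTE_ORDERS:
        raise ValueError("unknown byte-order code 0x%02x" % first_bytes[1])
    prefix, order_name = BYTE_ORDERS[first_bytes[1]]
    # Bytes 5-6: number of variables; bytes 7-10: number of observations.
    # Assumption: both are treated as unsigned here.
    nvar, nobs = struct.unpack(prefix + "HI", first_bytes[4:10])
    return {"format": version, "byteorder": order_name,
            "nvar": nvar, "nobs": nobs}


# The pretend header from the walkthrough, stored HILO.
hilo = bytes([0x72, 0x01, 0x01, 0x00, 0x0a, 0x0b, 0x00, 0x02, 0xfa, 0x03])
print(read_dta_header(hilo))
# {'format': 114, 'byteorder': 'HILO', 'nvar': 2571, 'nobs': 195075}
```

To apply it to a real file, one would pass the result of `open(path, "rb").read(10)`, where `path` stands for the dataset in question. The same ten bytes with byte 2 set to 0x02 decode to 2,826 variables and 66,716,160 observations, matching the LOHI arithmetic above.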
>
> Sergiy was able to read the dataset with a previous version of Stata.
>
> How many observations does the old Stata report? It needs to match
> or the dataset is corrupted.
>
> Now, look at the last observation. Type,
>
> . list in l
>
> In theory, it makes no difference whether Sergiy does this with an OLD
> Stata or Stata 13. If I were Sergiy, I'd do it both ways just for my
> own peace of mind.
>
> Anyway, look at the last observation. Look especially at the end
> variables. Do they look correct? If they look correct, they probably
> are correct. Corrupt data usually looks corrupt because values will be
> out of range. A person's age won't randomly change from 48 to a number
> within the reasonable range for ages; it is more likely to randomly
> change to a number outside of that range because there are so many more
> of them.
>
> I'd probably trust the data if the last observation looked good.
>
>
> More to do
> ----------
>
> After the data, the next and last thing recorded in the 114 and 115 format
> datasets are the value labels.
>
> If the file was shortened, it is likely that not all value labels that
> should be defined are defined, and possibly the last value label does not
> have all the labels defined that it should.
>
> Here at StataCorp, we would do the following:
>
> . set more off
> . log using fulllog
> . hexdump <originalfile>.dta
> . log close
>
> and we would look at the end of the log.
>
> I am also wondering whether the file was not shortened, but
> accidentally lengthened, say by a mailer adding linefeed or carriage
> return and linefeed to the end of the file. Linefeed is 0x0a and
> carriage return 0x0d.
>
> Does the file end in 0x0d0a or in 0x0a?
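
The question about the last bytes can also be answered without paging through a full hexdump log. Below is a minimal Python sketch; the helper names `read_tail` and `classify_tail` are hypothetical, and the result is only a hint, because a file may legitimately end in a byte that happens to be 0x0a.

```python
import os


def read_tail(path, n=16):
    """Return the last n bytes of the file (fewer if the file is shorter)."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(size - n, 0))
        return f.read()


def classify_tail(tail):
    """Report whether the bytes end in CR+LF (0x0d0a), LF (0x0a), or neither."""
    if tail.endswith(b"\r\n"):
        return "CRLF"
    if tail.endswith(b"\n"):
        return "LF"
    return "other"


print(classify_tail(b"\x00\x00\x0d\x0a"))  # CRLF
```

On a real dataset one would call `classify_tail(read_tail(path))`, where `path` stands for the file in question, and print `read_tail(path).hex()` to inspect the final bytes directly.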
>
> I hope this helps.
>
> -- Bill
> [email protected]
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/