Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "William Gould, StataCorp LP" <wgould@stata.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Error 612 on .dta in Stata 13.1 |
Date | Mon, 09 Dec 2013 14:07:01 -0600 |
Sergiy Radyakin <serjradyakin@gmail.com> reports having two old .dta
files that Stata 11 and 12 can -use- without problem, but that StataMP
13.1 refuses to read, instead saying

    . use "datafile.dta", clear
    .dta file corrupt
        The file unexpectedly ended before it should have.
    r(612);

Sergiy is looking for advice and cannot share the data files.

Sergiy used -hexdump- or something similar on the files and reports that
they are specification 114, meaning they are from Stata 10.


Why can Stata 11 and 12 read the data, but not Stata 13?
--------------------------------------------------------

Stata 13 is far more demanding that .dta files match the expected
format than any previous version of Stata. We changed the code and we
changed the file format so that Stata could better determine when a
problem arose. These are old-format files, so Stata 13 is more limited
in the kinds of problems it can detect in them, but the code is still
more demanding. That is why Stata 13 cannot read the files but
Stata 11 and 12 can.


An assumption I am making
-------------------------

Sergiy can read the data using a previous version of Stata, he says.
I am assuming that, using the OLD Stata, if Sergiy types

    . use <originaldataset>
    . save copy

and then if Sergiy switches to Stata 13 and types

    . use copy

the dataset loads without error.

If that is not true, then either there is a bug in Stata 13 or the
original dataset is corrupt, and just reading the corrupted dataset
corrupted the OLD Stata session. At that point, Sergiy needs to talk
to us, because we will want to determine which is the case. We can
sign nondisclosure forms.


How to determine how serious the error is
-----------------------------------------

Let's assume that using and saving the original data with the OLD
Stata results in a dataset Stata 13 can read.

Let me outline the process we would follow if Sergiy could send us
the dataset:

1.  In Stata 13, type -help dta-. Click on "114".
    Unfortunately, when I did that, I discovered a minor error in our
    help file. Further down, the file talks about "115" datasets
    even though I had clicked on 114. Do not panic. Stata 114 and
    115 formats are identical. They differ only in that Stata 115
    datasets might contain %tb formats for date variables, whereas
    Stata 114 datasets cannot.

2.  First, I want Sergiy to use -hexdump- to obtain the header.
    In Stata 13, type

        . set more on
        . log using <whatever>
        . hexdump <filename>.dta
        (Press -break- when screen fills up)
        . log close

3.  Here is how you read the 114 and 115 formats:

    Byte 1:
        A byte contains two hexadecimal (base 16) digits. Thus,
        byte one contains two digits. Those two digits will be
        0x72 or 0x73. When I write 0x in front of a number, I mean
        that the number is recorded in hexadecimal. What the byte
        actually contains -- and what the dump actually shows -- is
        "72" or "73".

        FYI, 0x72 = 114 and 0x73 = 115. That's how Sergiy knew the
        dataset format.

    Byte 2:
        Contains 0x01 or 0x02, meaning HILO or LOHI byte ordering,
        respectively. We are going to need the byte order to
        interpret bytes 5-6 and 7-10 later.

        If the byte order is HILO, we can read the numbers just as
        they are written.

        If the byte order is LOHI, we will have to reverse the order
        of pairs of digits. I will explain when the problem arises.

    Byte 3:
        Contains 0x01. It always contains this when the dataset
        format is 114 or 115.

    Byte 4:
        Contains 0x00. It always contains this when the dataset
        format is 114 or 115.

    Bytes 5-6:
        Contain a four-digit hexadecimal number. That four-digit
        number says how many variables are in the dataset. Let's
        pretend our file contains 0x0a0b.

        If the byte order (byte 2) is HILO, we can translate directly
        from base 16 to base 10: We have hex number a0b, we type
        -inten 16 a0b-, and learn the dataset contains 2,571
        variables.

        If the byte order is LOHI, however, we must first reverse the
        bytes. Remember, each byte contains 2 digits. Thus,
        (LOHI) 0x0a0b = (HILO) 0x0b0a.
        So we type -inten 16 b0a- and learn the dataset contains
        2,826 variables.

    Bytes 7-10:
        Contain an eight-digit hexadecimal number corresponding to
        the number of observations. Let's pretend our dataset
        contains 0x0002fa03.

        Just as before, we can read it from left to right if the
        byte order is HILO. We type -inten 16 2fa03- and learn we
        have 195,075 observations.

        If numbers are stored in LOHI format, we must reverse the
        bytes (pairs of digits): (LOHI) 0002fa03 = (HILO) 03fa0200.
        We type -inten 16 3fa0200- and learn our dataset contains
        66,716,160 observations.

Okay, now we know the number of variables and number of observations
the dataset SHOULD contain.

Sergiy was able to read the dataset with a previous version of Stata.
How many variables and observations does the old Stata report? The
counts need to match the header or the dataset is corrupted.

Now, look at the last observation. Type

    . list in l

In theory, it makes no difference whether Sergiy does this with an
OLD Stata or Stata 13. If I were Sergiy, I'd do it both ways just for
my own peace of mind.

Anyway, look at the last observation. Look especially at the
variables at the end. Do they look correct? If they look correct,
they probably are correct. Corrupt data usually looks corrupt because
values will be out of range. A person's age won't randomly change
from 48 to another number within the reasonable range for ages; it is
more likely to randomly change to a number outside of that range,
because there are so many more numbers outside the range than inside
it.

I'd probably trust the data if the last observation looked good.


More to do
----------

After the data, the next and last thing recorded in 114- and
115-format datasets are the value labels. If the file was shortened,
it is likely that not all value labels that should be defined are
defined, and possibly the last value label does not have all the
labels defined that it should.

Here at StataCorp, we would do the following:

    . set more off
    . log using fulllog
    . hexdump <originalfile>.dta
    . log close

and we would look at the end of the log.
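For readers who would rather not decode the dump by hand, here is a minimal sketch, in Python rather than Stata, of the header reading described above, plus a helper for looking at the last few bytes of the file. The function names read_dta_header and tail_bytes are made up for this illustration and are not part of Stata or any library. The sketch assumes only the byte layout given above for formats 114 and 115, and it treats the observation count as an unsigned 4-byte integer, which is an assumption.

```python
import os
import struct


def read_dta_header(path):
    """Decode the first 10 bytes of a format-114/115 .dta file.

    Layout assumed (from the description above):
      byte 1     release (0x72 = 114, 0x73 = 115)
      byte 2     byte order (0x01 = HILO, 0x02 = LOHI)
      byte 3     0x01
      byte 4     0x00
      bytes 5-6  number of variables
      bytes 7-10 number of observations
    """
    with open(path, "rb") as f:
        head = f.read(10)
    if len(head) < 10:
        raise ValueError("file is shorter than the 10 header bytes")

    release, byteorder = head[0], head[1]
    if byteorder == 0x01:
        prefix, order_name = ">", "HILO"   # most significant byte first
    elif byteorder == 0x02:
        prefix, order_name = "<", "LOHI"   # least significant byte first
    else:
        raise ValueError("unexpected byte-order code: 0x%02x" % byteorder)

    # H = 2-byte unsigned, I = 4-byte unsigned (unsigned is an assumption)
    nvar, nobs = struct.unpack(prefix + "HI", head[4:10])
    return {
        "release": release,
        "byteorder": order_name,
        "byte3": head[2],
        "byte4": head[3],
        "nvar": nvar,
        "nobs": nobs,
    }


def tail_bytes(path, n=4):
    """Return the last n bytes of the file (fewer if the file is shorter)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(size - n, 0))
        return f.read()
```

Calling read_dta_header("datafile.dta") returns the release, byte order, and the two counts to compare against what the OLD Stata reports; tail_bytes("datafile.dta") shows the bytes at the very end of the file.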
I am also wondering whether the file was not shortened, but
accidentally lengthened, say by a mailer adding linefeed or carriage
return and linefeed to the end of the file. Linefeed is 0x0a and
carriage return 0x0d. Does the file end in 0x0d0a or in 0x0a?

I hope this helps.

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/