Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Problem with infix: record too long
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Problem with infix: record too long
Date: Tue, 26 Apr 2011 01:11:32 +0100
Fixed format or not, I can't see a way for Stata to make sense out of that.
It's not uncommon for datafiles to start with some kind of preamble.
But this seems to start with some data. Also, the end looks quite
unlike the beginning, as might be guessed from the -hexdump- report.
Unless you can give more information on what should be inside --
you've not said, but you should know -- or someone recognises this
stuff, I think you need to ask those people what kind of beast they
sent.
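
For anyone who wants to repeat this kind of check outside Stata, here is a minimal Python sketch. It reads the first and last few bytes of a file in binary mode and tallies line-end styles and control bytes, roughly in the spirit of what -hexdump, analyze- reports. The function names tally_bytes and peek are made up for illustration, and the categories only approximate Stata's own; treat it as a rough diagnostic, not a reimplementation.

```python
import os
import re


def tally_bytes(data):
    """Count line-end styles and suspicious bytes in a chunk of raw bytes."""
    crlf = data.count(b"\r\n")
    # A lone \r is one not followed by \n; a lone \n is one not preceded by \r.
    lone_cr = len(re.findall(rb"\r(?!\n)", data))
    lone_lf = len(re.findall(rb"(?<!\r)\n", data))
    nul = data.count(b"\x00")
    # Control bytes 1-31 other than tab (9), \n (10) and \r (13).
    other_ctl = sum(1 for b in data if 0 < b < 32 and b not in (9, 10, 13))
    return {
        "crlf": crlf,
        "lone_cr": lone_cr,
        "lone_lf": lone_lf,
        "nul": nul,
        "other_ctl": other_ctl,
    }


def peek(path, nbytes=200):
    """Return the first and last nbytes of a file, read in binary mode."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        head = fh.read(nbytes)
        fh.seek(max(size - nbytes, 0))
        tail = fh.read(nbytes)
    return head, tail


if __name__ == "__main__":
    # Small made-up sample, not taken from the real file.
    sample = b"abc\r\ndef\rghi\njk\x00l\x01"
    print(tally_bytes(sample))
```

On the real file one might call peek("TS_QUEST_ALUNO.txt") and pass each piece to tally_bytes. If the head is nearly all printable text while the tail is full of NUL and control bytes, that points to text records followed by something else, or to a damaged or compressed file, and it is a concrete observation to take back to the data provider.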
2011/4/26 Barbara Guimarães <[email protected]>:
> Nick, thanks for your response.
>
> Using the type filename.txt as you suggested, Stata showed me the
> following first lines:
>
> type TS_QUEST_ALUNO.txt
> 1373262421RN24GROSSOS
> 2404408ADCDBAAACABDCABCEAAAAAAAAAAAAAAC*CBAABAAAAAA
> 1373263421RN24GROSSOS
> 2404408BDKEAAAABABDCADBDACAAAB.AAAAAAAABBBBACAAAAAA
> 1373264421RN24GROSSOS
> 2404408BAAACBAAB..DCADCDAAAAAABBAAAAAAAAABCAAAA.A..
>
> and the output then ended as:
>
>> ......................................................................................................................................................................................................
>> ............................................................................................................................................................c4 ......?.:Z3.
> .R...x.9..........T.Np(0$%'...@#../q..'!m.t.F2$*J
>
> It looks to me like a fixed format, but I might be wrong.
>
> regards,
> Barbara
>
> 2011/4/24 Nick Cox <[email protected]>:
>> Your last question is, in effect, can I explain to you how to read a
>> binary file with unspecified structure into Stata, and the short
>> answer is sorry, no.
>>
>> It's a rare word processor that can open large binary files with
>> success. Word processors accept a range of formats for documents,
>> tending to prefer their own proprietary format, but are usually
>> useless at reading binary data files. A good text editor could do it;
>> that does not include the proprietary editors bundled with MS Windows.
>>
>> I wonder if you are being misled by the first line in the help for
>> -infix- below, while overlooking the second line, which is vital.
>>
>> "infix reads into memory from a disk dataset that is not in Stata
>> format. infix requires
>> that the data be in fixed-column format."
>>
>> As you reported, Stata is seeing far fewer end-of-line character pairs
>> \r\n than lines in this file; \r and \n characters are also occurring by
>> themselves, which is not standard for text files in MS Windows; and
>> -hexdump- is labelling this binary. It's unlikely to be wrong on
>> that.
>>
>> You could try just
>>
>> . type filename.txt
>>
>> in Stata and that might show you, and us, the first few lines of the
>> file. They might be recognisable to someone as in a particular format.
>>
>> I think if you can't get an idea of what the structure of this file
>> is, then you have no way to read it into Stata. Why a "government
>> organisation" is providing a binary file and calling it .txt I cannot
>> explain. You may need to talk to them.
>>
>> Nick
>>
>> 2011/4/24 Barbara Guimarães <[email protected]>:
>>> Dear Nick, unfortunately, I am not able to open the file with any
>>> word processor (I believe that is because of its size). This
>>> dataset was provided by a government organization, so I
>>> received it already in .txt format and don't have access to the primary data.
>>>
>>>
>>> However, the output of -hexdump, analyze- was:
>>>
>>>
>>> . hexdump TS_QUEST_ALUNO.txt, analyze
>>>
>>>   Line-end characters                      Line length (tab=1)
>>>     \r\n         (Windows)    2,517,361      minimum                   0
>>>     \r by itself (Mac)          686,626      maximum          20,971,542
>>>     \n by itself (Unix)         768,441
>>>   Space/separator characters               Number of lines     3,972,429
>>>     [blank]                 112,067,613      EOL at EOF?              no
>>>     [tab]                       707,187
>>>     [comma] (,)                 765,547    Length of first 5 lines
>>>   Control characters                         Line 1                  120
>>>     binary 0                 30,611,037      Line 2                  120
>>>     CTL excl. \r, \n, \t     19,330,367      Line 3                  120
>>>     DEL                         367,820      Line 4                  120
>>>     Extended (128-159,255)   21,370,596      Line 5                  120
>>>   ASCII printable
>>>     A-Z                     149,642,323
>>>     a-z                      16,234,081    File format            BINARY
>>>     0-9                      53,967,247
>>>     Special (!@#$ etc.)      28,963,365
>>>     Extended (160-254)       54,882,559
>>>                             ---------------
>>>   Total                     495,399,531
>>>
>>>   Observed were:
>>>     \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
>>>     ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5
>>>     6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
>>>     Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | }
>>>     ~ DEL 128 E^A E^B E^C E^D E^E E^F E^G E^H E^I E^J E^K E^L E^M E^N E^O
>>>     E^P E^Q E^R E^S E^T E^U E^V E^W E^X E^Y E^Z 155 156 157 158 159 160 ¡ ¢
>>>     £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ
>>>     Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê
>>>     ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ 255
>>>
>>> Is there any way I could transform this dataset in a way Stata would
>>> read it entirely?
>>>