Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Problem with infix: record too long
From
Barbara Guimarães <[email protected]>
To
[email protected]
Subject
Re: st: Problem with infix: record too long
Date
Mon, 25 Apr 2011 20:49:53 -0300
Nick, thanks for your response.
Using -type filename.txt- as you suggested, Stata showed me the
following first lines:
. type TS_QUEST_ALUNO.txt
1373262421RN24GROSSOS
2404408ADCDBAAACABDCABCEAAAAAAAAAAAAAAC*CBAABAAAAAA
1373263421RN24GROSSOS
2404408BDKEAAAABABDCADBDACAAAB.AAAAAAAABBBBACAAAAAA
1373264421RN24GROSSOS
2404408BAAACBAAB..DCADCDAAAAAABBAAAAAAAAABCAAAA.A..
and the output then ended as:
> ......................................................................................................................................................................................................
> ............................................................................................................................................................c4 ......?.:Z3.
.R...x.9..........T.Np(0$%'...@#../q..'!m.t.F2$*J
To me, the first lines look like a fixed format. But I might be wrong.
regards,
Barbara
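
The -type- listing above starts as readable records and then turns into control characters, so one practical question is where the readable part stops. The following is a minimal Python sketch, outside Stata, of one way to find that point. The names `first_non_text_offset` and `TEXT_BYTES` and the chunk size are illustrative assumptions, not anything from the thread. In particular, treating only printable ASCII, tab, CR and LF as text is a guess that would misclassify accented Latin-1 characters.

```python
# Bytes treated as "text-like": printable ASCII plus tab, LF and CR.
# Assumption: the readable records contain nothing outside this set.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def first_non_text_offset(path, chunk_size=1 << 20):
    """Return the 0-based offset of the first byte outside TEXT_BYTES.

    Returns None if every byte in the file is text-like. The file is read
    in chunks so that a very large file is never loaded whole.
    """
    offset = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return None  # reached end of file without a non-text byte
            for i, byte in enumerate(chunk):
                if byte not in TEXT_BYTES:
                    return offset + i
            offset += len(chunk)
```

Called as `first_non_text_offset("TS_QUEST_ALUNO.txt")`, it would give a byte offset that could be set against the line lengths reported by -hexdump- to estimate how much of the file is plain records before the binary content begins.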
2011/4/24 Nick Cox <[email protected]>:
> Your last question is, in effect, can I explain to you how to read a
> binary file with unspecified structure into Stata, and the short
> answer is sorry, no.
>
> It's a rare word processor that can open large binary files with
> success. Word processors accept a range of formats for documents,
> tending to prefer their own proprietary format, but are usually
> useless at reading binary data files. A good text editor could do it;
> that does not include the proprietary editors bundled with MS Windows.
>
> I wonder if you are being misled by the first line in the help for
> -infix- below, while overlooking the second line, which is vital.
>
> "infix reads into memory from a disk dataset that is not in Stata
> format. infix requires
> that the data be in fixed-column format."
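
To make "fixed-column format" concrete: every field occupies the same character positions in every record, so a record is split by position rather than by a delimiter. The sketch below is in Python, outside Stata, and uses one record fragment from this thread. The field names and column positions in `LAYOUT` are purely hypothetical, since the real layout of the file is not known here.

```python
# Hypothetical layout: field name -> (first column, last column),
# 1-based and inclusive. These names and positions are illustrative
# guesses, not the documented layout of the file.
LAYOUT = {
    "record_id": (1, 10),
    "state": (11, 12),
    "state_code": (13, 14),
    "place": (15, 21),
}


def parse_fixed(line, layout=LAYOUT):
    """Split one fixed-column record into a dict of stripped strings."""
    return {
        name: line[first - 1:last].strip()
        for name, (first, last) in layout.items()
    }


record = parse_fixed("1373262421RN24GROSSOS")
```

A file qualifies as fixed-column only if one such layout fits every record, which is why stray binary content or inconsistent line ends defeat this kind of reading.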
>
> As you reported, Stata is seeing far fewer end-of-line character pairs
> \r\n than lines in this file; \r and \n characters are occurring by
> themselves, which is not standard for text files in MS Windows; and
> -hexdump- is labelling this binary. It's unlikely to be wrong on
> that.
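
The line-end tally discussed here can be reproduced roughly outside Stata. The following Python sketch counts \r\n pairs and lone \r and \n bytes in a block of data. It is written in the spirit of the line-end section of the -hexdump, analyze- report, not as a reimplementation of it, and the function names and the default sample size are assumptions.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs and lone CR and LF bytes in data."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        # Every CRLF pair uses up one CR and one LF; the rest stand alone.
        "cr_only": data.count(b"\r") - crlf,
        "lf_only": data.count(b"\n") - crlf,
    }


def count_in_file_head(path, sample_bytes=1 << 20):
    """Apply count_line_endings to the first sample_bytes of a file.

    If the sample happens to end between a CR and its LF, that CR is
    counted as a lone CR; for a rough diagnosis this is acceptable.
    """
    with open(path, "rb") as handle:
        return count_line_endings(handle.read(sample_bytes))
```

In a well-formed Windows text file, `cr_only` and `lf_only` would both be expected to be zero, so sizeable counts in either suggest content that is not plain text.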
>
> You could try just
>
> . type filename.txt
>
> in Stata and that might show you, and us, the first few lines of the
> file. They might be recognisable to someone as in a particular format.
>
> I think if you can't get an idea of what the structure of this file
> is, then you have no way to read it into Stata. Why a "government
> organisation" is providing a binary file and calling it .txt I cannot
> explain. You may need to talk to them.
>
> Nick
>
> 2011/4/24 Barbara Guimarães <[email protected]>:
>> Dear Nick, unfortunately, I am not able to open the file with any
>> word processor (I believe that is because of its size). This
>> dataset was provided by a government organization, so I
>> received it already in .txt format and don't have access to the primary data.
>>
>> However, the output of -hexdump, analyze- was:
>>
>> . hexdump TS_QUEST_ALUNO.txt, analyze
>>
>>   Line-end characters                        Line length (tab=1)
>>     \r\n         (Windows)       2,517,361     minimum                        0
>>     \r by itself (Mac)             686,626     maximum               20,971,542
>>     \n by itself (Unix)            768,441
>>   Space/separator characters                 Number of lines          3,972,429
>>     [blank]                    112,067,613     EOL at EOF?                   no
>>     [tab]                          707,187
>>     [comma] (,)                    765,547   Length of first 5 lines
>>   Control characters                           Line 1                       120
>>     binary 0                    30,611,037     Line 2                       120
>>     CTL excl. \r, \n, \t        19,330,367     Line 3                       120
>>     DEL                            367,820     Line 4                       120
>>     Extended (128-159,255)      21,370,596     Line 5                       120
>>   ASCII printable
>>     A-Z                        149,642,323
>>     a-z                         16,234,081   File format                 BINARY
>>     0-9                         53,967,247
>>     Special (!@#$ etc.)         28,963,365
>>     Extended (160-254)          54,882,559
>>                             ---------------
>>   Total                        495,399,531
>>
>> Observed were:
>>
>> \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
>> ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5
>> 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
>> Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | }
>> ~ DEL 128 E^A E^B E^C E^D E^E E^F E^G E^H E^I E^J E^K E^L E^M E^N E^O
>> E^P E^Q E^R E^S E^T E^U E^V E^W E^X E^Y E^Z 155 156 157 158 159 160 ¡ ¢
>> £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ
>> Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê
>> ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ 255
>>
>> Is there any way I could transform this dataset so that Stata would
>> read it in its entirety?
>>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/