Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
RE: st: insheet and dropping cases
From
"Ben Hoen" <[email protected]>
To
<[email protected]>
Subject
RE: st: insheet and dropping cases
Date
Thu, 20 Feb 2014 15:48:44 -0500
I found the workaround for changing the double quote to single:
filefilter IL.txt IL2.txt, from(\Q) to(\RQ) replace
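For anyone repeating this, here is a minimal sketch of the whole sequence. The file names are the ones used above. The "|" delimiter is an assumption, based on the 72,436 pipe characters in the hexdump tabulation quoted further down, and should be changed if the file is separated differently.

```stata
* Replace every double quote (\Q) with a right single quote (\RQ)
filefilter IL.txt IL2.txt, from(\Q) to(\RQ) replace

* filefilter leaves the number of replacements in r(occurrences);
* the hexdump tabulation showed 5 double quotes in the file
display r(occurrences)

* Re-read the filtered file; delimiter("|") is an assumption
insheet using IL2.txt, delimiter("|") clear

* Compare against the number of records expected in the file
count
```

If the count still falls short of the number of lines in the file, the quotes were probably not the only problem.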
Thank you all for helping me through this frustrating problem today.
As always, I really do not know what I would do without the brilliance and
helpfulness of this online community.
Cheers,
Ben
Ben Hoen
LBNL
Office: 845-758-1896
Cell: 718-812-7589
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Sergiy Radyakin
Sent: Thursday, February 20, 2014 3:21 PM
To: [email protected]
Subject: Re: st: insheet and dropping cases
Hello Ben,
The report is helpful, and it is safe to post because it is Stata's output,
which doesn't contain anything unprintable. Note how Stata writes the
escape sequences \n and \r for unprintable characters 10 and 13. For a
description of the unprintable ASCII characters and their role in text
control, see e.g. the following page:
http://www.juniper.net/techpubs/en_US/idp5.1/topics/reference/general/intrusion-detection-prevention-custom-attack-object-extended-ascii.html
or google them (plenty of links); most of them are archaic.
We focus on the 0-31 range. You only have 10 and 13, which form the
typical end-of-line pattern \r\n. There are no gremlins to zap, so to
speak. Also, \r and \n have the same frequency (364 each), which suggests
that they are properly paired at the end of each line.
There is also nothing in the upper page (non-ASCII) characters 128-255.
To be sure that the report itself is correct, verify that the total
file length as reported by OS is the sum of frequencies of all
characters (394,625).
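These checks can also be read from hexdump's stored results rather than from the printed table. What follows is a minimal sketch, assuming the file is IL.txt (the name Ben uses elsewhere in this thread) and that the r() names match those listed under stored results in -help hexdump- for your version of Stata.

```stata
hexdump IL.txt, analyze

* Line-end counts: properly paired \r\n versus stray \r or \n
display "CR+LF pairs: " r(Windows)
display "lone CR:     " r(Mac)
display "lone LF:     " r(Unix)

* Size as counted by hexdump; compare with the size your OS reports
display "file size:   " r(filesize)
```

If every line end is a \r\n pair, the two "lone" counts should both be zero.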
I note the use of single and double quotes to denote minutes and seconds
in the coordinates. Perhaps this confuses Stata. In some records you
posted I see "MIN" as a word; in others it is ". When it sees a double
quote, even in a delimiter-separated file, Stata reads ahead to the
closing quote, which could be a long way from where the quote was
opened. If you expect double quotes to denote seconds and single quotes
minutes, run filefilter on them in advance, and retry.
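One way to do that is to turn the quote marks into words, in two passes. This is only a sketch: the intermediate file names, the SEC and MIN tokens, the mapping of each quote mark to a unit, and the "|" delimiter are all assumptions to check against the data.

```stata
* Pass 1: double quote (\Q) -> the word SEC
filefilter IL.txt IL_step1.txt, from(\Q) to(SEC) replace

* Pass 2: right single quote (\RQ) -> the word MIN
* Caution: this also rewrites apostrophes inside names
filefilter IL_step1.txt IL_clean.txt, from(\RQ) to(MIN) replace

* Re-read the cleaned file; delimiter("|") is an assumption
insheet using IL_clean.txt, delimiter("|") clear
```

Because the second pass touches every apostrophe in the file, it may be safer to run only the first pass if single quotes turn out not to trouble Stata.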
Hope this helps, Sergiy Radyakin
On Thu, Feb 20, 2014 at 2:19 PM, Ben Hoen <[email protected]> wrote:
> Hi Sergiy,
>
> I am pasting in the tabulate from hexdump (not knowing how to provide a link
> to those files as you suggest):
>
> Tabulation (character not listed if unobserved):
> Dec Hex Char Frequency
> ------------------------------
> 010 0a \n 364
> 013 0d \r 364
> 032 20 blank 9,621
> 033 21 ! 9
> 034 22 " 5
> 035 23 # 21
> 038 26 & 202
> 039 27 ' 135
> 040 28 ( 30
> 041 29 ) 29
> 042 2a * 4
> 043 2b + 7
> 044 2c , 112
> 045 2d - 3,378
> 046 2e . 282
> 047 2f / 337
> 048 30 0 157,131
> 049 31 1 18,056
> 050 32 2 13,187
> 051 33 3 8,837
> 052 34 4 8,087
> 053 35 5 6,803
> 054 36 6 7,456
> 055 37 7 6,283
> 056 38 8 6,322
> 057 39 9 6,333
> 058 3a : 98
> 059 3b ; 24
> 064 40 @ 1
> 065 41 A 5,418
> 066 42 B 1,197
> 067 43 C 3,167
> 068 44 D 2,399
> 069 45 E 5,718
> 070 46 F 1,067
> 071 47 G 1,612
> 072 48 H 1,597
> 073 49 I 4,112
> 074 4a J 300
> 075 4b K 873
> 076 4c L 4,877
> 077 4d M 1,693
> 078 4e N 4,099
> 079 4f O 4,254
> 080 50 P 1,343
> 081 51 Q 149
> 082 52 R 4,634
> 083 53 S 3,272
> 084 54 T 3,756
> 085 55 U 1,162
> 086 56 V 865
> 087 57 W 1,488
> 088 58 X 151
> 089 59 Y 1,369
> 090 5a Z 67
> 095 5f _ 726
> 097 61 a 817
> 098 62 b 73
> 099 63 c 147
> 100 64 d 323
> 101 65 e 887
> 102 66 f 65
> 103 67 g 199
> 104 68 h 189
> 105 69 i 498
> 107 6b k 233
> 108 6c l 419
> 109 6d m 111
> 110 6e n 616
> 111 6f o 872
> 112 70 p 107
> 113 71 q 4
> 114 72 r 581
> 115 73 s 252
> 116 74 t 390
> 117 75 u 132
> 118 76 v 172
> 119 77 w 74
> 120 78 x 13
> 121 79 y 90
> 122 7a z 7
> 124 7c | 72,436
> 125 7d } 35
> ------------------------------
> Total 394,625
>
> It is not clear to me what the problem characters are - unprintable/special or
> not - but I tried replacing the "}" character (and the comma previously) to
> no avail.
>
> Separately I think I isolated the fields that contain the problems. Is
> there a way to ignore/remove individual fields in a txt file from within
> Stata?
>
> Thank you for your efforts in helping me with this issue.
>
> Ben
>
> Ben Hoen
> LBNL
> Office: 845-758-1896
> Cell: 718-812-7589
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Sergiy Radyakin
> Sent: Thursday, February 20, 2014 1:28 PM
> To: [email protected]
> Subject: Re: st: insheet and dropping cases
>
> Ben,
>
> -- the problem is likely caused by presence of unprintable characters
> in the file, that are tolerated by StatTransfer, but not by Stata;
>
> -- character with ASCII code 255 is a usual suspect;
>
> -- pasting raw data to statalist is likely not to reveal the problem,
> since the special characters might not survive massaging through emails;
>
> -- isolating the problem in the text editor into a new file could help
> (keep the last record read in correctly and the one immediately after);
> then make the file available through a link, to retain its binary
> structure (not all text editors will retain special chars on save);
>
> -- use hexdump "file" , analyze tabulate to see unprintable
> characters, then search for them in the file or use filefilter;
>
> -- see "zap gremlins" for relevant tactic.
>
> On the bright side: you are lucky you have 363 cases. Last time I had
> this problem, only 16gb out of 40gb were read in. Try to open that
> file in the notepad :)
>
> Hope this helps.
>
> Best, Sergiy Radyakin
>
>
> On Thu, Feb 20, 2014 at 12:34 PM, Radwin, David <[email protected]> wrote:
>> One other possibility is to use -inputst-, a Stata program that calls
>> Stat/Transfer (part of -stcmd- by Roger Newson and available at SSC).
>>
>> This workaround is probably less computationally efficient than the
>> suggestions from others, but since you already know that Stat/Transfer
>> works, this approach might be faster and easier than trying to figure out
>> the problem with your text files and -insheet- or -import delimited-.
>>
>> David
>> --
>> David Radwin, Senior Research Associate
>> Education and Workforce Development
>> RTI International
>> 2150 Shattuck Ave. Suite 800, Berkeley, CA 94704
>> Phone: 510-665-8274
>>
>> www.rti.org/education
>>
>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:owner-
>>> [email protected]] On Behalf Of Phil Schumm
>>> Sent: Thursday, February 20, 2014 6:38 AM
>>> To: Statalist Statalist
>>> Subject: Re: st: insheet and dropping cases
>>>
>>> On Feb 20, 2014, at 8:28 AM, Ben Hoen <[email protected]> wrote:
>>> > Hexdump I had never used. This is what it returned:
>>>
>>> <snip>
>>>
>>> > Do you see anything suspicious here? (I replaced all the commas with
>>> > "_", using filefilter - another great suggestion - wondering if that was
>>> > causing any issues and insheet still returned 184 observations.)
>>>
>>>
>>> I don't see anything obvious -- you'll need to look at the file directly.
>>> Is Stata reading the first 184 observations, or are the 184 observations
>>> from different places in the file? Check that first, and if you are
>>> getting the first 184 observations, then look at lines 184-6 (depending on
>>> whether the file has a header line). Something has to be going on there.
>>>
>>>
>>> -- Phil
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/