Re: st: insheet and dropping cases
From: Sergiy Radyakin <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: insheet and dropping cases
Date: Thu, 20 Feb 2014 21:28:38 -0500
Dear Ben,
with such a replacement, the offending text (quoted earlier)
N 89 DEG 46'47" E
will turn into
N 89 DEG 46'47' E
which might cause problems later if you ever need to recover the exact
coordinates from imported data.
Other records that you showed were already in DEG-MIN-SEC notation:
N 00 DEG 50 MIN 54 SEC W 431. 78 FT, N 47 DEG 30 MIN W 522.2 6 FT
Given that, I would probably first convert the data, replacing the
double quotes with something bright and shiny like #@#@#@ (literally
these symbols, don't get me wrong), then assert that this sequence
appears only in the variable holding the coordinates, then go back to
the original files and apply -filefilter- to replace the double quotes
with SEC, matching the alternative coordinate notation already present
in the file. That way you get a cleaner file and solve the import
problem.
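A minimal sketch of that sequence (untested; IL.txt is the file from
your -filefilter- workaround below, and legaldesc is a made-up name for
whatever variable holds the coordinates):

filefilter IL.txt IL_marked.txt, from(\Q) to("#@#@#@") replace
insheet using IL_marked.txt, delimiter("|") clear
* the marker should turn up only in the coordinate variable
foreach v of varlist _all {
    capture confirm string variable `v'
    if !_rc & "`v'" != "legaldesc" {
        assert !strpos(`v', "#@#@#@")
    }
}
* once satisfied, go back to the original file and substitute SEC
filefilter IL.txt IL_sec.txt, from(\Q) to(" SEC ") replace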
Alternatively, the standalone converter tab2dta.exe (shameless
self-promotion) is tolerant of unbalanced quotes within string values,
but requires the tab character as a separator:
http://radyakin.org/transfer/tab2dta/tab2dta.htm
Replacing pipes with tabs should be straightforward with -filefilter-.
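For example, something along these lines (a rough sketch; \t is
-filefilter-'s escape for the tab character):

filefilter IL.txt IL_tab.txt, from("|") to(\t) replace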
With this approach you distort the separators and leave the content
intact, while with the first approach you distort the content and leave
the separators intact. Once the data are read into Stata, the
separators don't exist anymore, so nobody is hurt if they were
transformed in the process. Content alterations might matter.
Best, Sergiy Radyakin
On Thu, Feb 20, 2014 at 3:48 PM, Ben Hoen <[email protected]> wrote:
> I found the workaround for changing the double quote to single:
>
> filefilter IL.txt IL2.txt, from(\Q) to(\RQ) replace
>
> Thank you all for helping me through this frustrating problem today.
>
> As always, I really do not know what I would do without the brilliance and
> helpfulness of this online community.
>
> Cheers,
>
> Ben
>
> Ben Hoen
> LBNL
> Office: 845-758-1896
> Cell: 718-812-7589
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Sergiy Radyakin
> Sent: Thursday, February 20, 2014 3:21 PM
> To: [email protected]
> Subject: Re: st: insheet and dropping cases
>
> Hello Ben,
> the report is helpful, and it is safe to post it since it is Stata's
> output, which doesn't contain anything unprintable. Note how Stata
> writes the escape sequences \n and \r for the unprintable characters
> 10 and 13. For a description of the unprintable ASCII characters and
> their role in controlling text, see e.g. the following page:
> http://www.juniper.net/techpubs/en_US/idp5.1/topics/reference/general/intrusion-detection-prevention-custom-attack-object-extended-ascii.html
> or google them (plenty of links); most of them are archaic.
>
> We focus on the 0-31 range. You only have 10 and 13, which is the
> typical end-of-line pattern \r\n. There are no gremlins to zap, so to
> speak. Also, \r and \n have the same frequency, which means that they
> are also likely to be properly paired at the ends of the lines.
>
> There is also nothing in the upper (non-ASCII) range, characters 128-255.
>
> To be sure that the report itself is correct, verify that the total
> file length as reported by the OS equals the sum of the frequencies of
> all characters (394,625).
>
> I note the use of single and double quotes to denote minutes and
> seconds in the coordinates. Perhaps this is what confuses Stata. In
> some records you posted I see "MIN" as a word; in some cases it is ".
> When it sees a quote, even in a something-separated file, Stata will
> seek to the end of the quoted string, which could be a long way from
> where the quote opened. If you expect double quotes to denote seconds
> and single quotes minutes, run -filefilter- for them in advance, and
> retry.
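>
> A rough sketch of that pre-pass (untested; the exact spacing of the
> replacement text is up to you):
>
> filefilter IL.txt IL_min.txt, from(\RQ) to(" MIN ") replace
> filefilter IL_min.txt IL_minsec.txt, from(\Q) to(" SEC ") replace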
>
> Hope this helps, Sergiy Radyakin
>
>
>
>
>
> On Thu, Feb 20, 2014 at 2:19 PM, Ben Hoen <[email protected]> wrote:
>> Hi Sergiy,
>>
>> I am pasting in the tabulation from -hexdump- (not knowing how to
>> provide a link to those files as you suggested):
>>
>> Tabulation (character not listed if unobserved):
>> Dec Hex Char Frequency
>> ------------------------------
>> 010 0a \n 364
>> 013 0d \r 364
>> 032 20 blank 9,621
>> 033 21 ! 9
>> 034 22 " 5
>> 035 23 # 21
>> 038 26 & 202
>> 039 27 ' 135
>> 040 28 ( 30
>> 041 29 ) 29
>> 042 2a * 4
>> 043 2b + 7
>> 044 2c , 112
>> 045 2d - 3,378
>> 046 2e . 282
>> 047 2f / 337
>> 048 30 0 157,131
>> 049 31 1 18,056
>> 050 32 2 13,187
>> 051 33 3 8,837
>> 052 34 4 8,087
>> 053 35 5 6,803
>> 054 36 6 7,456
>> 055 37 7 6,283
>> 056 38 8 6,322
>> 057 39 9 6,333
>> 058 3a : 98
>> 059 3b ; 24
>> 064 40 @ 1
>> 065 41 A 5,418
>> 066 42 B 1,197
>> 067 43 C 3,167
>> 068 44 D 2,399
>> 069 45 E 5,718
>> 070 46 F 1,067
>> 071 47 G 1,612
>> 072 48 H 1,597
>> 073 49 I 4,112
>> 074 4a J 300
>> 075 4b K 873
>> 076 4c L 4,877
>> 077 4d M 1,693
>> 078 4e N 4,099
>> 079 4f O 4,254
>> 080 50 P 1,343
>> 081 51 Q 149
>> 082 52 R 4,634
>> 083 53 S 3,272
>> 084 54 T 3,756
>> 085 55 U 1,162
>> 086 56 V 865
>> 087 57 W 1,488
>> 088 58 X 151
>> 089 59 Y 1,369
>> 090 5a Z 67
>> 095 5f _ 726
>> 097 61 a 817
>> 098 62 b 73
>> 099 63 c 147
>> 100 64 d 323
>> 101 65 e 887
>> 102 66 f 65
>> 103 67 g 199
>> 104 68 h 189
>> 105 69 i 498
>> 107 6b k 233
>> 108 6c l 419
>> 109 6d m 111
>> 110 6e n 616
>> 111 6f o 872
>> 112 70 p 107
>> 113 71 q 4
>> 114 72 r 581
>> 115 73 s 252
>> 116 74 t 390
>> 117 75 u 132
>> 118 76 v 172
>> 119 77 w 74
>> 120 78 x 13
>> 121 79 y 90
>> 122 7a z 7
>> 124 7c | 72,436
>> 125 7d } 35
>> ------------------------------
>> Total 394,625
>>
>> It is not clear to me what the problem characters are -
>> unprintable/special or not - but I tried replacing the "}" character
>> (and the comma previously) to no avail.
>>
>> Separately, I think I have isolated the fields that contain the
>> problems. Is there a way to ignore/remove individual fields in a txt
>> file from within Stata?
>>
>> Thank you for your efforts in helping me with this issue.
>>
>> Ben
>>
>> Ben Hoen
>> LBNL
>> Office: 845-758-1896
>> Cell: 718-812-7589
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Sergiy Radyakin
>> Sent: Thursday, February 20, 2014 1:28 PM
>> To: [email protected]
>> Subject: Re: st: insheet and dropping cases
>>
>> Ben,
>>
>> -- the problem is likely caused by the presence of unprintable
>> characters in the file that are tolerated by Stat/Transfer but not by
>> Stata;
>>
>> -- the character with ASCII code 255 is a usual suspect;
>>
>> -- pasting raw data to Statalist is unlikely to reveal the problem,
>> since the special characters might not survive massaging through
>> email;
>>
>> -- isolating the problem in a text editor into a new file could help
>> (keep the last record that was read in correctly and the one
>> immediately after it), then make the file available through a link,
>> to retain its binary structure; not all text editors will retain
>> special characters on save;
>>
>> -- use hexdump "file", analyze tabulate to see the unprintable
>> characters, then search for them in the file or use -filefilter- (a
>> sketch follows this list);
>>
>> -- see "zap gremlins" for a relevant tactic.
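>>
>> A rough sketch of those last two steps (IL.txt stands in for the
>> offending file, and I am assuming the culprit is character 255; the
>> \255d decimal-code escape should be listed in -help filefilter-, but
>> please double-check it there):
>>
>> hexdump IL.txt, analyze tabulate
>> filefilter IL.txt IL_clean.txt, from(\255d) to(" ") replace
>>
>> The second command replaces each occurrence of character 255 with a
>> plain space rather than deleting it outright.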
>>
>> On the bright side: you are lucky you have only 363 cases. Last time
>> I had this problem, only 16 GB out of 40 GB were read in. Try opening
>> that file in Notepad :)
>>
>> Hope this helps.
>>
>> Best, Sergiy Radyakin
>>
>>
>> On Thu, Feb 20, 2014 at 12:34 PM, Radwin, David <[email protected]> wrote:
>>> One other possibility is to use -inputst-, a Stata program that calls
>>> Stat/Transfer (part of -stcmd- by Roger Newson and available at SSC).
>>>
>>> This workaround is probably less computationally efficient than the
>>> suggestions from others, but since you already know that Stat/Transfer
>>> works, this approach might be faster and easier than trying to figure
>>> out the problem with your text files and -insheet- or -import
>>> delimited-.
>>>
>>> David
>>> --
>>> David Radwin, Senior Research Associate
>>> Education and Workforce Development
>>> RTI International
>>> 2150 Shattuck Ave. Suite 800, Berkeley, CA 94704
>>> Phone: 510-665-8274
>>>
>>> www.rti.org/education
>>>
>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Phil Schumm
>>>> Sent: Thursday, February 20, 2014 6:38 AM
>>>> To: Statalist Statalist
>>>> Subject: Re: st: insheet and dropping cases
>>>>
>>>> On Feb 20, 2014, at 8:28 AM, Ben Hoen <[email protected]> wrote:
>>>> > Hexdump I had never used. This is what it returned:
>>>>
>>>> <snip>
>>>>
>>>> > Do you see anything suspicious here? (I replaced all the commas
>>>> > with "_", using filefilter - another great suggestion - wondering
>>>> > if that was causing any issues and insheet still returned 184
>>>> > observations.)
>>>>
>>>>
>>>> I don't see anything obvious -- you'll need to look at the file
>>>> directly. Is Stata reading the first 184 observations, or are the
>>>> 184 observations from different places in the file? Check that
>>>> first, and if you are getting the first 184 observations, then look
>>>> at lines 184-186 (depending on whether the file has a header line).
>>>> Something has to be going on there.
>>>>
>>>>
>>>> -- Phil
>>>