Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: RE: Getting rid of binary codes so I can read in files - reposted
From: "David Radwin" <[email protected]>
To: <[email protected]>
Date: Wed, 18 Jan 2012 08:32:32 -0800 (PST)

Orian,
I've never used it myself, but you might try Google Refine:
http://www.stata.com/statalist/archive/2010-11/msg00858.html
http://code.google.com/p/google-refine/
Please let us know if it works for you or not.
David
--
David Radwin
Research Associate
MPR Associates, Inc.
2150 Shattuck Ave., Suite 800
Berkeley, CA 94704
Phone: 510-849-4942
Fax: 510-849-0794
www.mprinc.com
> -----Original Message-----
> From: [email protected] [mailto:owner-[email protected]] On Behalf Of Orian Brook
> Sent: Wednesday, January 18, 2012 6:40 AM
> To: [email protected]
> Subject: st: Getting rid of binary codes so I can read in files - reposted
>
> Not lucky enough to have had any replies so far - is there anyone with any
> suggestions, or shall I just revert to Outlook?
> Thanks
> Orian
>
> Dear all,
> I'm analysing administrative data which I've had to export using an online
> database into 105 files. I've previously worked with similar files by
> importing and combining them all in Outlook, then reading into Stata using
> an ODBC link, but I'd really like to try to do it all in Stata (so I have
> the do file for repetition/audit trail purposes) but I have some problems.
> The original files have extra EOL characters, and extended ones, which I can
> get rid of using filefilter, but I still can't import the file: using
> insheet I get the correct number of rows and columns, but all cells are
> blank except the first (it has a t in it). I've also tried using infile and
> skipping the first line, to no avail. Running hexdump shows that I have over
> 2 million binary 0s, which I think may be the problem? I tried using the
> command "filefilter file1 file2, from(\00hd) to() replace" to get rid of
> them, but it hangs.
>
> Any help would be very gratefully received. The hexdump is below.
> (apologies, plain text format doesn't allow me to post this in courier or
> something more legible)
>
> Regards
> Orian Brook
>
> Line-end characters                  Line length (tab=1)
>   \r\n         (Windows)    26,823     minimum                  2
>   \r by itself (Mac)              0     maximum                403
>   \n by itself (Unix)             0
> Space/separator characters           Number of lines       26,824
>   [blank]                  107,191     EOL at EOF?             no
>   [tab]                          0
>   [comma] (,)              509,637   Length of first 5 lines
> Control characters                     Line 1                 403
>   binary 0               2,747,580     Line 2                 185
>   CTL excl. \r, \n, \t           0     Line 3                 243
>   DEL                            0     Line 4                 245
>   Extended (128-159,255)         0     Line 5                 245
> ASCII printable
>   A-Z                      189,766
>   a-z                      189,754   File format           BINARY
>   0-9                    1,509,729
>   Special (!@#$ etc.)      187,857
>   Extended (160-254)             0
>                    ---------------
> Total                    5,495,160
>
> Observed were:
>    \0 \n \r blank , - . / 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O
>    P Q R S T U V W X Y Z _ a b c d e f g h i k l m n o p q r s t u v x y
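
The hexdump counts above show 2,747,580 binary 0s out of a total of 5,495,160 bytes, which is exactly half the file. One possible workaround outside Stata is to drop the zero bytes before importing. The following is a minimal Python sketch, not something proposed in this thread. The function name strip_nul_bytes and the chunk size are illustrative assumptions, and the sketch assumes the zero bytes carry no information (for example, padding from a 16-bit text encoding, which is only a guess based on the counts).

```python
CHUNK_SIZE = 1 << 20  # assumption: read 1 MiB at a time so large files need not fit in memory


def strip_nul_bytes(src, dst):
    """Copy src to dst, dropping every 0x00 byte; return the number of bytes dropped.

    Hypothetical helper for illustration only. It assumes the zero bytes
    are padding and that removing them leaves plain delimited text.
    """
    removed = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:  # end of file reached
                break
            cleaned = chunk.replace(b"\x00", b"")
            removed += len(chunk) - len(cleaned)
            fout.write(cleaned)
    return removed
```

For example, strip_nul_bytes("file1", "file2") would mirror the file names in the filefilter command quoted above, and file2 could then be tried with insheet again. Comparing the returned count with the "binary 0" figure from hexdump is a quick sanity check that every zero byte was removed.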