Joseph's method of reading in data as one
or more big string variables and then
applying -split- and -destring- is one
I often use myself for reproducing small example
datasets from Statalist postings. More precisely,
I copy and paste into the Editor; the embedded spaces
usually present then cause the Editor to treat the pasted
material as a string variable. I'd say the method
is useful so long as you know exactly what you are doing.
I'd like to see a write-up of Joseph's regexp application!
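
A minimal sketch of the workflow just described, for a small example pasted
in as one string variable. The variable names (raw, v) and the toy values are
hypothetical and not taken from any posting; the data use single spaces as
delimiters.

```stata
* Sketch only: raw and v* are made-up names, and the two records are toy data.
clear
input str244 raw
"1 2.5 abc"
"4 3.5 def"
end

* Default parsing is on spaces; new variables are v1, v2, v3.
split raw, generate(v)

* Converts v1 and v2 to numeric; v3 contains nonnumeric
* characters, so -destring- leaves it as a string.
destring v*, replace

* Shrink raw and v3 to the shortest string type that holds them.
compress
list
```

Anything with internal spaces inside a field would need a different parse()
string, as in Joseph's first example.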
But three small caveats:
1. -split- and -destring- are commands defined
by .ado files, so if this were to be used repeatedly
-- not something Joseph is recommending --
then it would be very inefficient because of the
overhead of interpretation.
2. Creating a variable as -str244- and then
-compress-ing it can of course make huge demands
on storage, albeit briefly.
3. Most dangerous is the tacit assumption that the
boundary at column 1 + 244j (245, 489, etc.) doesn't
fall in the middle of a field.
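
To make caveat 3 concrete, here is a toy illustration. It assumes a chunk
width of 8 rather than 244, purely so that the record fits on one line; the
variable names (record, chunk1, chunk2, cut) are hypothetical.

```stata
* Toy version of reading a long record in fixed-width chunks.
* The width 8 stands in for 244; all names here are made up.
clear
input str20 record
"alpha 12345 beta"
end

local w = 8
generate str8 chunk1 = substr(record, 1, `w')
generate str8 chunk2 = substr(record, `w' + 1, `w')

* Flag records in which the boundary falls inside a field:
* the record runs past the boundary and neither neighbouring
* character is a space.
generate byte cut = strlen(record) > `w'
replace cut = 0 if substr(record, `w', 1) == " " | substr(record, `w' + 1, 1) == " "

list chunk1 chunk2 cut
```

Here the field 12345 is split into 12 and 345, so applying -split- to each
chunk separately would silently produce two wrong values. A check along
these lines before splitting would at least flag the affected records.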
Nick
[email protected]
Joseph Coveney
>
> Ronnie Babigumira wrote:
>
> Yes I have used and continue to use -insheet- (mainly if I have tab
> delimited data from excel) and specifying the string length is not a
> problem with -insheet-. That said, there are situations where -infile-
> is the appropriate command and of course -input- is invaluable when I
> want to input a few entries. In this case I have to specify the length
> of string variables
>
> Ada Ma wrote:
>
> have you tried -insheet-??
>
> <snip>
> Is there something in newer versions of Stata that would save me from
> guessing the length of strings when using -infile- and -input-
> </snip>
>
> ------------------------------------------------------------------------
>
> For -infile-:
>
> If your input file is space-delimited (that is, spaces aren't used to
> represent missing values and there aren't internal spaces in strings),
> then you can use -split- after -infix str v1 1-244 using <filename>-
> for record lengths up to 244 bytes. You can then -destring- to restore
> numeric variables.
>
> In cases of multispace-delimited files (typically used, for example,
> where there are internal spaces in strings), then I believe that you
> can specify a multispace parsing string with -split-. (See first
> example below.) Be aware that -infix- strips leading spaces at the
> beginning of the record; -filefilter- can help to remedy that
> beforehand if needed.
>
> In cases where the input file is messy, you can use Stata's
> conventional string functions and new regular expression functions
> after -infix str v1 1-244 using <filename>-. I've just finished such a
> project (the data were imbedded in prettily formatted .pdf files), and
> Stata's regular expression functions were a godsend.
>
> If the record length is longer than 244, then I believe that you can
> -infix str v1 1-244 str v2 245-488 . . . using <filename>-, and
> proceed as above.
>
> For -input-:
>
> You don't actually need to guess string length in order to use
> -input-. Just specify the maximum and away you go. (See second example
> below.)
>
> Joseph Coveney
>
>
> . set obs 2
> obs was 0, now 2
>
> . input str244 a
>
>
> >
> > a
> 1. "a b c d"
> 2. "e f g h i"
>
> . split a, generate(b) parse(" ")
> variables created as string:
> b1 b2 b3
>
> . list b*, noobs
>
> +----------------+
> | b1 b2 b3 |
> |----------------|
> | a b c d |
> | e f g h i |
> +----------------+
>
> . clear
>
> . input str244 a byte b str244 c int d str244 e float f
>
>
> >
> > a b
> >
> >
> > c d
> >
> >
> > e f
> 1. abc 3 def 200 ghi 1001.1
> 2. lmn -1 opq .m "" 10000
> 3. end
>
> . compress
> a was str244 now str3
> c was str244 now str3
> e was str244 now str3
>
> . list, noobs
>
>      +-------------------------------------+
>      |   a    b     c     d     e        f |
>      |-------------------------------------|
>      | abc    3   def   200   ghi   1001.1 |
>      | lmn   -1   opq    .m          10000 |
>      +-------------------------------------+
>