Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Conditional infile statements
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Conditional infile statements
Date: Sun, 20 Nov 2011 13:57:35 +0000
-strparse- (SSC) was superseded long ago by the official command
-split-. I doubt that -split- on its own will help Gordon much, but
reading each line in as one string variable and then applying some mix
of -split- and -reshape- might help.
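As a rough illustration of the one-string-then-split idea, assuming
space-separated fields and the simplified line layouts in Gordon's
example, something like the following might serve. The sample lines are
typed in with -input- only so that the sketch runs as it stands, and the
names flag1-flag5 and the "2*" value are made up to show one possible
way of handling the "*" flag.

```stata
* Sketch only: sample lines stand in for the real file.
clear
input str30 sline
"type1 id 1 2* 3 4 5"
"type2 id ABC DEF FGH"
"type3 id IJK 3 4 XYZ"
end

* Break each line on spaces into tok1, tok2, ...
split sline, generate(tok)
rename tok1 ltype
rename tok2 id

* type1 lines: tokens 3-7 are numbers, possibly followed by a "*" flag
forvalues j = 1/5 {
    local k = `j' + 2
    generate byte flag`j' = strpos(tok`k', "*") > 0 if ltype == "type1"
    generate var`j' = real(subinstr(tok`k', "*", "", .)) if ltype == "type1"
}

* type2 lines: tokens 3-5 are strings
forvalues j = 1/3 {
    local k = `j' + 2
    generate char`j' = tok`k' if ltype == "type2"
}

drop sline tok*
list ltype id var1 var2 flag2 char1, noobs
```

Note that -split- adds further string variables before the originals
are dropped, so this does nothing for the memory concern.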
I would also commend -file- here, or Unix utilities such as awk
(available in ports to Windows).
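A minimal sketch of the -file- route, under the assumption that each
line type can be copied to its own file and then read with ordinary
free-format -infile-. The file names sample_raw.txt and type1_only.txt
are placeholders, and a small sample file is written first only so that
the block runs as it stands.

```stata
* Sketch only: file names are placeholders, not from the original post.
tempname in out

* Write a small sample file so that the example is self-contained
file open `out' using "sample_raw.txt", write text replace
file write `out' "type1 id 1 2 3 4 5" _n
file write `out' "type2 id ABC DEF FGH" _n
file write `out' "type1 id 5 6 7 8 9" _n
file close `out'

* One pass through the file, copying only type1 lines to their own file
file open `in' using "sample_raw.txt", read text
file open `out' using "type1_only.txt", write text replace
file read `in' line
while r(eof) == 0 {
    if substr(`"`macval(line)'"', 1, 5) == "type1" {
        file write `out' `"`macval(line)'"' _n
    }
    file read `in' line
}
file close `in'
file close `out'

* The type1 file is rectangular, so free-format -infile- can read it
infile str5 ltype str10 id var1-var5 using "type1_only.txt", clear
list, noobs
```

The same loop could write type2 and type3 lines to further files. Values
carrying a "*" flag would come in as missing if read as numeric here, so
those fields would need to be read as strings first. A line-by-line
-file- loop may also be slow on a very large file.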
Nick
On Sun, Nov 20, 2011 at 12:40 PM, Gordon Hughes <[email protected]> wrote:
> I would like to read a *very* large dataset using conditional infile
> statements. With some oversimplification the structure of the data is as
> follows:
>
> Line 1: type1 id 1 2 3 4 5
> Line 2: type1 id 3 4 5 6 7
> Line 3: type2 id ABC DEF FGH
> Line 4: type1 id 5 6 7 8 9
> Line 5: type3 id IJK 3 4 XYZ
> ...
>
> The format of the data on each line is fixed but the formatting varies
> according to the value of the first variable on the line. For practical
> purposes the data may be treated as having one line per observation but with
> different variables recorded for the different line types. There is no
> consistent pattern of the occurrence of lines of different types.
>
> In high-level programming languages, such as SAS and some others, it is
> possible to read such data using the following generic code:
>
> read str ltype @
> if ltype=="type1" {read id str type var1-var5}
> if ltype=="type2" {read id str type str char1 str char2 str char3}
> if ltype=="type3" {read id str char4 var6 var7 str char5}
>
> where the @ character holds the current line for re-reading. As far as I
> can work out this is not possible, at least directly, in Stata.
>
> In fact the problem is even worse than this description implies because many
> of the variables have the form "123*" where 123 is a value and "*" may or
> may not be present and indicates a flag or note.
>
> There is a way of doing this but to my mind it is clumsy:
>
> infix str sline 1-30 using ...
> gen ltype=substr(sline, 1, 5)
> gen var1=real(substr(sline, 6, 2)) if ltype=="type1"
> ....
>
> The user-written routine -strparse- can also be deployed for free format
> data, but again it involves the use of sub-string manipulation. I cannot
> locate any other user-written routine which provides a better way of doing
> this, but my -net search- terms may not pick up the right keywords.
>
> I would appreciate any suggestions as to a better way of doing this - or
> should I just resign myself to writing the code required to parse each line?
> (Incidentally, one reason for my reluctance to do this is that it increases
> the maximum memory size required to hold the initial pass through the data.)
>