Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Conditional infile statements
Date: Sun, 20 Nov 2011 13:57:35 +0000

-strparse- (SSC) was superseded long ago by the official command -split-.
I think that may not help Gordon much by itself, but reading each line in
as one string variable and then using some mix of -split- and -reshape-
might help. I would also commend -file- here, or Unix utilities such as
awk (available in ports to Windows).

Nick

On Sun, Nov 20, 2011 at 12:40 PM, Gordon Hughes <G.A.Hughes@ed.ac.uk> wrote:
> I would like to read a *very* large dataset using conditional infile
> statements. With some oversimplification the structure of the data is as
> follows:
>
> Line 1: type1 id 1 2 3 4 5
> Line 2: type1 id 3 4 5 6 7
> Line 3: type2 id ABC DEF FGH
> Line 4: type1 id 5 6 7 8 9
> Line 5: type3 id IJK 3 4 XYZ
> ...
>
> The format of the data on each line is fixed but the formatting varies
> according to the value of the first variable on the line. For practical
> purposes the data may be treated as having one line per observation but with
> different variables recorded for the different line types. There is no
> consistent pattern of the occurrence of lines of different types.
>
> In SAS and some other high-level programming languages it is
> possible to read such data using the following generic code:
>
> read str ltype @
> if ltype=="type1" {read id str type var1-var5}
> if ltype=="type2" {read id str type str char1 str char2 str char3}
> if ltype=="type3" {read id str char4 var6 var7 str char5}
>
> where the @ character holds the current line for re-reading. As far as I
> can work out this is not possible, at least directly, in Stata.
>
> In fact the problem is even worse than this description implies because many
> of the variables have the form "123*" where 123 is a value and "*" may or
> may not be present and indicates a flag or note.
>
> There is a way of doing this but to my mind it is clumsy:
>
> infix str sline 1-30 using ...
> gen ltype=substr(sline, 1, 5)
> gen var1=real(substr(sline, 6, 2)) if ltype=="type1"
> ....
>
> The user-written routine -strparse- can also be deployed for free format
> data, but again it involves the use of sub-string manipulation. I cannot
> locate any other user-written routine which provides a better way of doing
> this, but my -net search- terms may not pick up the right keywords.
>
> I would appreciate any suggestions as to a better way of doing this - or
> should I just resign myself to writing the code required to parse each line?
> (Incidentally, one reason for my reluctance to do this is that it increases
> the maximum memory size required to hold the initial pass through the data.)
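
The suggestion above of reading each line as one string variable and then
applying -split- could look roughly like the sketch below. It is only an
illustration under stated assumptions, not a tested solution for Gordon's
file: the file name "mydata.txt", the 80-character line width, the
space-separated layout, and all variable names (tok*, flag*, and so on)
are hypothetical. It also assumes at least one type1 line is present, so
that -split- creates tok1 through tok7.

```stata
* Sketch only. Assumptions: the file name "mydata.txt", a maximum line
* width of 80 characters, space-separated fields, and all variable names.
infix str sline 1-80 using "mydata.txt", clear

* Split each line on spaces: tok1 is the line type, tok2 the id.
split sline, generate(tok)
rename tok1 ltype
rename tok2 id

* type1 lines: five numeric fields, each possibly carrying a "*" flag.
forvalues j = 1/5 {
    local k = `j' + 2
    generate byte flag`j' = strpos(tok`k', "*") > 0 if ltype == "type1"
    generate var`j' = real(subinstr(tok`k', "*", "", .)) if ltype == "type1"
}

* type2 lines: three string fields.
forvalues j = 1/3 {
    local k = `j' + 2
    generate char`j' = tok`k' if ltype == "type2"
}

* type3 lines: one string, two numbers, one string.
generate char4 = tok3 if ltype == "type3"
generate var6 = real(tok4) if ltype == "type3"
generate var7 = real(tok5) if ltype == "type3"
generate char5 = tok6 if ltype == "type3"

* Drop the raw line and the remaining tokens once parsing is done.
drop sline tok*
```

This approach still holds every line in memory as a string before parsing,
which is the cost Gordon mentions. Separating the file by line type
beforehand, for example with awk, and reading each piece on its own may
reduce that.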