An answer may still be of interest, despite
Frank's later posting that he is going to grapple
with Python.
The best books on Perl almost all come from O'Reilly: see
http://perl.oreilly.com/
The original, "Programming Perl", and one of several later
books, "Learning Perl", are both now in their 3rd editions.
Perl is a wonderful thing. One problem now, however,
is that just identifying the subset which is going to
be of most use to you can take a fair while, because
the whole language and all sorts of user-written
add-ons are in total very large (sound familiar?).
I have great admiration for Perl, and have
used it a bit, but I have an offbeat suggestion
which real Perl experts will sniff at.
They will say that what I am going to suggest
has long since been superseded by Perl -- and they
are in a sense totally right.
Awk.
There are two great advantages to Awk:
1. It is a small, compact language.
The original book by Aho, Kernighan
and Weinberger (permute AKW) is
still in print, very slim and very,
very good. Awk is a language which can be
learned quickly.
2. It has a very narrow view of the
world: its mindset is that it expects
to be looking at a text file line by
line. It can be made to do other things
fairly easily, but it is built largely
for that purpose -- which for Stata users
is of course very often exactly what you want.
Data files, program files, log files: all
have a definite line-based structure.
Yesterday before Frank's posting
I had this problem:
Meteorological observations here
at Durham have been made for >150 years
but over that time there have been many
changes in what has been measured.
The data come as a series of annual ASCII files
1849.dat, 1850.dat, ... with a series
of Stata dictionaries saying which
variables were measured in what years.
My starting point for one analysis is a
.do file reading in from each .dat file and ending with a
-save- to an annual .dta file.
The end point is one loop
use 1849
forval i = 1850/1997 {
append using `i'
}
which (eventually) ran in the blink of an eye.
But in the middle there were
problems. With one year there was
a stream of error messages which
implied that Stata was seeing
fewer data items than it expected.
A call to Awk, something like
awk " { print NF } " 1859.dat
printed 19 again and again in a long stream as the number
of fields in each record, so that was true of
at least most of the lines in the file.
But a test
awk " NF != 19 { print NR, NF } " 1859.dat
revealed a line with 18 fields: what
should have been an explicit missing value
was in fact a blank, with knock-on
effects throughout the rest of the file.
What is going on here?
1. Awk is looking at each record
(by default a line) in the file specified.
2. " { print NF } "
is a complete Awk program. It has the form
" { <action> } "
and an <action> like this is automatically
executed for each record in the file.
3. NF is an example of a built-in
variable, and gives the number
of fields (by default fields
are separated by white space).
4. " NF != 19 { print NR, NF } "
is a complete Awk program. It has the form
" <pattern> { <action> } "
and <action> is executed if and only
if <pattern> is satisfied by a record. In this
case, if the number of fields is not 19,
the program prints the number of the record
(NR is another built-in variable) and the number of
fields on that record.
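
For anyone who wants to try points 1-4 without the Durham data,
here is a minimal sketch. The sample file and its contents are made up
for illustration (4 fields per record rather than 19), and I am assuming
a Unix-like shell, which is why the Awk programs are in single quotes;
the double quotes used above are what a Windows command prompt wants.

```shell
# Hypothetical sample file: three records, the second one short by a field.
tmpfile=$(mktemp)
printf '%s\n' '1 2 3 4' '1 2 4' '1 2 3 4' > "$tmpfile"

# Action only: print the number of fields on every record.
all_counts=$(awk '{ print NF }' "$tmpfile")
echo "$all_counts"

# Pattern plus action: report only records whose field count is not 4,
# giving the record number (NR) and the number of fields (NF).
bad_records=$(awk 'NF != 4 { print NR, NF }' "$tmpfile")
echo "$bad_records"

rm -f "$tmpfile"
```

The second program should pick out record 2 as the one with only 3 fields.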
What is important here is not the details -- nor the
fact that there are other ways to tackle this,
including all-Stata solutions -- but the notion that
programs can be written on the fly both easily
and effectively.
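
As one more on-the-fly example in the same spirit, a tally of how many
records have each number of fields avoids a long stream of identical
numbers. This is only a sketch, not what I actually ran: the sample file
is again made up, the array name count is my own choice, and a Unix-like
shell with sort available is assumed.

```shell
# Hypothetical sample file: four records, one of them with only 2 fields.
tmpfile=$(mktemp)
printf '%s\n' 'a b c' 'a b c' 'a c' 'a b c' > "$tmpfile"

# Count records by number of fields; the END block runs once after the
# last record. Awk does not guarantee the order of "for (n in count)",
# so the output is piped through sort.
tally=$(awk '{ count[NF]++ } END { for (n in count) print n, count[n] }' "$tmpfile" | sort -n)
echo "$tally"

rm -f "$tmpfile"
```

Each output line gives a number of fields followed by the number of
records with that many fields, so an odd record out is evident at a glance.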
Some other uses of Awk with Stata were written up as
os13: Using awk and fgrep for selective extraction
from log files. STB-19 (5/94), pp.15--17;
STB Reprints Vol 4, pp.78--80.
It explains how to use awk to extract comments selectively
from log files and how to use fgrep to extract lines
selectively from log files.
An unorthodox introduction to an unorthodox
language is included in
A conversation on Awk. Computers & Geosciences
21, 1-6 (and 1119) (1995)
although despite my best efforts some typos
appear in the text.
Nick