Re: st: reading a txt file that loops
From: David Kantor <[email protected]>
To: [email protected]
Subject: Re: st: reading a txt file that loops
Date: Sat, 16 Apr 2011 23:31:10 -0400
At 08:35 AM 4/16/2011, Sears Generic wrote:
> Are there any shortcuts to reading a data file that has the following format
> other than to reorganize the data before importing? The data file is for
> population by year by geographic location (e.g. United States, Indiana, then
> 3 counties in Indiana). "FIPS" is a unique identifier for each county. The
> problem is that the text file loops (i.e. only provides 4 decades of data
> before starting over) on a new line. In the example below I've reduced the
> issue to the United States, Indiana, and 3 counties, but the full dataset
> has every county for every state so the looping does not recur in a
> consistent way. Any suggestions would be appreciated.
>
> FIPS 1990 1980 1970 1960
> 00000 248709873 226545805 203211926 179323175 United States
> 18000 5544159 5490224 5193669 4662498 Indiana
> 18001 31095 29619 26871 24643 Adams County
> 18003 300836 294335 280455 232196 Allen County
> 18005 63657 65088 57022 48198 Bartholomew County
> FIPS 1950 1940 1930 1920
> 00000 151325798 132164569 12320262 106021537 United States
> 18000 3934224 3427796 3238503 2930390 Indiana
> 18001 22393 21254 19957 20503 Adams County
> 18003 183722 155084 146743 114303 Allen County
> 18005 36108 28276 24864 23887 Bartholomew County

It seems that you have two kinds of lines: data lines and header lines.
You need a dictionary that accommodates both kinds.
Then, as you read in the data, the variables meant for data lines will be
meaningless on header lines, and vice versa.
Find a way to determine which line type each record is; for example,
test whether the first variable is "FIPS", which signifies a header line.
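
Here is a minimal sketch of the line-type test. Note that it swaps the dictionary for a simpler device: each raw line sits in one string variable (called line here, my name for it) and is picked apart with -word()- and -regexm()-. The toy lines are typed in with -input- so that the example runs on its own; with a real file you would first get each line into such a variable, perhaps with something like -infix str line 1-200 using popdata.txt- (file name hypothetical). The names isheader, firstyear, v1-v4 and name are likewise just illustrative.

```stata
clear
input str80 line
"FIPS 1990 1980 1970 1960"
"00000 248709873 226545805 203211926 179323175 United States"
"18001 31095 29619 26871 24643 Adams County"
"FIPS 1950 1940 1930 1920"
"18001 22393 21254 19957 20503 Adams County"
end

* A header line is one whose first word is "FIPS"
gen byte isheader = (word(line, 1) == "FIPS")

* Header-line variable: the first year named on the header; missing on data lines
gen int firstyear = real(word(line, 2)) if isheader

* Data-line variables: missing on header lines
gen str5 fips = word(line, 1) if !isheader
forvalues j = 1/4 {
    * long rather than float, so that large populations are stored exactly
    gen long v`j' = real(word(line, `j' + 1)) if !isheader
}

* The place name is whatever follows the five numeric fields;
* header lines cannot match because "FIPS" is not numeric
gen str40 name = regexs(1) if regexm(line, "^[0-9]+ +[0-9]+ +[0-9]+ +[0-9]+ +[0-9]+ +(.*)$")

* Checks on the toy data
assert isheader == mi(fips)
assert name == "Adams County" if fips == "18001"
list isheader firstyear fips v1-v4 name
```

The FIPS code is kept as a string so that leading zeros (00000) survive.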
Furthermore, there are several different types of header lines; I see
two here, but maybe there are more. The two I see are...
1: for 1990, 1980, 1970, 1960
2: for 1950, 1940, 1930, 1920
Create a variable, say headertype, that indicates which type of header
line is present, then carry that value forward over the subsequent data lines.
You can use
  replace headertype = headertype[_n-1] if mi(headertype) & ~mi(headertype[_n-1])
assuming that headertype is initially missing on data lines.
You can also use -carryforward- from SSC.
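
A small sketch of the carry-forward, on toy lines held in a string variable called line (my name for it). It assumes that a header can be recognized by an exact match on its text, which depends on the spacing in the real file; testing the first year on the header would be more forgiving.

```stata
clear
input str60 line
"FIPS 1990 1980 1970 1960"
"18001 31095 29619 26871 24643 Adams County"
"18003 300836 294335 280455 232196 Allen County"
"FIPS 1950 1940 1930 1920"
"18001 22393 21254 19957 20503 Adams County"
"18003 183722 155084 146743 114303 Allen County"
end

* headertype is set on header lines only; it stays missing on data lines
gen byte headertype = .
replace headertype = 1 if line == "FIPS 1990 1980 1970 1960"
replace headertype = 2 if line == "FIPS 1950 1940 1930 1920"

* -replace- works through the observations in order, so each filled-in
* value is available to the observation that follows it
replace headertype = headertype[_n-1] if mi(headertype) & !mi(headertype[_n-1])

* Every line now knows which header it falls under
assert !mi(headertype)
assert headertype == 1 in 1/3
assert headertype == 2 in 4/6
list, sepby(headertype)
```

This relies on the observations still being in file order, so do it before any -sort-.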
Now save the result as a tempfile.
Loop through the headertypes; for each headertype:
  -use- only the observations of that type from the tempfile
  (-use if headertype == # using ...-);
  keep only the data lines and the variables that pertain to them, plus
  headertype;
  based on headertype, rename the variables that contain the
  population to something meaningful, such as pop1990, pop1980, etc.
  (the numeric suffixes are necessary if you are to do the reshape
  in the next step);
  here, you may want to -reshape long-;
  -save- under a tempfile name that you can reconstruct later (e.g.,
  t1 and t2 for headertypes 1 and 2).
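
The loop might look like the sketch below. It starts from toy data that have already been parsed and tagged; the variable names (v1-v4, name) and the locals years1 and years2, which map each headertype to its four years, are my own assumptions, not anything the file dictates.

```stata
clear
input str5 fips long v1 long v2 long v3 long v4 str20 name byte headertype
"18001"  31095  29619  26871  24643 "Adams County" 1
"18003" 300836 294335 280455 232196 "Allen County" 1
"18001"  22393  21254  19957  20503 "Adams County" 2
"18003" 183722 155084 146743 114303 "Allen County" 2
end

tempfile all t1 t2
save `all'

* Years, in column order, that go with each headertype
local years1 1990 1980 1970 1960
local years2 1950 1940 1930 1920

forvalues h = 1/2 {
    * load only the observations that fall under this header
    use if headertype == `h' using `all', clear
    keep fips name headertype v1-v4

    * v1..v4 become pop<year> according to the header
    forvalues j = 1/4 {
        local y : word `j' of `years`h''
        rename v`j' pop`y'
    }

    * one observation per place and year
    reshape long pop, i(fips) j(year)
    assert _N == 8

    * `t`h'' resolves to `t1', then `t2'
    save `t`h''
}
```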
Finally, pull these files together:
  -use- the first of those tempfiles (maybe `t1');
  for each of the remaining tempfiles,
  -append- it if you did the -reshape long- step as mentioned above;
  -merge- it (on FIPS) otherwise.
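
For the long layout, the final step could be sketched like this, again on toy data with two tempfiles built on the spot so that the example runs by itself:

```stata
* First piece: the decades under the first header
clear
input str5 fips int year long pop
"18001" 1990 31095
"18001" 1980 29619
end
tempfile t1 t2
save `t1'

* Second piece: the decades under the second header
clear
input str5 fips int year long pop
"18001" 1950 22393
"18001" 1940 21254
end
save `t2'

* Stack the pieces
use `t1', clear
append using `t2'

* Each place-year should now appear exactly once
sort fips year
isid fips year
assert _N == 4
list, sepby(fips)

* If the pieces had been left wide (pop1990 ... in one file, pop1950 ...
* in the other), you would instead combine them side by side, e.g.
* -merge 1:1 fips using- in Stata 11 or later.
```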
HTH
--David