Simon Wheeler <[email protected]> has a data-input question:
> I have recently purchased a dataset which contains food codes in 7-bit
> ASCII format [...]
>
> The data as they stand fall into 2 columns: varname|value. NUMB is the
> food code - i.e. the variable that I want to sort the data by.
> would like to have all of the other variables along the top as
> separate columns, so that I end up with individual food codes and their
> nutrient values - see below. With these data I can then do dietary
> analysis.
>
> This format goes on for several thousand different food codes, over and
> over. Codes are separated by the line ***
>
> How it is now:
>
> RECTYPE 1
> NUMB 14002
> CONST 50852
> PREV 50852
> GROUP FA
> NAME Apples, cooking, raw, peeled
> EDPOR 1
> WATER 87.7
> TOTNIT 0.05
> PROT 0.3
> FAT 0.1
> CHO 8.9
> KCALS 35
> KJ 151
> STAR Tr
> TOTSUG 8.9
> GLUC 2
> FRUCT 5.9
> SUCR 1
> MALT 0
> LACT 0
> SOUTHFIB 2.2
> ENGFIB 1.6
> CELL 0.6
> SOLNCP 0.6
> INSOLNCP 0.4
> LIGNIN Tr
> NA 2
> ***
> RECTYPE 1
> NUMB 14003
> CONST 50853
> PREV 50853
> GROUP FA
> NAME Apples, cooking, weighed with skin and core
> EDPOR 0.73
> WATER 63.1
>
> etc
>
> What I would like to have:
>
> Code | Description | Group | Water | Fat | CHO | Protein | Sodium | Potassium
>
> 14001| Apples | Fruit | 26 | 5.1 | 4.2 | 3.6 | 0.1 | 0.04
> 14002| Oranges | Fruit | etc
There are two appraches to problems like this -- one uses Stata and the other
uses Mata -- but the underlying logic is the same either way.
Let's use Stata. One begins by reading each line of the data into a string
variable. We then give Stata commands to, step by step, move the data
into the form we want. The wonderful thing about this approach is that
we never have to form a grand plan. We just look at what we've got,
fix some problem and move a little closer, and repeat the process
until we arrive.
Simon ultimately wants the data in wide form. Given the format of
this data, however, it will easiest to get the data into long form and
then turn that into wide form in the standard way. So let's set the
following goal for ourselves, to get the data into the form:
code description grouip category value
14001 Apples Fruit Water 26
14001 Apples Fruit Fat 5.1
14001 Apples Fruit CHO 4.2
...
14001 Oranges Fruit etc.
Step 1. Infile the data
------------------------
Here is what we do:
. infix str line 1-80 using <filename>
. compress
line was str80 now str41
The second line just shortens variable line. I just made up the
"line was str80 not str41" part. However, if -compress- produces no
message, then the longest line was 80 characters, and that means the
real maximum was probabliy longer than 80. In that case, redo the -infix-
with a longer string.
Anyway, now we have a dataset that looks like,
line
---------------------------------------------
1. RECTYPE 1
2. NUMB 14002
3. CONST 50852
4. PREV 50852
5. GROUP FA
6. NAME Apples, cooking, raw, peeled
7. EDPOR 1
8. WATER 87.7
9. etc.
---------------------------------------------
If we list the data in Stata, the lines will look right justified, but
that's because of the display format. Let's change it. We type
-describe- and learn the dipslay format is %41s, so let's change it
to a %-41s format:
. describe
<output omitted>
. format line %-41s
Now -list- will look better. The data I have looks left-justified,
but I'm worried some line has a leading blank. I worry about trailing
blanks, too, so let's get rid of any that happen to be around
. replace line = strtrim(line)
(0 real changes made)
Step 2: Get a record id
------------------------
Type the following:
. gen long recnum = _n
This is very important. At this stage, the meaning of the data is
dependent on order. If we were accidently to sort the data, it would
become meaningless. Now we can get back to the original order.
Step 3: Break line into first word and rest
--------------------------------------------
Our obsrvations contain lines like "RECTYPE 1", "NUMB 14002", etc.
Let's break out the first word, so that we have
first rest
---------------------------------------------
1. RECTYPE 1
2. NUMB 14002
3. CONST 50852
4. PREV 50852
5. GROUP FA
6. NAME Apples, cooking, raw, peeled
7. EDPOR 1
8. WATER 87.7
9. etc.
---------------------------------------------
Getting the first word is easy, we could just code
. gen first = word(line, 1)
but getting the rest of the line, spaced exactly as it was originally, is more
difficult. One way to proceed
. gen blank = strpos(line, " ")
. gen str first = strtrim(substr(line, 1, blank)) if blank
. replace first = line if blank==0
. gen str rest = strtrim(substr(line, blank, .)) if blank
. drop blank
Now let's list a little of the data, make sure we did that right, and then
. drop line
Step 4: Clean up
----------------
Can we drop some lines? Simon didn't tell us about RECTYPE lines. I'm
going to assume they are irrelevant and get rid of them.
. drop if first=="RECTYPE"
Maybe Simon left somthing unsaid about RECTYPE and we'll need to get rid of
the RECYTPES other than 1, including recrods that follow them. If so, skip
this step and do that later. I just want to get rid of any single-line junk I
can right now.
Step 5: Identifier variable
----------------------------