You identify the problems very clearly, and you
are aware of some useful tools here. However,
a variable so messy is not going to surrender
faced with just a single weapon of mess dissection.
Here's one way of proceeding.
I copied your data and argued as follows.
1. The value is most commonly the first word
of the string, but we want it as a number.
. gen fvalue = real(word(ferritin,1))
(140 missing values generated)
2. The date is most commonly the second word
of the string, interpreted as a (recent) daily date.
. gen fdate = date(word(ferritin,2),"mdy",2050)
(162 missing values generated)
3. I see lots of values with "U"s. I want to check
that when "U" occurs it occurs only as itself.
. assert ferritin == "U" if index(ferritin, "U")
Here -assert- says nothing. No news is good news,
i.e. Stata assents to the assertion. (In this
case, all "U"s are bad news, i.e. no information
at all.)
4. So I know all about the "U" problem. What values
are missing, but not because of "U"?
. list ferritin fvalue if mi(fvalue) & ferritin != "U"
+------------------------+
| ferritin fvalue |
|------------------------|
178. | normal . |
225. | . |
286. | >1400 3/24/99 . |
324. | > 2K 1/7/02 . |
+------------------------+
You have to decide what to do about these.
5. What dates are missing, but not because of
these?
. list ferritin fdate if mi(fdate) & ferritin != "U"
+---------------------------+
| ferritin fdate |
|---------------------------|
66. | 5760 5/02 . |
70. | 4968 . |
98. | 4782 6/02 . |
117. | 1137 4/02 . |
121. | 3237 3/01 . |
|---------------------------|
126. | 1060 8/02 . |
148. | 1422 ng/ml 3/1998 . |
154. | 1014 8/01 . |
178. | normal . |
209. | 646 mg/ml 9/2000 . |
|---------------------------|
210. | 1938 . |
212. | 127 . |
225. | . |
278. | 129 6/00 . |
289. | 2000 . |
|---------------------------|
300. | 4995 . |
301. | 2748 9/02 . |
302. | 187 3/2000 . |
303. | 489 . |
305. | 3564 11/01 . |
|---------------------------|
307. | 862 1990 . |
324. | > 2K 1/7/02 . |
333. | 3039 2/02 . |
338. | 1777 ng/ml 5/02 . |
352. | 40 3/02 . |
|---------------------------|
357. | 953 . |
+---------------------------+
Similarly, you have to decide what to do here. There
are, it seems, various kinds of problem:
* No date supplied.
* A year only supplied.
* A month and year only supplied.
* A day, month and year supplied.
* More than two words in the string.
In fact, we would have been better off
trying to extract the date from
-word(ferritin,-1)-, and that would
yield one further daily date.
You can use -assert- to test for truth
or falsity. Thus
. assert 42 == int(42)
asserts that 42 is an integer. More
usefully
. assert x == int(x)
asserts that -x- contains integer
values only, as only for integers
is -x- unchanged by applying the -int()-
function.
Similarly
. assert real(date(stringvar, "mdy", 2050)) < .
asserts that all values of stringvar can
be treated as daily dates. By itself -assert- has
very little syntax, and using it requires knowledge
of other parts of Stata.
Nick
[email protected]
Ward Hagar
>
> I am new to Stata and new to databases. I've been trying to
> get an access database "Stata ready". I can insheet it well
> and have "cleaned up" most of the variables. However, I've
> hit a snag with one variable named "Ferritin", reproduced
> below for those interested. (n= 362).
>
> My goal is to separate the value form the date.
>
> But note:
> 1. Most have a four digit value
> 2. Many have a date of the test in an inconsistent format
> 3. Many have "U" for unknown
> 4. One has "normal" for a value
> 5. A few have the ">" operator before the value
> 6. One has the units for the measurement
>
>
> I've tried the date() functions, split, and word() and all
> cause some other problem.
>
> This is important because the ferritin is the first of a list
> of similarly (mis)coded variables.
>
> Is there a Stata-esque approach to this, or am I left with
> line-by-line cleanup?
>
> Second question is whether ASSERT can be used to test whether
> a value is a date, integer, string, etc?
>
> Many thanks.
>
> . l ferritin
>
> +-------------------+
> | ferritin |
> |-------------------|
> 1. | 540 11/6/02 |
> 2. | U |
> 348. | U |
> 349. | U |
> 350. | 132 2/20/02 |
> |-------------------|
> 351. | U |
> 352. | 40 3/02 |
> 353. | 97 3/29/02 |
> 354. | 6151 5/22/02 |
> 355. | U |
> |-------------------|
> 356. | 1390 5/8/02 |
> 357. | 953 |
> 358. | 54 8/2/02 |
> 359. | 2779 8/12/02 |
> 360. | U |
> |-------------------|
> 361. | U |
> 362. | U |
> +-------------------+
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/