Eric (Ric) Uslaner asked (edited slightly):
> I have a string variable problem. One variable in a dataset is
> composed of both the name of a country and the year of a survey,
> such as:
>
> Albania2002
> Albania2005
> Serbia&Montenegro2002
> Serbia&Montenegro2005
>
> I want to drop the last four digits. (There is already a variable
> called -year-.) I figured out how to put the last four digits into
> another variable with -substr()- but cannot figure out how to keep
> country names (which vary in length) and drop the year.
Solutions were suggested by Kit Baum, Pete Huckelba, Svend Juul, and
Carole J. Wilson.
Let's assume Ric's variable is called -id-.
How do you try to solve a problem like this?
Step 0: Diagnosis
=================
What kind of a problem is this? Will there be a straightforward
solution using existing Stata functionality or will someone (perhaps
you) need to write a program?
This is a problem in data management with string variables.
Ric's guess (presumably) was that it is not esoteric: no
program should be needed and existing functionality should suffice.
That was correct. Such a guess narrows it down to something documented
in [U] or [D].
Ric wants to omit the last 4 characters specifying year.
The twist that gives the problem a little extra spin is the
irregular length of the string. If the string were always (say)
16 characters long, then the solution would be immediate:
-substr(id, 1, 12)-.
Step 1: -generate-?
===================
We just want to extract some of the characters from a string variable
and make a new variable, or equivalently to omit the other characters.
That should suggest a -generate- command. Even though Ric
wants to throw away some of the characters, using -replace- would be worse
style. You might mess up your data if you get a -replace- wrong, or
(at least in other loosely related problems) you might change
your mind later and want to use what you just threw away.
The immediate question is thus how to specify the rule for the right-hand side
in a -generate- solution.
A further question is whether there are other ways to do it.
We'll get to that in a moment.
Step 2: functions?
==================
I typically consider next whether any existing functions
will do the job. Functions fall into two classes, those you
know you want to use and those you don't know you should use.
What is crucial is that many jobs require two or more
functions working together.
I suspect that many people skim through the list of
functions and are often then disappointed that nothing
matches their problem exactly. There is no function
that omits the last # characters. It would be easy
enough for StataCorp to double the number of functions
by adding many more, including that one, but that would
not really double versatility, just complexity. The
toolkit philosophy is to provide tools that individually
do one thing, but in conjunction can solve a larger
variety of problems.
-substr()- and -length()-
-------------------------
Kit Baum and Svend Juul both suggested using -substr()-
together with -length()-. Svend's solution is on
these lines:
gen length = length(id)
gen newid = substr(id, 1, length - 4)
Kit's solution is on these lines:
gen newid = substr(id, 1, length(id) - 4)
These are really the same solution. Svend does
it in two steps, Kit in one. If you find Svend's
solution clearer, go with it. The main cost is just another
variable that you probably don't care about
otherwise. You could -drop- it once it has
outlived its usefulness.
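As a minimal sketch, here is Kit's one-line version run on a toy
dataset built from Ric's four example values. The -input- block and
the -str25- width are assumptions for illustration, not Ric's actual
data. The -assert- lines check that every value is long enough to
lose four characters and that the result is what we expect.

```stata
* toy data: an assumption for illustration, not Ric's real dataset
clear
input str25 id
"Albania2002"
"Albania2005"
"Serbia&Montenegro2002"
"Serbia&Montenegro2005"
end

* guard: every value must be longer than the 4-character year
assert length(id) > 4

* keep everything except the last 4 characters
gen newid = substr(id, 1, length(id) - 4)

* check against the country names we expect
assert newid == "Albania" in 1/2
assert newid == "Serbia&Montenegro" in 3/4

list id newid
```

If the first -assert- fails on real data, some values are too short
to hold a name plus a year and deserve a look before anything is cut.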
In more detail:
* -length(strvar)- returns the length of the
contents of a string variable -strvar-.
(-length("strvar")- would give you the length
of the _name_ of the variable, in this case 6.)
* -gen length = length(id)- gives a new
variable. -id- has different lengths,
as Ric pointed out, but this is not a problem.
The value of -length- will (literally) vary,
accordingly.
* -length(id)- is going to include all string
characters, including any leading and trailing
blanks. If necessary, find out about -trim()-.
Here's a give-away:
gen newid = substr(trim(id), 1, length(trim(id)) - 4)
takes care of such blanks also.
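The points about names versus contents and about blanks can be
checked directly with -display-. The literal strings here are made up
for illustration. The three results should be 6, 12 and 11: the name
"strvar" has six characters, and the trailing blank counts until
-trim()- removes it.

```stata
* length of the literal string "strvar", i.e. of the name, not the contents
display length("strvar")

* a trailing blank counts toward the length ...
display length("Albania2002 ")

* ... unless trim() removes it first
display length(trim("Albania2002 "))
```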
Note that if Ric had wanted -year- as well, then
that would be
gen year = substr(id, length(id) - 3, 4)
with the same caveat about blanks.
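-substr()- returns a string, so a year extracted this way is a
string variable. If a numeric year were wanted, -real()- would
convert it. The sketch below also uses a negative start position,
which -substr()- interprets as counting back from the end of the
string. The toy data and the names -yearstr- and -yearnum- are
hypothetical, chosen to avoid clashing with Ric's existing -year-.

```stata
* toy data: an assumption for illustration
clear
input str25 id
"Albania2002"
"Serbia&Montenegro2005"
end

* last 4 characters, located by counting from the front ...
gen yearstr = substr(id, length(id) - 3, 4)

* ... or from the end: a negative start counts back from the last character
assert yearstr == substr(id, -4, 4)

* convert the string year to a number
gen yearnum = real(yearstr)
assert yearnum == 2002 in 1
assert yearnum == 2005 in 2

list id yearstr yearnum
```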
-reverse()- and -substr()-
--------------------------
Pete Huckelba's solution was along these lines:
gen newid = reverse(substr(reverse(id), 5, .))
This is another "it takes two to tango"
solution. If we -reverse()- a string, then
the last character becomes the first. Chop
off the first four characters, previously
the last four, and then -reverse()- the
reversed string to get back to where you want.
Possible problems with blanks would be dealt
with in the same way:
gen newid = reverse(substr(reverse(trim(id)), 5, .))
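A quick way to convince yourself that the two approaches agree is to
compute both and -assert- that they match. This is only a sketch on
toy data. The variable names -via_length- and -via_reverse- are
hypothetical, and one value is deliberately padded with a trailing
blank to mimic messy data.

```stata
* toy data: an assumption for illustration
clear
input str25 id
"Albania2002"
"Serbia&Montenegro2005"
end

* add a trailing blank to one value to mimic messy data
replace id = id + " " in 1

* the same job done two ways, both trimming blanks first
gen via_length  = substr(trim(id), 1, length(trim(id)) - 4)
gen via_reverse = reverse(substr(reverse(trim(id)), 5, .))

* the two routes should give identical results
assert via_length == via_reverse

list id via_length via_reverse
```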
Step 3: Consider other commands
===============================
The problem is solved, but knowing about other
possible solutions is also worthwhile.
Carole Wilson suggested the use of -egen, ends()-.
Her solution is along these lines:
If all years begin with "2":
egen newid = ends(id), punct(2) head
If you have dates from last century:
egen newid = ends(id), punct(1) head
However, this solution is problematic. You
are making an assumption that characters
like "1" and "2" do not occur as part of
country names. Now I am a geographer and
some people expect me to know about such
things, but I wouldn't want to rule out
such a possibility. More importantly, it
seems very likely that some years in Ric's
data begin with "1" and some with "2",
and no single choice of -punct()- copes with
that situation: -punct(2)- would cut "Albania1992"
at its final "2", leaving "Albania199".
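A small sketch of that difficulty, on made-up values that mix the two
centuries (an assumption, not Ric's data). As I read the
documentation, -head- keeps whatever precedes the first occurrence of
the -punct()- string, or the whole string if it does not occur, so
each -egen- call should get one of the two values wrong. The names
-head1- and -head2- are hypothetical.

```stata
* toy data mixing 19xx and 20xx years: an assumption for illustration
clear
input str25 id
"Albania1992"
"Albania2002"
end

* split at the first "2": fine for 2002, but 1992 is cut at its final digit
egen head2 = ends(id), punct(2) head

* split at the first "1": fine for 1992, but 2002 has no "1" to split on
egen head1 = ends(id), punct(1) head

* the functions solution does not care what the year digits are
gen newid = substr(id, 1, length(id) - 4)
assert newid == "Albania"

list id head2 head1 newid
```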
Much the same applies to -split-, which no
one mentioned. It was not really designed
for Ric's kind of problem, and although
it would be more help than nothing,
a solution with functions seems to me much
better.
Nick