Title | Splitting a string variable into parts | |
Author | Nicholas J. Cox, Durham University, UK |
I want to split a string variable. The variable case names court cases, and I would like to have separate variables for plaintiff and defendant. They are divided by "V" or "VS" or "V." or "VS.". The number of words on either side of the divide is not constant.
For example:
STAPF V US BUNTEN V CUMBERLAND T ATTY PIPER VS USA MCCAMMON JR V US BOARD OF PAROLES DOE VS. SUE GRABBIT RUNNE
If you have Stata 8 or later, the answer is to type
. split case, p(" V " " VS " " V. " " VS. ")
and you can bail out now, unless you are interested in how to solve the problem from first principles. If you have Stata 7, a previous version of split is available from SSC, and you can bail out now, unless you too want to keep reading.
This is a nice example of how a problem can be easy for people to specify. The challenge is to translate it into Stata. For example, although egen provides several functions for subdividing string variables, this problem, like many others, is best tackled by using the basic string functions.
We need first to find the position of " V " or " VS " or " V. " or " VS." within case. Use the function ustrpos(). (In Stata 8, this function was called index().) ustrpos("string","chars") returns the starting point of the first occurrence of "chars" within "string", and ustrpos(strvar, "chars") does the same for the values of a string variable strvar. Thus ustrpos("frog toad", "o") is 3 because the first occurrence of "o" starts at the 3rd character of "frog toad". If there is no such occurrence, the result is 0. ustrpos("frog","a") is 0, as "a" never occurs within "frog".
So, we look for the position of " V ",
. generate splitat = ustrpos(case," V ")
and, if that doesn't exist, for the position of " VS ",
. replace splitat = ustrpos(case," VS ") if splitat == 0
and, if that doesn't exist, for the position of " V. ",
. replace splitat = ustrpos(case," V. ") if splitat == 0
and, if that doesn't exist, for the position of " VS. ",
. replace splitat = ustrpos(case," VS. ") if splitat == 0
The string we search for includes surrounding spaces. Just searching for "V" would catch any occurrence of "V" within the plaintiff’s name, which we do not want. Also, there is an assumption here that we never get, for example, " V " as part of a plaintiff’s or defendant’s name.
Now it would be worth checking that splitat is never 0:
. list case if splitat == 0
If this is true, we have a problem somewhere, perhaps a typo has occurred, the dividing element was left out, or someone used lowercase. With a few problems, it might be easiest to do small-scale surgery within the editor. In problems with mixtures of case, the functions ustrupper() and ustrlower() come in handy.
That aside, the first variable needed, plaintiff, is now in reach:
. generate str1 plaintiff = "" . replace plaintiff = usubstr(case,1,splitat - 1)
plaintiff is the part of case up to but not including the position at which we split. plaintiff is generated first as an empty str1 variable. We can rely on Stata to work out the appropriate string type when it replaces plaintiff by the desired string. For example, an alternative would be
. generate str80 plaintiff = usubstr(case,1,splitat - 1)
to be followed later by compress. The usubstr() function has three arguments: the string, or string variable, from which we copy a substring; the position of the start of the substring; and the length of the substring to be copied. A period (.) as length means "keep right on to the end of the string".
The second variable, defendant, is (nearly) the rest:
. generate str1 defendant = "" . replace defendant = usubstr(case,splitat + 1,.)
Here the + 1 reflects that we don't want the leading space in " V " or " VS " or " V. " or " VS. " with which all values start. But we still want to strip the "V ", "VS ", "V. ", or "VS. ", and the way to do it is to look for the first space that all these have.
. replace defendant = usubstr(defendant,ustrpos(defendant," ") + 1,.)
Let us look at the arguments to usubstr() more closely. Find the position of the first space within defendant (by construction, it’s the space following "V" or "VS" or "V. " or "VS."). That is,
ustrpos(defendant, " ")
but we want to start after that, so we want as our starting position
ustrpos(defendant, " ") + 1
and we keep right on the end of the string. Below we now have
. l case plaintiff defendant +-----------------------------------------------------------------------+ | case plaintiff defendant | |-----------------------------------------------------------------------| 1. | STAPF V US STAPF US | 2. | BUNTEN V CUMBERLAND T ATTY BUNTEN CUMBERLAND T ATTY | 3. | PIPER VS USA PIPER USA | 4. | MCCAMMON JR V US BOARD OF PAROLES MCCAMMON JR US BOARD OF PAROLES | 5. | DOE VS. SUE GRABBIT RUNNE DOE SUE GRABBIT RUNNE | +-----------------------------------------------------------------------+
In a more extensive dataset, we would check for names that were empty or just contained spaces:
. l plaintiff defendant if ustrtrim(plaintiff) = "" | ustrtrim(defendant) == ""
Strings that are empty or contain one or more spaces all trim to empty.
Here is the code gathered in one place:
. generate splitat = ustrpos(case," V ") . replace splitat = ustrpos(case," VS ") if splitat == 0 . replace splitat = ustrpos(case," V. ") if splitat == 0 . replace splitat = ustrpos(case," VS. ") if splitat == 0 . list case if splitat == 0 . generate str1 plaintiff = "" . replace plaintiff = usubstr(case,1,splitat - 1) . generate str1 defendant = "" . replace defendant = usubstr(case,splitat + 1,.) . replace defendant = usubstr(defendant,ustrpos(defendant," ") + 1,.) . l case plaintiff defendant . l plaintiff defendant if ustrtrim(plaintiff) == "" | ustrtrim(defendant) == ""