Title | Removing leading or trailing zeros from string variables | |
Authors |
Nicholas J. Cox, Durham University, UK Jamie Griffin, Imperial College London, UK |
Suppose you wish to remove leading or trailing zeros from a string variable (or from a global or local macro). To be clear on terminology here, a string may contain zeros in leading positions, such as "0string"; in trailing positions, such as "string00"; in both; or in some intermediate position, such as "string000string". Here string stands for any string containing characters other than zero "0".
This problem is interesting because it provides an opportunity to consider the operation of various string functions. When we say zeros, in principle we could say any other character, except leading and trailing blanks should be removed using ustustrrtrim(), ustrltrim(), or ustrrtrim().
To make the problem as general and as challenging as possible, assume that we do not know in advance how many leading or trailing zeros there are and zeros may also occur in intermediate positions.
However, let us recall what to do if we do know how many leading or trailing zeros occur and zeros do not occur in intermediate positions. As we said, you can use these solutions with some modification to remove other characters. If we knew there was just one leading zero, and no other zero, we could remove it by changing it to an empty string using the usubinstr() function by typing either
. replace myvar = usubinstr(myvar, "0", "", 1)
or
. replace myvar = usubinstr(myvar, "0", "", .)
The last argument of usubinstr() is just the number of instances to be replaced, either the first or all of them, and has the same effect if at most one zero is present. This solution also applies to at most one trailing zero and to no others. However, usubinstr() can remove characters we want to keep.
Another line of attack is to use the usubstr() function. This works best if we know precisely how many characters to remove. Thus typing
. replace myvar = usubstr(myvar, 2, .)
and
. replace myvar = usubstr(myvar, 1, ustrlen(myvar) − 1)
respectively would remove the first and last character only. We could make that conditional on the character in question being "0", as in
. replace myvar = usubstr(myvar, 2, .) if usubstr(myvar,1,1) == "0"
and
. replace myvar = usubstr(myvar, 1, ustrlen(myvar) − 1) if > usubstr(myvar,−1,1) == "0"
which for our problem would do no harm and possibly some good.
However, we have yet to solve the general problem posed above. One idea is to keep removing zeros, so long as we find them, in a loop, as in
. quietly count if usubstr(myvar, 1, 1) == "0" . while r(N) { . replace myvar = usubstr(myvar, 2, .) if usubstr(myvar, 1, 1) == "0" . count if usubstr(myvar, 1, 1) == "0" . }
Here the result of count is left behind in r(N). If that is not zero, i.e., if r(N) is one or more, Stata enters the while loop. We remove the first zero whenever there is one and count again. Stata will exit the loop as soon as r(N) becomes zero. A similar technique could be used to remove an arbitrary number of trailing zeros, testing with usubstr(myvar, −1, 1) == "0" and removing with usubstr(myvar, 1, ustrlen(myvar) − 1).
This flexes a few small programming muscles but otherwise is less attractive than a better way that makes use of another function. indexnot() finds the position of the first character other than the one specified.
. replace myvar = usubstr(myvar, indexnot(myvar, "0"), .)
Note that if myvar did not contain leading zeros, indexnot() would return 1 and myvar would just be replaced by itself. One way to find trailing zeros with this approach is to ustrreverse() the string before and after applying the same logic:
. replace myvar = ustrreverse(usubstr(ustrreverse(myvar), indexnot(ustrreverse(myvar), "0"), .))
Cautious souls will want to see this done step by step:
. generate rev = ustrreverse(myvar) . replace rev = usubstr(rev, indexnot(rev, "0"), .) . replace myvar = ustrreverse(rev) . drop rev