Wanli Zhao <[email protected]> has two questions on string functions:
> 1. Is there some way to increase the limit of string length of 244?
> When I create string from other string variables, I found some
> weird things happen. Now I realize it's due to the length limit of
> string variable. I am using SE 9.2.
>
>
> 2. I asked this before. I have a string variable. One observation is
> like "a, b, f, g, b, a, a, f, g, g". How do I create another
> variable which shows no repeated values, i.e., "a, b, f, g". The
> sequence does not matter.
The answer to the first is no, with a proviso: you can use Mata to work
around the 244-character limit so long as, Statawise, the inputs and
outputs are no longer than 244 characters.
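To illustrate that proviso, here is a minimal sketch (not part of the original exchange) of a Mata string scalar that is much longer than 244 characters. The variable name big is made up for the illustration, and the sketch assumes Mata's string duplication operator (a real times a string). Only the short final result would ever need to go back into a Stata variable.

```mata
mata:
// build a 600-character string scalar (100 copies of a 6-character piece),
// far beyond the 244-character limit on Stata string variables
big = 100*"a b c "
strlen(big)

// reduce it to its unique tokens; only something this short would be
// stored back in a Stata variable
rows(uniqrows(tokens(big)'))
end
```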
I assume there are a number of answers to the second question. The answer
I'm going to show uses Mata, mainly to show how one might go into and out
of Mata to solve a string problem.
So let's assume we have Stata variable -replies- containing strings
like "a, b, f, g, b, a, a, f, g, g" (order not significant).
I just made the following example dataset:
. list

     +-------------------------------+
     |                       replies |
     |-------------------------------|
  1. | a , b, f, g, b, a, a, f, g, g |
  2. |                               |
  3. |                       b, a, b |
     +-------------------------------+
The first thing I want to do is get rid of the commas. This can be done
in Stata, or in the midst of our Mata code. I'm going to do it in Stata:
. replace replies = subinstr(replies, ",", "", .)
. list

     +----------------------+
     |              replies |
     |----------------------|
  1. | a  b f g b a a f g g |
  2. |                      |
  3. |                b a b |
     +----------------------+
Here is my solution. First, I created a do-file to contain my Mata code:
--------------------------------------------------- mymatacode.do ---
mata:
mata clear

// Replace each observation of string variable varname with its
// unique space-separated tokens.
void fixvar(string scalar varname)
{
        string colvector  data
        real scalar       i

        st_sview(data, ., varname)
        for (i=1; i<=rows(data); i++) {
                data[i] = myinvtokens( uniqrows(tokens(data[i])') )
        }
}

// Join the elements of s into one string, separated by single blanks.
string scalar myinvtokens(string vector s)
{
        string scalar  result
        real scalar    i

        if (length(s)) {
                result = s[1]
                for (i=2; i<=length(s); i++) {
                        result = result + " " + s[i]
                }
        }
        return(result)
}
end
--------------------------------------------------- mymatacode.do ---
With that do-file written, I typed,
. do mymatacode
<output omitted>
. mata: fixvar("replies")
. list

     +---------+
     | replies |
     |---------|
  1. | a b f g |
  2. |         |
  3. |     a b |
     +---------+
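As mentioned earlier, the commas could also have been dealt with inside the Mata code rather than in Stata. What follows is only a sketch of that variation: the name fixvar2() is made up for the illustration, and it assumes myinvtokens() from the do-file above has already been defined. Each comma is turned into a blank before tokenizing, so an entry such as "a,b" would still split into two tokens.

```mata
mata:
void fixvar2(string scalar varname)
{
        string colvector  data
        real scalar       i

        st_sview(data, ., varname)
        for (i=1; i<=rows(data); i++) {
                // turn every comma into a blank, then proceed as in fixvar()
                data[i] = myinvtokens(
                        uniqrows(tokens(subinstr(data[i], ",", " "))') )
        }
}
end
```

With this version, one would type -mata: fixvar2("replies")- on the original data and skip the -replace- step.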
I apologize for all the code above. Routine myinvtokens() would be
unnecessary if you have Ben Jann's MF_INVTOKENS installed and, really,
Mata should have had an -invtokens()- function all along.
The routine that's important above is
    void fixvar(string scalar varname)
    {
            string colvector  data
            real scalar       i

            st_sview(data, ., varname)
            for (i=1; i<=rows(data); i++) {
                    data[i] = myinvtokens( uniqrows(tokens(data[i])') )
            }
    }
and, as always, I emphasize you could have omitted the declarations:
    void fixvar(varname)
    {
            st_sview(data, ., varname)
            for (i=1; i<=rows(data); i++) {
                    data[i] = myinvtokens( uniqrows(tokens(data[i])') )
            }
    }
I include the declarations because I'm hoping that will help you understand
the program. Maybe it would have been better had I been a bit more verbose
in my code,
    void fixvar(varname)
    {
            st_sview(data, ., varname)
            for (i=1; i<=rows(data); i++) {
                    orig      = data[i]
                    origasvec = tokens(orig)
                    uniqorig  = uniqrows(origasvec')
                    data[i]   = myinvtokens(uniqorig)
            }
    }
Anyway, data[] is a view onto varname, which will be "replies".
data[i] is thus the i-th observation of replies.
tokens(data[i]) changes "a b a" into the row vector ("a", "b", "a").
Next I use function uniqrows(). There is no -uniqcols()- function, so I
transpose the argument tokens(data[i]): uniqrows(tokens(data[i])').
Now I have the column vector ("a" \ "b"). uniqrows() also sorts its
result, which is fine here because the order does not matter. I put that
back into a scalar as "a b", and replace data[i].
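The steps just described can be traced interactively on a literal string. This is only an illustrative sketch; the variable names follow the verbose version of fixvar() above, and nothing here touches the dataset.

```mata
mata:
orig      = "a b a"
origasvec = tokens(orig)           // 1 x 3 row vector ("a", "b", "a")
uniqorig  = uniqrows(origasvec')   // 2 x 1 column vector "a" \ "b", sorted

// typing a name by itself displays its contents
origasvec
uniqorig
end
```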
In the above, I didn't really need to make -fixvar()- a program. I could
have done it interactively, something like
--------------------------------------------------- mymatacode.do ---
mata:
mata clear
function myinvtokens(s)
{
        // without declarations, result must be set explicitly so that an
        // empty observation yields "" rather than an unset value
        result = ""
        if (length(s)) {
                result = s[1]
                for (i=2; i<=length(s); i++) {
                        result = result + " " + s[i]
                }
        }
        return(result)
}
st_sview(data=., ., "replies")
for (i=1; i<=rows(data); i++) {
        data[i] = myinvtokens( uniqrows(tokens(data[i])') )
}
end
--------------------------------------------------- mymatacode.do ---
Now, if I had Ben Jann's -invtokens()- function, I could have used that.
I assume Ben's -invtokens()- requires a row vector as an argument, and I
have a column vector, so I add a transpose to my code:
--------------------------------------------------- mymatacode.do ---
mata:
st_sview(data=., ., "replies")
for (i=1; i<=rows(data); i++) {
        data[i] = invtokens( uniqrows(tokens(data[i])')' )
}
end
--------------------------------------------------- mymatacode.do ---
That, really, is the gist of the solution.
-- Bill
[email protected]