Re: st: counting the number of times a string appears in a string variable?
From: Phil Schumm <[email protected]>
To: [email protected]
Subject: Re: st: counting the number of times a string appears in a string variable?
Date: Wed, 5 Nov 2008 02:05:31 -0600

On Nov 4, 2008, at 8:42 PM, Mingfeng Lin wrote:
> I looked through the list of string functions but couldn't find one
> that fits the bill. Suppose I have a string variable X, and I would
> like to generate a new numeric variable Y containing the number of
> times a certain string appeared in X. For instance
>
>     X = "johnabc johncd"
>
> If I'd like to find the number of times "john" shows up in X, I hope
> to obtain Y = 2
>
> Is there a function in Stata to do this?

No, I don't believe so. There are two ways to approach this: (1)
loop over observations, counting the occurrences within each one, or
(2) proceed one occurrence at a time, handling all observations at
once. The first approach would in general be more efficient if the
variance in the number of occurrences were large; note that it would
need to be done in Mata for it to scale well in the number of
observations. However, the fact that string variables can only be
244 characters long imposes an upper bound on the number of
occurrences (and therefore on the variance), and, in many situations,
the effective upper bound may be pretty small (i.e., at most only a
couple of occurrences per observation). In such cases, the second
approach would be adequate, e.g.,
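
    tempvar t1 t2
    gen `t1' = X      // working copy, shortened on each pass
    gen `t2' = X      // value of the working copy before the current pass
    gen Y = 0
    qui while 1 {
        // remove at most one occurrence of "john" from every observation
        replace `t1' = subinstr(`t1', "john", "", 1)
        // the assert fails (_rc != 0) if any observation changed
        capture assert `t1'==`t2'
        if _rc {
            // add 1 to Y only where an occurrence was removed
            replace Y = Y + (`t1'!=`t2')
            replace `t2' = `t1'
        }
        else continue, break
    }
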
where -regexr()- can be substituted for -subinstr()- if additional
flexibility in matching is required.
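
The first approach is only described above, not shown. A minimal Mata sketch of it could look like the following. It assumes a non-empty search string and counts non-overlapping matches; the function name count_occ() is hypothetical, not something built into Stata or Mata, and the sketch is untuned.

```stata
mata:
// count_occ() is a hypothetical helper, not part of Stata or Mata.
// Returns, for each row of x, the number of non-overlapping
// occurrences of s. Assumes s is not the empty string.
real colvector count_occ(string colvector x, string scalar s)
{
    real colvector n
    real scalar    i, p
    string scalar  rest

    n = J(rows(x), 1, 0)
    for (i = 1; i <= rows(x); i++) {
        rest = x[i]
        p = strpos(rest, s)
        while (p > 0) {
            n[i] = n[i] + 1
            // keep only the text after the match just counted
            rest = substr(rest, p + strlen(s), .)
            p = strpos(rest, s)
        }
    }
    return(n)
}
end

* Y must exist before st_store() can fill it
gen Y = .
mata: st_store(., "Y", count_occ(st_sdata(., "X"), "john"))
```

For X = "johnabc johncd" this stores Y = 2. If Y was already created by the loop above, drop it first or store into a differently named variable.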
-- Phil