Dear Friedrich, Phil and Nick:
Thank you all very much for your help!
Mingfeng
On Wed, Nov 5, 2008 at 7:56 AM, Nick Cox <[email protected]> wrote:
> I think Phil is correct so far as official Stata is concerned.
>
> But there are -egen- functions -noccur()- and -nss()- in -egenmore- from
> SSC.
>
> The help explains:
>
> ===================
> noccur(strvar) , string(substr) creates a variable containing the number
> of occurrences of the string substr in string variable strvar. Note
> that occurrences must be disjoint (non-overlapping): thus there are two
> occurrences of "aa" within "aaaaa". (Stata 7 required.)
>
> nss(strvar) , find(substr) [ insensitive ] returns the number of
> occurrences of substr within the string variable strvar. insensitive
> makes counting case-insensitive. (Stata 6 required.)
>
> The inclusion of noccur() and nss(), two almost identical functions, was
> an act of sheer inadvertence by the maintainer.
> =================
>
> These functions both predate regular expression syntax in Stata, but I
> don't think the latter helps much, if at all, with this particular
> problem. It's certainly not essential, as Phil's solution also
> indicates.
>
> Use -ssc inst egenmore- to install, and then -help egenmore-.
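
As a minimal sketch of the -egenmore- route described above (the sample data are made up for illustration; the syntax follows the help text quoted in this message, and X and Y are the names used in the original question):

```stata
* install once from SSC
ssc install egenmore

* hypothetical test data
clear
input str20 X
"johnabc johncd"
"no match here"
end

* count disjoint occurrences of "john" in X
egen Y = noccur(X), string(john)
list X Y
```

Going by the quoted help, the first observation should come out as Y = 2 and the second as Y = 0.
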
>
> Nick
> [email protected]
>
> Phil Schumm
>
> No, I don't believe so. There are two ways to approach this: (1)
> compute the number of occurrences for each observation and then loop
> over observations, or (2) proceed one occurrence at a time, handling
> all observations at once. The first approach would in general be more
> efficient if the variance in the number of occurrences were large;
> note that it would need to be done in Mata for it to scale well in the
> number of observations. However, the fact that string variables can
> only be 244 characters long imposes an upper bound on the maximum
> number of occurrences (and therefore on the variance), and, in many
> situations, the effective upper bound may be pretty small (i.e., at
> most only a couple of occurrences per observation). In such cases,
> the second approach would be adequate, e.g.,
>
> tempvar t1 t2
> gen `t1' = X
> gen `t2' = X
> gen Y = 0
> qui while 1 {
>     replace `t1' = subinstr(`t1', "john", "", 1)
>     cap ass `t1'==`t2'
>     if _rc {
>         replace Y = Y + (`t1'!=`t2')
>         replace `t2' = `t1'
>     }
>     else continue, br
> }
>
> where -regexr()- can be substituted for -subinstr()- if additional
> flexibility in matching is required.
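
For a literal substring, a loop-free alternative is possible. This is only a sketch and was not proposed in the thread: strip every occurrence at once with -subinstr()-, then divide the drop in length by the length of the substring. The sample data are made up; X and Y follow the original question.

```stata
* hypothetical test data
clear
input str20 X
"johnabc johncd"
"no match here"
"john"
end

* "." as the fourth argument of subinstr() means "replace all occurrences"
gen Y = (length(X) - length(subinstr(X, "john", "", .))) / length("john")
list X Y
```

For the first observation this gives (14 - 6)/4 = 2, matching the example in the question. Like -noccur()-, it counts disjoint occurrences only, and it does not extend to -regexr()- patterns, where matches can differ in length.
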
>
> On Nov 4, 2008, at 8:42 PM, Mingfeng Lin wrote:
>
>> I looked through the list of string functions but couldn't find one
>> that fits the bill. Suppose I have a string variable X, and I would
>> like to generate a new numeric variable Y containing the number of
>> times a certain string appeared in X. For instance
>>
>> X = "johnabc johncd"
>>
>> If I'd like to find the number of times "john" shows up in X, I hope
>> to obtain Y = 2
>>
>> Is there a function in Stata to do this?
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>