Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: string function
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: string function
Date: Wed, 24 Aug 2011 12:19:29 +0100
As stressed in that SJ Tip, for substrings longer than one character you
need to divide by the length of the substring:
(length("abcdaf") - length(subinstr("abcdaf", "abc", "", .))) / length("abc")
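As a minimal sketch of the same calculation applied to a variable rather than a literal string: the toy data, the variable names s and n_sub, and the local macro sub are made up for illustration.

```stata
* Toy data; s, n_sub and the local sub are illustrative names only
clear
input str10 s
"abcdaf"
"abcabc"
"xyz"
end

local sub "abc"

* Characters removed by deleting every occurrence, divided by the
* length of the substring, gives the number of occurrences
generate n_sub = (length(s) - length(subinstr(s, "`sub'", "", .))) / length("`sub'")

list
```

With a one-character substring the division is by 1, which is why the simpler expression without division works in that case.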
See also -moss- (SSC)
Title
moss -- Find multiple occurrences of substrings
Syntax
moss strvar [if] [in], match(["]pattern["]) [ regex prefix(prefix)
suffix(suffix) maximum(#) compact ]
Description
moss finds occurrences of substrings matching a pattern in a given string
variable. Depending on what is sought and what is found, variables are
created giving the count of occurrences (always); the positions of
occurrences (whenever any are found); and the exact substrings found (when
a regular expression defines a subexpression to be returned). The default
names are respectively _count, _pos1 up, and _match1 up.
Remarks
By default, moss finds repeated occurrences of the string specified in
match() using Stata's strpos() string function (in older versions of
Stata, strpos() was named index()). A _count variable is created to
indicate the number of occurrences per observation. The position, per
observation, of the first instance will be recorded in _pos1, the second
in _pos2, and so on.
With the regex option, moss can be used to repeatedly find more complex
patterns of text. The specification of the search pattern must follow
regexm() syntax and include one and only one subexpression to be matched.
When using regular expressions, subexpressions are identified using
parentheses. For example, match("AMC ([A-Za-z]+)") will match
"AMC Concord", "AMC Pacer", and "AMC Spirit", but moss will put in _match1
the matched subexpressions "Concord", "Pacer", and "Spirit".
moss follows the principle that occurrences must be disjoint and may not
overlap. That is, it finds just one occurrence of "ana" in "banana", not
two.
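The subinstr() count given at the top of this post follows the same non-overlapping convention, as far as I can tell, because subinstr() removes occurrences from left to right. A one-line check using only official functions:

```stata
* Deleting "ana" from "banana" leaves "bna", so 3 characters are removed
* and the count is 3 / 3 = 1, not 2
display (length("banana") - length(subinstr("banana", "ana", "", .))) / length("ana")
```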
Options
match() is required and the pattern can be either literal text or a
regular expression.
regex specifies that the pattern is to be interpreted as a regular
expression. Such a pattern must contain precisely one subexpression to be
extracted. See Examples.
prefix() specifies an alternative prefix for new variable names to be
created by moss. Such a prefix must start either with a letter or with an
underscore.
suffix() specifies a suffix for new variable names to be created. prefix()
and suffix() may not be combined.
maximum() specifies an upper limit to the number of position and match
variables to be created. That is, specify max(3) if you want to see
details of at most the first 3 occurrences of your pattern.
compact specifies that the most compact storage types possible be used
during calculations. Specifying this option may slow moss down.
Examples
. moss make, match(",")
. moss make, match("([0-9]+)") regex
. moss history, match("(X+)") regex
. moss s, match("([^ ]+)") prefix(s_) regex
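A sketch of the AMC example from the Remarks, run on the auto dataset shipped with Stata. It assumes moss has already been installed from SSC (ssc install moss) and that the default variable names described above are used.

```stata
* Assumes moss is installed: ssc install moss
sysuse auto, clear

* One subexpression in parentheses: the model name following "AMC "
moss make, match("AMC ([A-Za-z]+)") regex

* _count is always created; _pos1 and _match1 exist because matches occur
list make _count _pos1 _match1 if _count > 0
```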
Authors
Robert Picard
[email protected]
Nicholas J. Cox, Durham University
[email protected]
Acknowledgments
A question on Statalist from Rebecca A. Pope was the stimulus for writing
this program.
On Wed, Aug 24, 2011 at 11:59 AM, Nick Cox <[email protected]> wrote:
> Solutions to all these could be written as -egen- functions or Mata functions.
>
> Here I focus on "official Stata only" solutions.
>
> First question is discussed in
>
> Nicholas J. Cox
> Stata tip 98: Counting substrings within strings
> The Stata Journal 11(2): 318-320
>
> length("abcdaf") - length(subinstr("abcdaf", "a", "", .))
>
> Last two questions
>
> any of "a", "b", "c"
>
> max(strpos("abcdaf","a"), strpos("abcdaf", "b"), strpos("abcdaf", "c")) > 0
>
> all of "a", "b", "c"
>
> min(strpos("abcdaf","a"), strpos("abcdaf", "b"), strpos("abcdaf", "c")) > 0
>
> If you had a long list of candidates, I would do something like this:
>
> gen found = 0
>
> qui foreach letter in s o m e t h i n g {
> replace found = max(found, strpos(strvar, "`letter'") > 0)
> }
>
> where for "all" rather than "any", substitute "min" for "max" and
> initialise found to 1 rather than 0.
>
> The mapping max <-> any, min <-> all is discussed in
> http://www.stata.com/support/faqs/data/anyall.html
>
> Nick
>
> 2011/8/24 Grace Jessie <[email protected]>:
>
>> How to count how many times a substring appears in a string?
>> For example,
>> function("abcdaf","a")=2
>>
>> And how can I check whether a string variable contains certain substrings?
>> On this, I want to ask about two functions.
>> For example,
>> function("abcdaf","a","b","c")
>> One should return 1 if a or b or c is included in "abcdaf";
>> the other should return 1 if a, b and c are all included in "abcdaf".
>> Could anyone tell me the correct functions for those above?
>
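The any/all loop in the quoted reply can be sketched on a toy dataset as follows. The data and the variable names any_found and all_found are invented for illustration; note that the "all" indicator starts at 1, so that min() can only pull it down.

```stata
* Toy data; variable names are illustrative only
clear
input str10 strvar
"abcdaf"
"xyz"
"ab"
end

generate byte any_found = 0    // any: start at 0, max() can only raise it
generate byte all_found = 1    // all: start at 1, min() can only lower it

foreach letter in a b c {
    quietly replace any_found = max(any_found, strpos(strvar, "`letter'") > 0)
    quietly replace all_found = min(all_found, strpos(strvar, "`letter'") > 0)
}

list
```

Here "abcdaf" should score 1 on both indicators, "xyz" 0 on both, and "ab" 1 for any but 0 for all.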