
Highlights

Regular expressions are powerful tools for working with string data. In Stata 18, Stata's regular expression functions support more features than before.

Overview

Regular expressions are used for

  • data validation, for example, to check whether a phone number is well formed;

  • data extraction, for example, to extract phone numbers from a string; and

  • data transformation, for example, to normalize different phone number inputs.

Stata provides two sets of regular expression functions: byte-stream-based regexm(), regexr(), and regexs(); and Unicode-based ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs(). Unicode-based regular expression functions are built on top of ICU libraries.
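The practical difference between the two sets shows up with non-ASCII text. The following is a minimal sketch, assuming UTF-8 encoding; the string "café" is an illustrative input, not part of the example later on this page. Because é occupies two bytes, the single-character wildcard is expected to behave differently in the two function families.

```stata
* Illustrative input only: "é" is stored as two bytes in UTF-8.
* Byte-stream-based function: "." is expected to consume one byte,
* so the anchored pattern should not match.
display regexm("café", "^caf.$")

* Unicode-based function: "." consumes one Unicode character,
* so the anchored pattern should match.
display ustrregexm("café", "^caf.$")
```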

In Stata 18, the byte-stream-based functions now use the Boost library as their engine. The functions are under version control: prefixing a command with version 17: retains the old behavior.

A good discussion of regular expressions in Stata can be found in Asjad Naqvi's Stata guide.

The old implementation is based on Henry Spencer's NFA algorithm and is nearly identical to the POSIX.2 standard. The new implementation in Stata 18 has more features. For example, the new implementation supports {n} for matching a regular expression exactly n times:

. display regexm("123", "\d{3}")
1

. version 17: display regexm("123", "\d{3}")
0

A set of new functions that use the Boost library exclusively has also been added:

  • regexmatch() performs a match of a regular expression to an ASCII string.

  • regexreplace() replaces the first substring that matches a regular expression with specified text.

  • regexreplaceall() replaces all substrings that match a regular expression with specified text.

  • regexcapture() returns a subexpression from a previous match.

  • regexcapturenamed() returns a subexpression corresponding to a matching named group in a regular expression from a previous match.
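As a rough sketch of how two of these functions might be used interactively, the commands below replace every match in a string and then retrieve a named group. The input strings and the group name num are hypothetical, and the (?<name>...) named-group syntax is assumed from Boost's Perl-style syntax rather than taken from this page.

```stata
* Illustrative strings only; they are not taken from the example dataset.
* Replace every hyphen, not just the first one.
display regexreplaceall("2023-01-15", "-", "/")

* Name the capture group "num" (hypothetical name), then retrieve it
* by name after a successful match.
display regexmatch("tel:555-0100", "tel:(?<num>[0-9\-]+)")
display regexcapturenamed("num")
```

Named groups can make a long pattern easier to maintain than numbered groups, because adding another group does not shift the references.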

Let's see it work

We would like to match and extract phone numbers in the addresses of heads of governments.

We require the following rules:

  • The phone number follows “Phone:” or “tel:”.

  • It may start with “+”.

  • After “+” or at the start, it has 1 to 3 nonzero digits.

  • After that, it has 7 to 32 characters, each of which is a digit, a space, or “-”.

If the address matches, we would like to generate a variable, phone, containing the extracted phone number without the leading “Phone:” or “tel:”.

We would also like to generate another variable, address1, in which the matched text is replaced with “tel:” followed by the extracted phone number.

. input str120 address

                      address
  1. "1600 Pennsylvania Ave., NW Washington, DC 20500 tel:1-202-456-1414 USA"
  2. "Palais de l'Élysée 55 rue du Faubourg-Saint-Honoré 75008 Paris, Phone:+33 1 42 92 81 00 France"
  3. "10 Downing Street, SW1A 2AA +44-20-7925-0918 United kingdom"
  4. "東京都千代田区永田町2丁目3番1号 100-0014, Phone: +81 3-3581-0101, Japan"
  5. end

. local reg "(?:Phone\:[\s]*?|tel\:[\s]*)([\+]{0, 1}[1-9]{1, 3}[0-9\s\-]{7,32})"

. generate match = regexmatch(address, "`reg'")

. generate address1 = regexreplace(address, "`reg'", "tel:$1")

. generate phone = regexcapture(1) if regexmatch(address, "`reg'")
(1 missing value generated)

. list phone

      phone
  1.  1-202-456-1414
  2.  +33 1 42 92 81 00
  3.
  4.  +81 3-3581-0101

Components of the regular expression in the local macro reg are as follows:

  • (?:Phone\:[\s]*?|tel\:[\s]*) matches either “Phone:” or “tel:”, followed by zero or more whitespace characters, without capturing the match.

  • ([\+]{0, 1}[1-9]{1, 3}[0-9\s\-]{7,32}) matches and captures text that satisfies the following:

    • [\+]{0, 1} allows the number to start with an optional “+”.

    • [1-9]{1, 3} requires 1 to 3 nonzero digits after “+” or at the start.

    • [0-9\s\-]{7,32} then requires 7 to 32 characters, each of which is a digit, a whitespace character, or “-”.

The third address contains neither “Phone:” nor “tel:”, so it does not match the regular expression, and phone is missing for this observation.
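One possible follow-up step, sketched here under the assumption that it runs in the same session with the variables created above: because the final character class admits whitespace, the captured text can end with a space when the number is followed by one (as before “USA” in the first address). Trimming with strtrim() and listing the other generated variables lets you inspect the result.

```stata
* Assumes phone, match, and address1 exist from the commands above.
* Remove any leading or trailing blanks picked up by the capture group.
replace phone = strtrim(phone)

* Inspect the match indicator and the rewritten address alongside phone.
list match phone address1
```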
