ak.str.extract_regex

Defined in awkward.operations.str.akstr_extract_regex on line 13.

ak.str.extract_regex(array, pattern, *, highlevel=True, behavior=None)
Parameters
  • array – Array-like data (anything ak.to_layout recognizes).

  • pattern (str or bytes) – Regular expression with named capture fields.

  • highlevel (bool) – If True, return an ak.Array; otherwise, return a low-level ak.contents.Content subclass.

  • behavior (None or dict) – Custom ak.behavior for the output array, if high-level.

For each string in array, returns None if the string does not match pattern; otherwise, returns a record whose fields are the named capture groups and whose contents are the substrings they captured.

Matching uses Google RE2, and pattern must contain named groups. The syntax for a named group is (?P<...>...), in which the first ... is the group's name and the last ... is the regular expression it captures.

For example,

>>> array = ak.Array([["one1", "two2", "three3"], [], ["four4", "five5"]])
>>> result = ak.str.extract_regex(array, "(?P<vowel>[aeiou])(?P<number>[0-9]+)")
>>> result.show(type=True)
type: 3 * var * ?{
    vowel: ?string,
    number: ?string
}
[[{vowel: 'e', number: '1'}, {vowel: 'o', number: '2'}, {vowel: 'e', number: '3'}],
 [],
 [None, {vowel: 'e', number: '5'}]]

(The string "four4" does not match because the vowel is not immediately before the number.)
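The per-string behavior in the example above can be approximated in pure Python. This is only a sketch: it uses the standard-library re module rather than RE2, it assumes the pattern may match anywhere in the string (as the example output suggests), and the helper name extract_named_groups is hypothetical, not part of Awkward Array.

```python
import re


def extract_named_groups(strings, pattern):
    """Return one dict of named captures per string, or None where nothing matches.

    Hypothetical helper for illustration only; it is not part of Awkward Array.
    """
    compiled = re.compile(pattern)
    results = []
    for s in strings:
        # search() looks for the first match anywhere in the string
        match = compiled.search(s)
        results.append(match.groupdict() if match else None)
    return results


pattern = r"(?P<vowel>[aeiou])(?P<number>[0-9]+)"
nested = [["one1", "two2", "three3"], [], ["four4", "five5"]]

# Apply the helper to each inner list to mirror the nested structure
emulated = [extract_named_groups(sublist, pattern) for sublist in nested]
print(emulated)
```

Each inner list keeps its length, the empty list stays empty, and "four4" maps to None, mirroring the structure of the result shown above.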

Regular expressions with unnamed groups or features not implemented by RE2 raise an error.
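Because unnamed groups raise an error, it can be convenient to check a pattern before passing it in. The sketch below does so with Python's re module as a stand-in. RE2 and re do not accept exactly the same syntax, so this is an approximate pre-check rather than a guarantee, and has_only_named_groups is a hypothetical helper name. It also treats a pattern with no capture groups as unsuitable, on the assumption that pattern must contain at least one named group.

```python
import re


def has_only_named_groups(pattern):
    """True if the pattern has at least one capture group and all of them are named.

    Hypothetical pre-check using Python's re, not RE2 itself.
    """
    compiled = re.compile(pattern)
    # compiled.groups counts every capturing group;
    # compiled.groupindex contains only the named ones.
    return compiled.groups > 0 and compiled.groups == len(compiled.groupindex)


print(has_only_named_groups(r"(?P<vowel>[aeiou])(?P<number>[0-9]+)"))
print(has_only_named_groups(r"([aeiou])(?P<number>[0-9]+)"))
```

Non-capturing groups such as (?:...) are not counted by compiled.groups, so they do not cause the check to fail.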

Note: this function does not raise an error if the array does not contain any string or bytestring data.

Requires the pyarrow library and calls pyarrow.compute.extract_regex.