String dtype: differences in regex operations because of engine (Python vs RE2) #63683

@jorisvandenbossche

One area where there can be significant behaviour differences depending on the string dtype backend being used (#63105) is for regex operations:

  • The Python str object-dtype based implementation uses the Python stdlib re module, which I think was generally based on the PCRE engine but has diverged from it.
  • On the other hand, for the pyarrow-backed string dtype, the Arrow C++ / pyarrow library uses the RE2 engine.

And unfortunately, there are lots of different regular expression engines with slight differences in behaviour and supported features (https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines).

We already encountered a few differences: #60833 (RE2 not supporting look-around assertions), #63385 (RE2 not supporting \Z), #62653 (replace with groups).
RE2 explicitly says that it does "not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported."
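
For reference, here is a minimal sketch of the two constructs named in that quote, using only the stdlib re module (the variable names are just for illustration). Both work with Python's backtracking engine; going by the quote, RE2 rejects such patterns, so they would be expected to fail with the pyarrow-backed dtype rather than return a result.

```python
import re

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.search(r"(\w)\1", "hello")

# Look-ahead assertion: match "foo" only when it is followed by "bar".
lookahead = re.search(r"foo(?=bar)", "foobar")

print(doubled.group())    # ll
print(lookahead.group())  # foo
```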

There is probably no easy, definitive list of differences, but @rhshadrach found the (old) https://swtch.com/~rsc/regexp/regexp3.html#caveats article from the RE2 author, which lists a bunch of them. It also includes a case where the difference shows up as silently different behaviour rather than as an error about an unsupported feature: $ treats a trailing newline differently:

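import pandas as pd

# Default string dtype (pyarrow-backed here, so RE2 is used)
ser = pd.Series(["a\n"])
print(ser.str.contains(r'a$'))
# 0    False
# dtype: bool

# object dtype (stdlib re is used)
print(ser.astype(object).str.contains(r'a$'))
# 0    True
# dtype: bool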
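
The object-dtype result matches what plain stdlib re does: without re.MULTILINE, $ also matches just before a final newline, while \Z only matches at the very end of the string. A small sketch of this, using only the stdlib; the last pattern is just one possible way to spell out Python's meaning of $ explicitly, and is an assumption for illustration, not something proposed in this issue:

```python
import re

text = "a\n"

# Python's $ matches at the end of the string and just before a final newline.
dollar = re.search(r"a$", text) is not None

# \Z matches only at the very end of the string, so the newline blocks it.
strict_end = re.search(r"a\Z", text) is not None

# Writing "optional final newline, then end of string" makes the Python
# meaning of $ explicit instead of relying on the engine's default.
explicit = re.search(r"a\n?\Z", text) is not None

print(dollar, strict_end, explicit)  # True False True
```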

Update: the description of the RE2 Python bindings may give quite a good overview (although it does not mention the difference above), since those bindings specifically mimic the stdlib re module except where behaviour differs because of RE2 (https://pypi.org/project/google-re2/, https://github.com/google/re2/blob/e7aec5985072c1dbe735add802653ef4b36c231a/python/setup.py#L11-L42):

[about python bindings of RE2] A drop-in replacement for the re module.

It uses RE2 under the hood, of course, so various PCRE features
(e.g. backreferences, look-around assertions) are not supported.
See https://github.com/google/re2/wiki/Syntax for the canonical
reference, but known syntactic "gotchas" relative to Python are:

  • PCRE supports \Z and \z; RE2 supports \z; Python supports \z,
    but calls it \Z. You must rewrite \Z to \z in pattern strings.

Known differences between this module's API and the re module's API: ... [not relevant for this discussion]
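
The \Z to \z rewrite mentioned in that description could be sketched roughly as follows. The helper name rewrite_end_anchor is hypothetical (it is not part of pandas, pyarrow or RE2), and the sketch only handles the simple case: it walks the pattern one escape sequence at a time so that an escaped backslash followed by a literal Z is left alone, but it does not try to handle \Z inside character classes.

```python
import re


def rewrite_end_anchor(pattern: str) -> str:
    """Rewrite Python's \\Z anchor to \\z (hypothetical helper, simple cases only)."""

    def _swap(match: re.Match) -> str:
        # Each match is one backslash plus the character it escapes.
        # Only the pair "\Z" is rewritten; every other escape is kept as-is.
        return r"\z" if match.group(1) == "Z" else match.group(0)

    # Consuming escapes pairwise means "\\Z" (escaped backslash, then a
    # literal Z) is read as "\\" followed by "Z" and is not rewritten.
    return re.sub(r"\\(.)", _swap, pattern, flags=re.DOTALL)


print(rewrite_end_anchor(r"a\Z"))    # a\z
print(rewrite_end_anchor(r"a\\Z"))   # a\\Z
```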

Labels: API - Consistency (Internal Consistency of API/Behavior), Strings (String extension data type and string data)