One area where there can be significant behaviour differences depending on the string dtype backend being used (#63105) is for regex operations:
- The Python str object-dtype based implementation uses the Python stdlib
re module, which I think was generally based on the PCRE engine but has diverged from it.
- On the other hand, for the pyarrow-backed string dtype, the Arrow C++ / pyarrow library uses the RE2 engine.
And unfortunately, there are lots of different regular expression engines with slight differences in behaviour and supported features (https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines).
We already encountered a few differences: #60833 (RE2 not supporting look-around assertions), #63385 (RE2 not supporting \Z), #62653 (replace with groups).
RE2 explicitly says that it does "not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported."
There is probably no easy, definitive list of differences, but @rhshadrach found the (old) https://swtch.com/~rsc/regexp/regexp3.html#caveats article from the RE2 author, which lists a number of them. It also shows a case where the result silently differs instead of an error being raised for an unsupported feature: $ treats a trailing newline differently:
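To make the unsupported constructs concrete, here is a minimal sketch that checks a few patterns against both engines. The helper names `stdlib_supports` and `re2_supports` are made up for illustration, and the RE2 check assumes that pyarrow's `match_substring_regex` kernel raises `ArrowInvalid` for a pattern RE2 cannot compile; if pyarrow is not installed it returns `None`. It calls pyarrow directly, so it says nothing about fallbacks pandas itself may apply.

```python
import re

# Patterns that rely on features RE2 documents as unsupported.
patterns = {
    "look-ahead": r"foo(?=bar)",
    "look-behind": r"(?<=foo)bar",
    "backreference": r"(\w)\1",
}


def stdlib_supports(pattern):
    """Return True if the stdlib re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


def re2_supports(pattern):
    """Return True/False for Arrow's RE2-based kernel, or None without pyarrow."""
    try:
        import pyarrow as pa
        import pyarrow.compute as pc
    except ImportError:
        return None
    try:
        pc.match_substring_regex(pa.array(["foobar"]), pattern)
    except pa.ArrowInvalid:
        # Assumption: an invalid RE2 pattern surfaces as ArrowInvalid.
        return False
    return True


for name, pattern in patterns.items():
    print(name, stdlib_supports(pattern), re2_supports(pattern))
```

A check like this could help decide up front whether a given pattern needs the object-dtype code path.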
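```python
import pandas as pd

# Assumes the default string dtype is pyarrow-backed (RE2 engine)
ser = pd.Series(["a\n"])
print(ser.str.contains(r'a$'))
# 0    False
# dtype: bool

# object dtype uses the stdlib re module
print(ser.astype(object).str.contains(r'a$'))
# 0    True
# dtype: bool
```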
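The stdlib side of this difference can be reproduced without pandas. The sketch below uses only the `re` module and shows that `$` also matches just before a single trailing newline, while `\Z` matches only at the very end of the string, which corresponds to what the pyarrow-backed result above suggests RE2 does for `$`.

```python
import re

text = "a\n"

# In the stdlib re module, `$` (without re.MULTILINE) matches at the very end
# of the string and also just before a single trailing newline.
dollar_matches = re.search(r"a$", text) is not None

# `\Z` matches only at the very end of the string.
strict_end_matches = re.search(r"a\Z", text) is not None

print(dollar_matches)      # True
print(strict_end_matches)  # False
```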
Update: the description of the Python bindings of RE2 might give quite a good overview (although it does not mention the above difference), as those bindings specifically mimic the stdlib re module, except where behaviour differs because RE2 is the underlying engine (https://pypi.org/project/google-re2/, https://github.com/google/re2/blob/e7aec5985072c1dbe735add802653ef4b36c231a/python/setup.py#L11-L42):
[about python bindings of RE2] A drop-in replacement for the re module.
It uses RE2 under the hood, of course, so various PCRE features
(e.g. backreferences, look-around assertions) are not supported.
See https://github.com/google/re2/wiki/Syntax for the canonical
reference, but known syntactic "gotchas" relative to Python are:
- PCRE supports \Z and \z; RE2 supports \z; Python supports \z,
but calls it \Z. You must rewrite \Z to \z in pattern strings.
Known differences between this module's API and the re module's API: ... [not relevant for this discussion]
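As an illustration of the "rewrite \Z to \z" advice quoted above, a pattern translation could look roughly like this. `rewrite_end_anchor` is a hypothetical helper, not something pandas or pyarrow is known to provide, and it is deliberately naive: it only skips `\Z` when the backslash is itself escaped, and does not handle other contexts such as character classes.

```python
import re

# Match `\Z` only when its backslash is not itself escaped, i.e. when it is
# preceded by an even number (possibly zero) of backslashes.
_PYTHON_END_ANCHOR = re.compile(r"(?<!\\)((?:\\\\)*)\\Z")


def rewrite_end_anchor(pattern: str) -> str:
    """Rewrite Python's `\\Z` anchor to the `\\z` spelling that RE2 accepts."""
    return _PYTHON_END_ANCHOR.sub(lambda m: m.group(1) + r"\z", pattern)


print(rewrite_end_anchor(r"abc\Z"))   # abc\z
print(rewrite_end_anchor(r"abc\\Z"))  # abc\\Z (escaped backslash, then a literal Z)
```

Whether such a rewrite should happen automatically or the difference should just be documented is a design question for the discussion.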