ENH: broaden SMILES handling to match RDKit by tylerjereddy · Pull Request #60 · pckroon/pysmiles

tylerjereddy · 2026-06-26T02:34:18Z

Related to a small part of BUG (?): isomorphic molecules are not handled the same way gruenewald-lab/CGsmiles#70.
One of the problematic SMILES permutations that RDKit considers isomorphic with a more canonical SMILES was not handled properly by pysmiles. This is a small shim (via regex) and regression test to start enforcing equivalence for at least this one extra permutation.
It is possible to generate many more scrambled/randomized SMILES permutations in RDKit, but for now this one extra permutation is the closest to what the chemists provided me, so is of most interest to me at least.
In local testing I don't see any problems, altough tests/test_hypothesis.py::Tester::runTest seems to fail for me even on master (decent chance it could be my recent version of hypothesis, which can change case generation by version).

AI usage disclaimer: I did ask an LLM to help me draft the regex for moving the ring termination number relative to the parenthetical double bond. The regression test I wrote myself based on in house needs.

* Related to a small part of gruenewald-lab/CGsmiles#70. * One of the problematic SMILES permutations that RDKit considers isomorphic with a more canonical SMILES was not handled properly by `pysmiles`. This is a small shim (via regex) and regression test to start enforcing equivalence for at least this one extra permutation. * It is possible to generate many more scrambled/randomized SMILES permutations in RDKit, but for now this one extra permutation is the closest to what the chemists provided me, so is of most interest to me at least. * In local testing I don't see any problems, altough tests/test_hypothesis.py::Tester::runTest seems to fail for me even on `master` (decent chance it could be my recent version of `hypothesis`, which can change case generation by version).

tylerjereddy · 2026-06-26T02:36:59Z

+    # sanitize invalid SMILES input that is accepted by RDKit
+    # see: https://github.com/gruenewald-lab/CGsmiles/issues/70#issuecomment-4750353505
+    pattern = r'(\(=[A-Z]\))(\d)'
+    mod_smiles = re.sub(pattern, r'\2\1', smiles)


I suppose one could argue that this is the beginning of a slippery slope to shim around more and more pathological SMILES (RDKit can permute at least a dozen variants of the SMILES inputs used in this PR and still handles them just fine).

It might be useful to clarify why we're not using RDKit in this ecosystem--just the "weight" of the dependency?

tylerjereddy · 2026-06-26T02:38:14Z

+@pytest.mark.parametrize("smiles", [
+    # these are both interpreted as the same
+    # molecule in RDKit
+    "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C(=C)3)",


The other pathological example from the cross-listed issue is far harder to shim around. My one hesitation is the danger of reinventing the wheel given the versatility RDKit has with SMILES canonicalization/permutations.

tylerjereddy · 2026-06-26T02:42:05Z

+    # these are both interpreted as the same
+    # molecule in RDKit
+    "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C(=C)3)",
+    "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C3(=C))"


for reference:

from rdkit import Chem smiles_1 = "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C(=C)3)" smiles_2 = "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C3(=C))" mol_1 = Chem.MolFromSmiles(smiles_1) mol_2 = Chem.MolFromSmiles(smiles_2) assert Chem.MolToSmiles(mol_1) == Chem.MolToSmiles(mol_2)

and neither of those original SMILES matches what RDKit considers "canonical." I'm assuming the opensmiles specification is therefore different from what they consider standard/canonical.

pckroon

Thanks for diving into this <3

Using regex to rewrite smiles before parsing indeed feels very slippery and fragile. As an alternative, could you try changing this branch instead?

pysmiles/pysmiles/read_smiles.py

Line 169 in c64f7f6

elif tokentype == TokenType.RING_NUM:

If I see it correctly the idx - 1 needs to be changed with anchor. If it doesn't work the tests will catch it.

As for the 'why not RDKit': way back when RDKit was a pain to install, requiring you to manually change PATH variables and the like. Now that they also distribute wheels and it's easily pip installable I'm not sure I'd make the same design decision.

One of the things you touch upon in the comments is that there is no such thing as SMILES. There's about 5 different dialects, some documented, others simply being used. And the idea of a 'canonical' smiles is a joke.

* Based on reviewer feedback in the above PR, revert changes proposing to use a regex to work around a problematic SMILES input. * Instead, change the edge handling behavior in `base_smiles_parser()`, based on reviewer feedback.

tylerjereddy · 2026-06-26T18:56:50Z

@pckroon ok, I took a stab at backing out my original source changes and applying your suggestion in the most recent commit. The new regression test I added still passes.

test_hypothesis.py remains a bit moody for me locally, even on master, but that was already the case before.

tylerjereddy commented Jun 26, 2026

View reviewed changes

tylerjereddy had a problem deploying to testpypi June 26, 2026 07:19 — with GitHub Actions Failure

pckroon requested changes Jun 26, 2026

View reviewed changes

MAINT: PR 60 revisions

45e6fd9

* Based on reviewer feedback in the above PR, revert changes proposing to use a regex to work around a problematic SMILES input. * Instead, change the edge handling behavior in `base_smiles_parser()`, based on reviewer feedback.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: broaden SMILES handling to match RDKit#60

ENH: broaden SMILES handling to match RDKit#60
tylerjereddy wants to merge 2 commits into
pckroon:masterfrom
tylerjereddy:treddy_broaden_smiles_acceptance

tylerjereddy commented Jun 26, 2026 •

edited

Loading

Uh oh!

tylerjereddy Jun 26, 2026

Uh oh!

tylerjereddy Jun 26, 2026

Uh oh!

tylerjereddy Jun 26, 2026

Uh oh!

pckroon left a comment

Uh oh!

tylerjereddy commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tylerjereddy commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tylerjereddy Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

tylerjereddy Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

tylerjereddy Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

pckroon left a comment

Choose a reason for hiding this comment

Uh oh!

tylerjereddy commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tylerjereddy commented Jun 26, 2026 •

edited

Loading