Skip to content

ENH: broaden SMILES handling to match RDKit#60

Open
tylerjereddy wants to merge 2 commits into
pckroon:masterfrom
tylerjereddy:treddy_broaden_smiles_acceptance
Open

ENH: broaden SMILES handling to match RDKit#60
tylerjereddy wants to merge 2 commits into
pckroon:masterfrom
tylerjereddy:treddy_broaden_smiles_acceptance

Conversation

@tylerjereddy

@tylerjereddy tylerjereddy commented Jun 26, 2026

Copy link
Copy Markdown
  • Related to a small part of BUG (?): isomorphic molecules are not handled the same way gruenewald-lab/CGsmiles#70.

  • One of the problematic SMILES permutations that RDKit considers isomorphic with a more canonical SMILES was not handled properly by pysmiles. This is a small shim (via regex) and regression test to start enforcing equivalence for at least this one extra permutation.

  • It is possible to generate many more scrambled/randomized SMILES permutations in RDKit, but for now this one extra permutation is the closest to what the chemists provided me, so is of most interest to me at least.

  • In local testing I don't see any problems, altough tests/test_hypothesis.py::Tester::runTest seems to fail for me even on master (decent chance it could be my recent version of hypothesis, which can change case generation by version).


AI usage disclaimer: I did ask an LLM to help me draft the regex for moving the ring termination number relative to the parenthetical double bond. The regression test I wrote myself based on in house needs.

* Related to a small part of
gruenewald-lab/CGsmiles#70.

* One of the problematic SMILES permutations that RDKit
considers isomorphic with a more canonical SMILES was
not handled properly by `pysmiles`. This is a small
shim (via regex) and regression test to start enforcing
equivalence for at least this one extra permutation.

* It is possible to generate many more scrambled/randomized
SMILES permutations in RDKit, but for now this one extra
permutation is the closest to what the chemists provided
me, so is of most interest to me at least.

* In local testing I don't see any problems, altough
tests/test_hypothesis.py::Tester::runTest seems to fail
for me even on `master` (decent chance it could be my recent
version of `hypothesis`, which can change case generation
by version).
Comment thread pysmiles/read_smiles.py Outdated
# sanitize invalid SMILES input that is accepted by RDKit
# see: https://github.com/gruenewald-lab/CGsmiles/issues/70#issuecomment-4750353505
pattern = r'(\(=[A-Z]\))(\d)'
mod_smiles = re.sub(pattern, r'\2\1', smiles)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose one could argue that this is the beginning of a slippery slope to shim around more and more pathological SMILES (RDKit can permute at least a dozen variants of the SMILES inputs used in this PR and still handles them just fine).

It might be useful to clarify why we're not using RDKit in this ecosystem--just the "weight" of the dependency?

Comment thread tests/test_read_smiles.py
@pytest.mark.parametrize("smiles", [
# these are both interpreted as the same
# molecule in RDKit
"C(c1c2cccc3c2c(cc1)C(=C)C(=C)C(=C)3)",

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other pathological example from the cross-listed issue is far harder to shim around. My one hesitation is the danger of reinventing the wheel given the versatility RDKit has with SMILES canonicalization/permutations.

Comment thread tests/test_read_smiles.py
# these are both interpreted as the same
# molecule in RDKit
"C(c1c2cccc3c2c(cc1)C(=C)C(=C)C(=C)3)",
"C(c1c2cccc3c2c(cc1)C(=C)C(=C)C3(=C))"

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for reference:

from rdkit import Chem
smiles_1 = "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C(=C)3)" 
smiles_2 = "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C3(=C))" 
mol_1 = Chem.MolFromSmiles(smiles_1)
mol_2 = Chem.MolFromSmiles(smiles_2)
assert Chem.MolToSmiles(mol_1) == Chem.MolToSmiles(mol_2)

and neither of those original SMILES matches what RDKit considers "canonical." I'm assuming the opensmiles specification is therefore different from what they consider standard/canonical.

@pckroon pckroon left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for diving into this <3

Using regex to rewrite smiles before parsing indeed feels very slippery and fragile. As an alternative, could you try changing this branch instead?

elif tokentype == TokenType.RING_NUM:
If I see it correctly the idx - 1 needs to be changed with anchor. If it doesn't work the tests will catch it.

As for the 'why not RDKit': way back when RDKit was a pain to install, requiring you to manually change PATH variables and the like. Now that they also distribute wheels and it's easily pip installable I'm not sure I'd make the same design decision.

One of the things you touch upon in the comments is that there is no such thing as SMILES. There's about 5 different dialects, some documented, others simply being used. And the idea of a 'canonical' smiles is a joke.

* Based on reviewer feedback in the above PR, revert
changes proposing to use a regex to work around
a problematic SMILES input.

* Instead, change the edge handling behavior in `base_smiles_parser()`,
based on reviewer feedback.
@tylerjereddy

Copy link
Copy Markdown
Author

@pckroon ok, I took a stab at backing out my original source changes and applying your suggestion in the most recent commit. The new regression test I added still passes.

test_hypothesis.py remains a bit moody for me locally, even on master, but that was already the case before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants