ENH: broaden SMILES handling to match RDKit#60
Conversation
* Related to a small part of gruenewald-lab/CGsmiles#70. * One of the problematic SMILES permutations that RDKit considers isomorphic with a more canonical SMILES was not handled properly by `pysmiles`. This is a small shim (via regex) and regression test to start enforcing equivalence for at least this one extra permutation. * It is possible to generate many more scrambled/randomized SMILES permutations in RDKit, but for now this one extra permutation is the closest to what the chemists provided me, so is of most interest to me at least. * In local testing I don't see any problems, altough tests/test_hypothesis.py::Tester::runTest seems to fail for me even on `master` (decent chance it could be my recent version of `hypothesis`, which can change case generation by version).
| # sanitize invalid SMILES input that is accepted by RDKit | ||
| # see: https://github.com/gruenewald-lab/CGsmiles/issues/70#issuecomment-4750353505 | ||
| pattern = r'(\(=[A-Z]\))(\d)' | ||
| mod_smiles = re.sub(pattern, r'\2\1', smiles) |
There was a problem hiding this comment.
I suppose one could argue that this is the beginning of a slippery slope to shim around more and more pathological SMILES (RDKit can permute at least a dozen variants of the SMILES inputs used in this PR and still handles them just fine).
It might be useful to clarify why we're not using RDKit in this ecosystem--just the "weight" of the dependency?
| @pytest.mark.parametrize("smiles", [ | ||
| # these are both interpreted as the same | ||
| # molecule in RDKit | ||
| "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C(=C)3)", |
There was a problem hiding this comment.
The other pathological example from the cross-listed issue is far harder to shim around. My one hesitation is the danger of reinventing the wheel given the versatility RDKit has with SMILES canonicalization/permutations.
| # these are both interpreted as the same | ||
| # molecule in RDKit | ||
| "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C(=C)3)", | ||
| "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C3(=C))" |
There was a problem hiding this comment.
for reference:
from rdkit import Chem
smiles_1 = "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C(=C)3)"
smiles_2 = "C(c1c2cccc3c2c(cc1)C(=C)C(=C)C3(=C))"
mol_1 = Chem.MolFromSmiles(smiles_1)
mol_2 = Chem.MolFromSmiles(smiles_2)
assert Chem.MolToSmiles(mol_1) == Chem.MolToSmiles(mol_2)and neither of those original SMILES matches what RDKit considers "canonical." I'm assuming the opensmiles specification is therefore different from what they consider standard/canonical.
pckroon
left a comment
There was a problem hiding this comment.
Thanks for diving into this <3
Using regex to rewrite smiles before parsing indeed feels very slippery and fragile. As an alternative, could you try changing this branch instead?
pysmiles/pysmiles/read_smiles.py
Line 169 in c64f7f6
idx - 1 needs to be changed with anchor. If it doesn't work the tests will catch it.
As for the 'why not RDKit': way back when RDKit was a pain to install, requiring you to manually change PATH variables and the like. Now that they also distribute wheels and it's easily pip installable I'm not sure I'd make the same design decision.
One of the things you touch upon in the comments is that there is no such thing as SMILES. There's about 5 different dialects, some documented, others simply being used. And the idea of a 'canonical' smiles is a joke.
* Based on reviewer feedback in the above PR, revert changes proposing to use a regex to work around a problematic SMILES input. * Instead, change the edge handling behavior in `base_smiles_parser()`, based on reviewer feedback.
|
@pckroon ok, I took a stab at backing out my original source changes and applying your suggestion in the most recent commit. The new regression test I added still passes.
|
Related to a small part of BUG (?): isomorphic molecules are not handled the same way gruenewald-lab/CGsmiles#70.
One of the problematic SMILES permutations that RDKit considers isomorphic with a more canonical SMILES was not handled properly by
pysmiles. This is a small shim (via regex) and regression test to start enforcing equivalence for at least this one extra permutation.It is possible to generate many more scrambled/randomized SMILES permutations in RDKit, but for now this one extra permutation is the closest to what the chemists provided me, so is of most interest to me at least.
In local testing I don't see any problems, altough
tests/test_hypothesis.py::Tester::runTestseems to fail for me even onmaster(decent chance it could be my recent version ofhypothesis, which can change case generation by version).AI usage disclaimer: I did ask an LLM to help me draft the regex for moving the ring termination number relative to the parenthetical double bond. The regression test I wrote myself based on in house needs.