support UNIMOD format parsing and centralized mod-dictionary projection#1050
support UNIMOD format parsing and centralized mod-dictionary projection#1050nbollis wants to merge 23 commits into
Conversation
…serializer projection
…mzLib type, empty-string fast path
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1050 +/- ##
==========================================
+ Coverage 81.63% 81.72% +0.08%
==========================================
Files 369 371 +2
Lines 47605 47830 +225
Branches 5649 5696 +47
==========================================
+ Hits 38863 39088 +225
+ Misses 7643 7641 -2
- Partials 1099 1101 +2
🚀 New features to boost your workflow:
|
pcruzparri
left a comment
There was a problem hiding this comment.
TY for patching this!
|
@copilot resolve the merge conflicts in this pull request. The testing directory has been rearranged with proper namespaces and grouping. TestPeptideWithSetMods has been split into several classes |
trishorts
left a comment
There was a problem hiding this comment.
Automated code review - 6 finding(s) at severity >= Major.
| LoadAllProteinModifications(); | ||
| AllProteinModsList = UnimodModifications.Concat(UniprotModifications).Concat(MetaMorpheusProteinModifications).ToList(); | ||
| AllProteinModsList = MetaMorpheusProteinModifications.Concat(UniprotModifications).Concat(UnimodModifications).ToList(); | ||
| AllKnownProteinModsDictionary = AllProteinModsList |
There was a problem hiding this comment.
Major - Reordered concat inverts dedup precedence for duplicate IdWithMotif mods
Problem:
The only functional change in this file reverses the source order used to build AllProteinModsList. Base: UnimodModifications.Concat(UniprotModifications).Concat(MetaMorpheusProteinModifications); new: MetaMorpheusProteinModifications.Concat(UniprotModifications).Concat(UnimodModifications). Because the very next statement applies DistinctBy(m => m.IdWithMotif) — which keeps the first element per key — this reversal silently flips which Modification instance is retained whenever an IdWithMotif (e.g. "Oxidation on M", "Phospho on S") exists in more than one source. Previously the Unimod-sourced instance won; now the MetaMorpheus-sourced instance wins. The unchanged comment a few lines below ("Protein mods taking precedence") documents only protein-vs-RNA precedence, so this new intra-protein precedence is undocumented.
Impact:
Every consumer keyed off AllKnownProteinModsDictionary, AllProteinModsList, AllModsKnownDictionary, and AllKnownMods (the latter feeds GetModification(id, ModificationNamingConvention.Mixed) via FirstOrDefault) now resolves a different Modification object for collided IDs — potentially with different mass, database references, or a missing UnimodId. This directly intersects the PR's own goal: if a retained MetaMorpheus-sourced instance lacks the UnimodId that the Unimod-sourced duplicate carried, the new FilterByUnimodId resolution path can fail to resolve a UNIMOD token it previously would have matched. The change is silent and app-wide.
Suggested fix:
Confirm the intended intra-protein precedence and make it explicit. If MetaMorpheus-first is intentional, document it in the comment and ensure the new UNIMOD resolution path does not depend on UnimodId being present on the retained instance (e.g. merge UnimodId/metadata across duplicates rather than discarding the loser). If Unimod metadata must remain reachable for FilterByUnimodId, either preserve Unimod-first ordering for the dedup or carry UnimodId onto the winning entry. Add a regression test pinning the precedence.
Affected test scenarios:
- AllKnownProteinModsDictionary - IdWithMotif present in both Unimod and MetaMorpheus sources - asserts which instance (and its UnimodId/mass) is retained
- FilterByUnimodId resolution - sequence
PEPTI[UNIMOD:4]DEwhere the matching IdWithMotif also exists as a MetaMorpheus mod without UnimodId - still resolves to the correct modification - GetModification(id, ModificationNamingConvention.Mixed) - id colliding across sources - returns the documented-precedence instance
- AllProteinModsList ordering - duplicate IdWithMotif across all three sources - deterministic, documented winner
| if (hasParentheses && !hasSquareBrackets) | ||
| return false; | ||
| return IsCrosslinkAnnotatedSequence(input); | ||
|
|
There was a problem hiding this comment.
Major - Broadened CanParse claims parenthetical crosslink sequences without demonstrated round-trip
Problem:
CanParse previously returned false for inputs that have parentheses but no square brackets; the changed branch now returns IsCrosslinkAnnotatedSequence(input), so the mzLib parser now claims sequences like "EKVLTSSAR(2)". This is a behavioral broadening: it both (a) changes which registered parser is selected for parenthetical inputs in SequenceConversionService.Default, and (b) requires that the downstream parse path actually consume the (\d+) annotations rather than treating (, ), and digits as base-sequence content. ParseModificationString only handles bracket content and is unchanged, so nothing in the visible diff demonstrates the parenthetical annotation is parsed/stripped.
Impact:
If the parse path does not strip the (\d+) annotations, accepting them in CanParse converts a previously-clean rejection into a silent mis-parse (annotation digits/parentheses embedded in the base sequence) on the centralized pipeline used by the PeptideWithSetModifications/OligoWithSetMods constructors. Even if parsing is correct, parser-selection precedence for crosslink strings has changed and is unverified.
Suggested fix:
Add explicit round-trip tests that take a crosslink-annotated string through the full centralized path (parse → ToOneIsNterminusModificationDictionary / re-serialize) and assert the base sequence excludes the parenthetical annotation; confirm SequenceConversionService.Default still routes the string to the intended parser.
Affected test scenarios:
- ParseToOneIsNterminusModificationDictionary -
"EKVLTSSAR(2)"- base sequence isEKVLTSSAR, no stray(2)characters - SequenceConversionService.Default selection -
"EKVLTSSAR(2)"- resolves to the intended parser (not hijacked/regressed) - CanParse -
"PEP[Oxidation on M]TIDEK(3)"(brackets + parens) - still handled via balanced-bracket path
| { | ||
| var canonical = SequenceConversionService.Default.ParseAutoDetect(fullSequence); | ||
| return MzLibSequenceSerializer.Instance.ToOneIsNterminusModificationDictionary(canonical!.Value, allModsKnown); | ||
| } |
There was a problem hiding this comment.
Major - canonical!.Value can throw an exception that escapes the catch and breaks the documented contract
Problem:
var canonical = SequenceConversionService.Default.ParseAutoDetect(fullSequence) is dereferenced with canonical!.Value. The null-forgiving operator ! is only needed because ParseAutoDetect has a nullable return type — i.e. the API can return null (a CanonicalSequence? that is empty) when auto-detection finds no matching parser. In that case .Value throws InvalidOperationException ("Nullable object must have a value"), which is not a SequenceConversionException and therefore is not caught by the catch (SequenceConversionException e) block on line 114. The method's XML doc explicitly promises MzLibException "When a full sequence is not in the correct format," so an unparseable input must surface as MzLibException, not a raw InvalidOperationException.
Impact:
A malformed or unrecognized-format full sequence (the exact failure case the old manual parser handled by throwing MzLibException) now throws an uncaught InvalidOperationException/NullReferenceException instead. Callers in the Koina/prediction and digestion pipelines that catch MzLibException will no longer catch it, turning a recoverable parse error into an unhandled crash and silently regressing backward compatibility.
Suggested fix:
Either guarantee ParseAutoDetect throws SequenceConversionException on failure (and drop the !), or explicitly handle a null/valueless result here by throwing the wrapped MzLibException with the "Could not find modification while reading string" message, so every unparseable input exits through the documented exception type.
Affected test scenarios:
- GetModificationDictionaryFromFullSequence - input is gibberish/unrecognized format that yields a null ParseAutoDetect result - expect
MzLibException(notInvalidOperationException) - GetModificationDictionaryFromFullSequence - input has an unterminated bracket like
PEPTI[UNIMOD:4DE- expectMzLibException - GetModificationDictionaryFromFullSequence - valid UNIMOD sequence
PEPTI[UNIMOD:4]DE- expect non-null dictionary with the mod at the correct OneIsNterminus index
| { | ||
| Assert.That(_parser.CanParse("PEPTIDE"), Is.False); | ||
| } | ||
|
|
There was a problem hiding this comment.
Major - Ground-truth parse tests named "SetsCorrectUnimodId" never assert any UnimodId
Problem:
Parse_GroundTruthUpperCase_SetsCorrectUnimodId, Parse_GroundTruthCamelCase_SetsCorrectUnimodId, Parse_GroundTruthLowerCase_SetsCorrectUnimodId, and Parse_GroundTruthNoLabel_SetsCorrectUnimodId claim, by name, to verify that the parser assigns the correct UnimodId. None of them ever reads modification.UnimodId. The upper-case variant only asserts BaseSequence and Modifications.Length; the camelCase/lowerCase/noLabel variants assert only BaseSequence. Setting UnimodId is the entire purpose of UnimodSequenceParser (it is what enables the FilterByUnimodId resolution path the PR depends on), yet that value is never checked.
Impact:
A regression where the parser extracts the base sequence and the right number of bracket tokens but assigns a wrong or null UnimodId (e.g., off-by-one digit parsing, dropping the integer, swapping camelCase handling) would pass all four data-driven suites. The headline functionality is effectively unverified by these tests.
Suggested fix:
For each parsed modification in these tests, assert the expected UnimodId against the ground-truth case (add an expected-id list/field to the test data, or at minimum assert the id at a known position equals the value embedded in the format string).
Affected test scenarios:
- Parse_GroundTruthUpperCase - PEPC[UNIMOD:4]IDE - modification at C has UnimodId == 4
- Parse_GroundTruthLowerCase - PEPC[unimod:4]IDE - modification at C has UnimodId == 4
- Parse_GroundTruthNoLabel - PEPC[4]IDE - modification at C has UnimodId == 4
| var oxMod = Mods.AllKnownProteinModsDictionary["Oxidation on M"]; | ||
| var sequence = new CanonicalSequenceBuilder("PEPTMIDE") | ||
| .AddResidueModification(1, "UnknownMod") | ||
| .AddResidueModification(4, "Oxidation", mzLibModification: oxMod) |
There was a problem hiding this comment.
Major - Resolution-priority test does not create contention, so precedence is never proven
Problem:
ResolveModificationsForProjection_FallbackResolvesBeforePrimaryLookup claims to verify that caller knownMods win over the serializer's lookup (the PR's stated priority change). The setup makes this impossible to verify: the serializer's SelectiveResolveLookup is keyed on "Oxidation", but the modification's OriginalRepresentation is "Oxidation on M". SelectiveResolveLookup.TryResolve does an exact TryGetValue on OriginalRepresentation, so the primary lookup returns null and cannot resolve this mod at all. Only knownMods can resolve it, so the test passes identically whether knownMods are consulted first or last — it proves nothing about precedence. The naming is also inverted versus the PR ("knownMods first" is labeled the "fallback").
Impact:
The PR's core resolution-priority change ("prevents custom mods like Metal:Cation:Fe[III] on X from matching wrong global entries") is exactly the contention case — both lookups can resolve the same token to different mods — and no test exercises it. A regression flipping the order would not be caught.
Suggested fix:
Construct a true conflict: have both the serializer lookup and knownMods resolve the same OriginalRepresentation to two different Modification instances, then assert the resolved result is the one from knownMods. Realign the test/parameter names with the PR's "knownMods first" terminology.
Affected test scenarios:
- ResolveModificationsForProjection - same token resolvable by both lookups to different mods - knownMods result wins
- ResolveModificationsForProjection - knownMods empty, serializer lookup resolves - serializer result used
| #region Resolution by Formula Tests | ||
|
|
||
| [Test] | ||
| public void TryResolve_ByFormula_ResolvesFromDictionary() |
There was a problem hiding this comment.
Major - No regression test for FilterByUnimodId substring collision (4 matching 14/24/44)
Problem:
The PR's headline FilterByUnimodId fix changes a Contains match to an exact match so that id 4 no longer matches 14, 24, 34, 44.... The UNIMOD resolution tests only cover positive resolution (TryResolve_ByUnimodId_ResolvesFromDictionary, …ByUnimodIdInAccession…) and a not-present id (TryResolve_UnimodIdNotInDictionary_ReturnsNull with id 999). There is no test that places a mod with a superstring UNIMOD id (e.g., UNIMOD:44) in the dictionary and asserts that resolving unimodId: 4 does not match it.
Impact:
The exact bug the PR fixes — a short id substring-matching a longer registered id — has no guard. A future revert to substring matching, or a regex/StartsWith reintroduction, would silently mis-resolve and these tests would stay green.
Suggested fix:
Add a test with a dictionary containing a mod whose UNIMOD database reference/accession is 44 (and ideally 14), resolve a canonical mod with unimodId: 4, and assert the result is null (or resolves only to the genuine id-4 entry when both are present).
Affected test scenarios:
- TryResolve_ByUnimodId - dictionary holds only UNIMOD:44, query unimodId 4 - returns null
- TryResolve_ByUnimodId - dictionary holds both UNIMOD:4 and UNIMOD:44, query 4 - resolves to the id-4 mod
Problem
PeptideWithSetModifications and OligoWithSetMods string constructors crashed when given UNIMOD-format sequences (e.g., PEPTI[UNIMOD:4]DE). The GetModificationDictionaryFromFullSequence method extracted bracket content like UNIMOD:4, stripped the prefix to get modId = "4", then looked up in the mzLib-keyed dictionary ("Oxidation on M") and threw MzLibException. This blocked the Koina/FragmentIntensityModel prediction pipeline, which stores ValidatedFullSequence in UNIMOD format.
Changes
New UNIMOD parser (UnimodSequenceParser.cs)
Serializer mod-dictionary projection (ISequenceSerializer, SequenceSerializerBase)
Centralized parsing path (IBioPolymerWithSetMods.cs, SequenceConversionExtensions.cs)
Bug fixes (ModificationLookupBase.cs, MzLibSequenceSerializer.cs)