Skip to content

support UNIMOD format parsing and centralized mod-dictionary projection#1050

Open
nbollis wants to merge 23 commits into
smith-chem-wisc:masterfrom
nbollis:koinaUpdate
Open

support UNIMOD format parsing and centralized mod-dictionary projection#1050
nbollis wants to merge 23 commits into
smith-chem-wisc:masterfrom
nbollis:koinaUpdate

Conversation

@nbollis

@nbollis nbollis commented Apr 16, 2026

Copy link
Copy Markdown
Member

Problem

PeptideWithSetModifications and OligoWithSetMods string constructors crashed when given UNIMOD-format sequences (e.g., PEPTI[UNIMOD:4]DE). The GetModificationDictionaryFromFullSequence method extracted bracket content like UNIMOD:4, stripped the prefix to get modId = "4", then looked up in the mzLib-keyed dictionary ("Oxidation on M") and threw MzLibException. This blocked the Koina/FragmentIntensityModel prediction pipeline, which stores ValidatedFullSequence in UNIMOD format.

Changes

New UNIMOD parser (UnimodSequenceParser.cs)

  • Parses PEPTI[UNIMOD:4]DE into CanonicalSequence with UnimodId set, enabling the existing FilterByUnimodId resolution path
  • Registered in SequenceConversionService.Default with converters to mzLib and UNIMOD serializers
  • MzLibSequenceParser.CanParse now rejects UNIMOD-bracket tokens to prevent format misidentification

Serializer mod-dictionary projection (ISequenceSerializer, SequenceSerializerBase)

  • Added ToOneIsNterminusModificationDictionary() to ISequenceSerializer and virtual implementation in SequenceSerializerBase
  • Maps CanonicalModification positions to OneIsNterminus indices and resolves modifications
  • KnownModsLookup (new class) wraps caller-provided Dictionary<string, Modification> as IModificationLookup
  • Resolution priority: caller's knownMods first, then serializer's global lookup — prevents custom mods (e.g., Metal:Cation:Fe[III] on X) from matching wrong global entries

Centralized parsing path (IBioPolymerWithSetMods.cs, SequenceConversionExtensions.cs)

  • GetModificationDictionaryFromFullSequence now delegates to SequenceConversionService.Default.ParseToOneIsNterminusModificationDictionary()
  • Removed 50+ lines of manual bracket/token parsing and UNIMOD fallback helpers
  • Empty-string fast path preserved; SequenceConversionException wrapped as MzLibException for backward compatibility

Bug fixes (ModificationLookupBase.cs, MzLibSequenceSerializer.cs)

  • FilterByUnimodId: changed Contains to exact match with UNIMOD: prefix stripping (was matching ID 4 to 4, 14, 24, 34, 44...)
  • SelectBestCandidate: returns null when no target residue and candidates span multiple distinct residues (was picking arbitrary match for ambiguous mods like Phospho)
  • MzLibSequenceSerializer.TryNormalizeStrictToken: rejects "UNIMOD" as mzLib mod type, forcing enrichment through lookup pipeline

@codecov

codecov Bot commented Apr 17, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.88256% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.72%. Comparing base (92be818) to head (3370691).

Files with missing lines Patch % Lines
...ion/Implementations/Unimod/UnimodSequenceParser.cs 80.43% 6 Missing and 3 partials ⚠️
...ceConversion/BaseClasses/ModificationLookupBase.cs 77.77% 2 Missing and 4 partials ⚠️
...ceConversion/BaseClasses/SequenceSerializerBase.cs 98.43% 0 Missing and 2 partials ⚠️
...rsion/Implementations/MzLib/MzLibSequenceParser.cs 81.81% 1 Missing and 1 partial ⚠️
...rsion/Implementations/KnownMods/KnownModsLookup.cs 97.72% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1050      +/-   ##
==========================================
+ Coverage   81.63%   81.72%   +0.08%     
==========================================
  Files         369      371       +2     
  Lines       47605    47830     +225     
  Branches     5649     5696      +47     
==========================================
+ Hits        38863    39088     +225     
+ Misses       7643     7641       -2     
- Partials     1099     1101       +2     
Files with missing lines Coverage Δ
mzLib/Omics/IBioPolymerWithSetMods.cs 98.71% <100.00%> (+2.99%) ⬆️
mzLib/Omics/Modifications/Mods.cs 95.68% <100.00%> (ø)
...quenceConversion/BaseClasses/SequenceParserBase.cs 72.66% <100.00%> (+5.77%) ⬆️
...version/CanonicalSequence/CanonicalModification.cs 87.12% <100.00%> (+0.99%) ⬆️
...n/Implementations/MzLib/MzLibSequenceSerializer.cs 65.21% <100.00%> (+1.58%) ⬆️
...cs/SequenceConversion/SequenceConversionService.cs 93.75% <100.00%> (+1.39%) ⬆️
...rsion/Implementations/KnownMods/KnownModsLookup.cs 97.72% <97.72%> (ø)
...ceConversion/BaseClasses/SequenceSerializerBase.cs 98.36% <98.43%> (+6.98%) ⬆️
...rsion/Implementations/MzLib/MzLibSequenceParser.cs 90.90% <81.81%> (+2.67%) ⬆️
...ceConversion/BaseClasses/ModificationLookupBase.cs 79.32% <77.77%> (-1.22%) ⬇️
... and 1 more

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nbollis nbollis changed the title Koina update support UNIMOD format parsing and centralized mod-dictionary projection Apr 17, 2026
@nbollis nbollis marked this pull request as ready for review April 22, 2026 22:38
pcruzparri
pcruzparri previously approved these changes Apr 23, 2026

@pcruzparri pcruzparri left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TY for patching this!

@nbollis

nbollis commented Apr 29, 2026

Copy link
Copy Markdown
Member Author

@copilot resolve the merge conflicts in this pull request. The testing directory has been rearranged with proper namespaces and grouping. TestPeptideWithSetMods has been split into several classes

@trishorts trishorts left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated code review - 6 finding(s) at severity >= Major.

LoadAllProteinModifications();
AllProteinModsList = UnimodModifications.Concat(UniprotModifications).Concat(MetaMorpheusProteinModifications).ToList();
AllProteinModsList = MetaMorpheusProteinModifications.Concat(UniprotModifications).Concat(UnimodModifications).ToList();
AllKnownProteinModsDictionary = AllProteinModsList

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major - Reordered concat inverts dedup precedence for duplicate IdWithMotif mods

Problem:
The only functional change in this file reverses the source order used to build AllProteinModsList. Base: UnimodModifications.Concat(UniprotModifications).Concat(MetaMorpheusProteinModifications); new: MetaMorpheusProteinModifications.Concat(UniprotModifications).Concat(UnimodModifications). Because the very next statement applies DistinctBy(m => m.IdWithMotif) — which keeps the first element per key — this reversal silently flips which Modification instance is retained whenever an IdWithMotif (e.g. "Oxidation on M", "Phospho on S") exists in more than one source. Previously the Unimod-sourced instance won; now the MetaMorpheus-sourced instance wins. The unchanged comment a few lines below ("Protein mods taking precedence") documents only protein-vs-RNA precedence, so this new intra-protein precedence is undocumented.

Impact:
Every consumer keyed off AllKnownProteinModsDictionary, AllProteinModsList, AllModsKnownDictionary, and AllKnownMods (the latter feeds GetModification(id, ModificationNamingConvention.Mixed) via FirstOrDefault) now resolves a different Modification object for collided IDs — potentially with different mass, database references, or a missing UnimodId. This directly intersects the PR's own goal: if a retained MetaMorpheus-sourced instance lacks the UnimodId that the Unimod-sourced duplicate carried, the new FilterByUnimodId resolution path can fail to resolve a UNIMOD token it previously would have matched. The change is silent and app-wide.

Suggested fix:
Confirm the intended intra-protein precedence and make it explicit. If MetaMorpheus-first is intentional, document it in the comment and ensure the new UNIMOD resolution path does not depend on UnimodId being present on the retained instance (e.g. merge UnimodId/metadata across duplicates rather than discarding the loser). If Unimod metadata must remain reachable for FilterByUnimodId, either preserve Unimod-first ordering for the dedup or carry UnimodId onto the winning entry. Add a regression test pinning the precedence.

Affected test scenarios:

  • AllKnownProteinModsDictionary - IdWithMotif present in both Unimod and MetaMorpheus sources - asserts which instance (and its UnimodId/mass) is retained
  • FilterByUnimodId resolution - sequence PEPTI[UNIMOD:4]DE where the matching IdWithMotif also exists as a MetaMorpheus mod without UnimodId - still resolves to the correct modification
  • GetModification(id, ModificationNamingConvention.Mixed) - id colliding across sources - returns the documented-precedence instance
  • AllProteinModsList ordering - duplicate IdWithMotif across all three sources - deterministic, documented winner

if (hasParentheses && !hasSquareBrackets)
return false;
return IsCrosslinkAnnotatedSequence(input);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major - Broadened CanParse claims parenthetical crosslink sequences without demonstrated round-trip

Problem:
CanParse previously returned false for inputs that have parentheses but no square brackets; the changed branch now returns IsCrosslinkAnnotatedSequence(input), so the mzLib parser now claims sequences like "EKVLTSSAR(2)". This is a behavioral broadening: it both (a) changes which registered parser is selected for parenthetical inputs in SequenceConversionService.Default, and (b) requires that the downstream parse path actually consume the (\d+) annotations rather than treating (, ), and digits as base-sequence content. ParseModificationString only handles bracket content and is unchanged, so nothing in the visible diff demonstrates the parenthetical annotation is parsed/stripped.

Impact:
If the parse path does not strip the (\d+) annotations, accepting them in CanParse converts a previously-clean rejection into a silent mis-parse (annotation digits/parentheses embedded in the base sequence) on the centralized pipeline used by the PeptideWithSetModifications/OligoWithSetMods constructors. Even if parsing is correct, parser-selection precedence for crosslink strings has changed and is unverified.

Suggested fix:
Add explicit round-trip tests that take a crosslink-annotated string through the full centralized path (parse → ToOneIsNterminusModificationDictionary / re-serialize) and assert the base sequence excludes the parenthetical annotation; confirm SequenceConversionService.Default still routes the string to the intended parser.

Affected test scenarios:

  • ParseToOneIsNterminusModificationDictionary - "EKVLTSSAR(2)" - base sequence is EKVLTSSAR, no stray (2) characters
  • SequenceConversionService.Default selection - "EKVLTSSAR(2)" - resolves to the intended parser (not hijacked/regressed)
  • CanParse - "PEP[Oxidation on M]TIDEK(3)" (brackets + parens) - still handled via balanced-bracket path

{
var canonical = SequenceConversionService.Default.ParseAutoDetect(fullSequence);
return MzLibSequenceSerializer.Instance.ToOneIsNterminusModificationDictionary(canonical!.Value, allModsKnown);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major - canonical!.Value can throw an exception that escapes the catch and breaks the documented contract

Problem:
var canonical = SequenceConversionService.Default.ParseAutoDetect(fullSequence) is dereferenced with canonical!.Value. The null-forgiving operator ! is only needed because ParseAutoDetect has a nullable return type — i.e. the API can return null (a CanonicalSequence? that is empty) when auto-detection finds no matching parser. In that case .Value throws InvalidOperationException ("Nullable object must have a value"), which is not a SequenceConversionException and therefore is not caught by the catch (SequenceConversionException e) block on line 114. The method's XML doc explicitly promises MzLibException "When a full sequence is not in the correct format," so an unparseable input must surface as MzLibException, not a raw InvalidOperationException.

Impact:
A malformed or unrecognized-format full sequence (the exact failure case the old manual parser handled by throwing MzLibException) now throws an uncaught InvalidOperationException/NullReferenceException instead. Callers in the Koina/prediction and digestion pipelines that catch MzLibException will no longer catch it, turning a recoverable parse error into an unhandled crash and silently regressing backward compatibility.

Suggested fix:
Either guarantee ParseAutoDetect throws SequenceConversionException on failure (and drop the !), or explicitly handle a null/valueless result here by throwing the wrapped MzLibException with the "Could not find modification while reading string" message, so every unparseable input exits through the documented exception type.

Affected test scenarios:

  • GetModificationDictionaryFromFullSequence - input is gibberish/unrecognized format that yields a null ParseAutoDetect result - expect MzLibException (not InvalidOperationException)
  • GetModificationDictionaryFromFullSequence - input has an unterminated bracket like PEPTI[UNIMOD:4DE - expect MzLibException
  • GetModificationDictionaryFromFullSequence - valid UNIMOD sequence PEPTI[UNIMOD:4]DE - expect non-null dictionary with the mod at the correct OneIsNterminus index

{
Assert.That(_parser.CanParse("PEPTIDE"), Is.False);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major - Ground-truth parse tests named "SetsCorrectUnimodId" never assert any UnimodId

Problem:
Parse_GroundTruthUpperCase_SetsCorrectUnimodId, Parse_GroundTruthCamelCase_SetsCorrectUnimodId, Parse_GroundTruthLowerCase_SetsCorrectUnimodId, and Parse_GroundTruthNoLabel_SetsCorrectUnimodId claim, by name, to verify that the parser assigns the correct UnimodId. None of them ever reads modification.UnimodId. The upper-case variant only asserts BaseSequence and Modifications.Length; the camelCase/lowerCase/noLabel variants assert only BaseSequence. Setting UnimodId is the entire purpose of UnimodSequenceParser (it is what enables the FilterByUnimodId resolution path the PR depends on), yet that value is never checked.

Impact:
A regression where the parser extracts the base sequence and the right number of bracket tokens but assigns a wrong or null UnimodId (e.g., off-by-one digit parsing, dropping the integer, swapping camelCase handling) would pass all four data-driven suites. The headline functionality is effectively unverified by these tests.

Suggested fix:
For each parsed modification in these tests, assert the expected UnimodId against the ground-truth case (add an expected-id list/field to the test data, or at minimum assert the id at a known position equals the value embedded in the format string).

Affected test scenarios:

  • Parse_GroundTruthUpperCase - PEPC[UNIMOD:4]IDE - modification at C has UnimodId == 4
  • Parse_GroundTruthLowerCase - PEPC[unimod:4]IDE - modification at C has UnimodId == 4
  • Parse_GroundTruthNoLabel - PEPC[4]IDE - modification at C has UnimodId == 4

var oxMod = Mods.AllKnownProteinModsDictionary["Oxidation on M"];
var sequence = new CanonicalSequenceBuilder("PEPTMIDE")
.AddResidueModification(1, "UnknownMod")
.AddResidueModification(4, "Oxidation", mzLibModification: oxMod)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major - Resolution-priority test does not create contention, so precedence is never proven

Problem:
ResolveModificationsForProjection_FallbackResolvesBeforePrimaryLookup claims to verify that caller knownMods win over the serializer's lookup (the PR's stated priority change). The setup makes this impossible to verify: the serializer's SelectiveResolveLookup is keyed on "Oxidation", but the modification's OriginalRepresentation is "Oxidation on M". SelectiveResolveLookup.TryResolve does an exact TryGetValue on OriginalRepresentation, so the primary lookup returns null and cannot resolve this mod at all. Only knownMods can resolve it, so the test passes identically whether knownMods are consulted first or last — it proves nothing about precedence. The naming is also inverted versus the PR ("knownMods first" is labeled the "fallback").

Impact:
The PR's core resolution-priority change ("prevents custom mods like Metal:Cation:Fe[III] on X from matching wrong global entries") is exactly the contention case — both lookups can resolve the same token to different mods — and no test exercises it. A regression flipping the order would not be caught.

Suggested fix:
Construct a true conflict: have both the serializer lookup and knownMods resolve the same OriginalRepresentation to two different Modification instances, then assert the resolved result is the one from knownMods. Realign the test/parameter names with the PR's "knownMods first" terminology.

Affected test scenarios:

  • ResolveModificationsForProjection - same token resolvable by both lookups to different mods - knownMods result wins
  • ResolveModificationsForProjection - knownMods empty, serializer lookup resolves - serializer result used

#region Resolution by Formula Tests

[Test]
public void TryResolve_ByFormula_ResolvesFromDictionary()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major - No regression test for FilterByUnimodId substring collision (4 matching 14/24/44)

Problem:
The PR's headline FilterByUnimodId fix changes a Contains match to an exact match so that id 4 no longer matches 14, 24, 34, 44.... The UNIMOD resolution tests only cover positive resolution (TryResolve_ByUnimodId_ResolvesFromDictionary, …ByUnimodIdInAccession…) and a not-present id (TryResolve_UnimodIdNotInDictionary_ReturnsNull with id 999). There is no test that places a mod with a superstring UNIMOD id (e.g., UNIMOD:44) in the dictionary and asserts that resolving unimodId: 4 does not match it.

Impact:
The exact bug the PR fixes — a short id substring-matching a longer registered id — has no guard. A future revert to substring matching, or a regex/StartsWith reintroduction, would silently mis-resolve and these tests would stay green.

Suggested fix:
Add a test with a dictionary containing a mod whose UNIMOD database reference/accession is 44 (and ideally 14), resolve a canonical mod with unimodId: 4, and assert the result is null (or resolves only to the genuine id-4 entry when both are present).

Affected test scenarios:

  • TryResolve_ByUnimodId - dictionary holds only UNIMOD:44, query unimodId 4 - returns null
  • TryResolve_ByUnimodId - dictionary holds both UNIMOD:4 and UNIMOD:44, query 4 - resolves to the id-4 mod

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants