support UNIMOD format parsing and centralized mod-dictionary projection by nbollis · Pull Request #1050 · smith-chem-wisc/mzLib

nbollis · 2026-04-16T22:53:05Z

Problem

The PeptideWithSetModifications and OligoWithSetMods string constructors crashed when given UNIMOD-format sequences (e.g., PEPTI[UNIMOD:4]DE). GetModificationDictionaryFromFullSequence extracted bracket content such as UNIMOD:4, stripped the prefix to get modId = "4", looked that ID up in a dictionary keyed by mzLib names (e.g., "Oxidation on M"), found no entry, and threw MzLibException. This blocked the Koina/FragmentIntensityModel prediction pipeline, which stores ValidatedFullSequence in UNIMOD format.

Changes

New UNIMOD parser (UnimodSequenceParser.cs)

Parses PEPTI[UNIMOD:4]DE into CanonicalSequence with UnimodId set, enabling the existing FilterByUnimodId resolution path
Registered in SequenceConversionService.Default with converters to mzLib and UNIMOD serializers
MzLibSequenceParser.CanParse now rejects UNIMOD-bracket tokens to prevent format misidentification

Serializer mod-dictionary projection (ISequenceSerializer, SequenceSerializerBase)

Added ToOneIsNterminusModificationDictionary() to ISequenceSerializer and virtual implementation in SequenceSerializerBase
Maps CanonicalModification positions to OneIsNterminus indices and resolves modifications
KnownModsLookup (new class) wraps caller-provided Dictionary<string, Modification> as IModificationLookup
Resolution priority: caller's knownMods first, then serializer's global lookup — prevents custom mods (e.g., Metal:Cation:Fe[III] on X) from matching wrong global entries

Centralized parsing path (IBioPolymerWithSetMods.cs, SequenceConversionExtensions.cs)

GetModificationDictionaryFromFullSequence now delegates to SequenceConversionService.Default.ParseToOneIsNterminusModificationDictionary()
Removed 50+ lines of manual bracket/token parsing and UNIMOD fallback helpers
Empty-string fast path preserved; SequenceConversionException wrapped as MzLibException for backward compatibility

Bug fixes (ModificationLookupBase.cs, MzLibSequenceSerializer.cs)

FilterByUnimodId: changed Contains to exact match with UNIMOD: prefix stripping (was matching ID 4 to 4, 14, 24, 34, 44...)
SelectBestCandidate: returns null when no target residue and candidates span multiple distinct residues (was picking arbitrary match for ambiguous mods like Phospho)
MzLibSequenceSerializer.TryNormalizeStrictToken: rejects "UNIMOD" as mzLib mod type, forcing enrichment through lookup pipeline


codecov · 2026-04-17T00:56:08Z

Codecov Report

❌ Patch coverage is 92.88256% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.72%. Comparing base (92be818) to head (3370691).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ion/Implementations/Unimod/UnimodSequenceParser.cs | 80.43% | 6 Missing and 3 partials ⚠️ |
| ...ceConversion/BaseClasses/ModificationLookupBase.cs | 77.77% | 2 Missing and 4 partials ⚠️ |
| ...ceConversion/BaseClasses/SequenceSerializerBase.cs | 98.43% | 0 Missing and 2 partials ⚠️ |
| ...rsion/Implementations/MzLib/MzLibSequenceParser.cs | 81.81% | 1 Missing and 1 partial ⚠️ |
| ...rsion/Implementations/KnownMods/KnownModsLookup.cs | 97.72% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1050      +/-   ##
==========================================
+ Coverage   81.63%   81.72%   +0.08%     
==========================================
  Files         369      371       +2     
  Lines       47605    47830     +225     
  Branches     5649     5696      +47     
==========================================
+ Hits        38863    39088     +225     
+ Misses       7643     7641       -2     
- Partials     1099     1101       +2

| Files with missing lines | Coverage Δ | |
|---|---|---|
| mzLib/Omics/IBioPolymerWithSetMods.cs | `98.71% <100.00%> (+2.99%)` | ⬆️ |
| mzLib/Omics/Modifications/Mods.cs | `95.68% <100.00%> (ø)` | |
| ...quenceConversion/BaseClasses/SequenceParserBase.cs | `72.66% <100.00%> (+5.77%)` | ⬆️ |
| ...version/CanonicalSequence/CanonicalModification.cs | `87.12% <100.00%> (+0.99%)` | ⬆️ |
| ...n/Implementations/MzLib/MzLibSequenceSerializer.cs | `65.21% <100.00%> (+1.58%)` | ⬆️ |
| ...cs/SequenceConversion/SequenceConversionService.cs | `93.75% <100.00%> (+1.39%)` | ⬆️ |
| ...rsion/Implementations/KnownMods/KnownModsLookup.cs | `97.72% <97.72%> (ø)` | |
| ...ceConversion/BaseClasses/SequenceSerializerBase.cs | `98.36% <98.43%> (+6.98%)` | ⬆️ |
| ...rsion/Implementations/MzLib/MzLibSequenceParser.cs | `90.90% <81.81%> (+2.67%)` | ⬆️ |
| ...ceConversion/BaseClasses/ModificationLookupBase.cs | `79.32% <77.77%> (-1.22%)` | ⬇️ |
... and 1 more

... and 1 file with indirect coverage changes


pcruzparri

TY for patching this!

nbollis · 2026-04-29T16:34:44Z

@copilot resolve the merge conflicts in this pull request. The testing directory has been rearranged with proper namespaces and grouping, and TestPeptideWithSetMods has been split into several classes.

trishorts

Automated code review - 6 finding(s) at severity >= Major.

trishorts · 2026-05-26T13:45:20Z

        LoadAllProteinModifications();
-        AllProteinModsList = UnimodModifications.Concat(UniprotModifications).Concat(MetaMorpheusProteinModifications).ToList();
+        AllProteinModsList = MetaMorpheusProteinModifications.Concat(UniprotModifications).Concat(UnimodModifications).ToList();
        AllKnownProteinModsDictionary = AllProteinModsList


Major - Reordered concat inverts dedup precedence for duplicate IdWithMotif mods

Problem:
The only functional change in this file reverses the source order used to build AllProteinModsList. Base: UnimodModifications.Concat(UniprotModifications).Concat(MetaMorpheusProteinModifications); new: MetaMorpheusProteinModifications.Concat(UniprotModifications).Concat(UnimodModifications). Because the very next statement applies DistinctBy(m => m.IdWithMotif) — which keeps the first element per key — this reversal silently flips which Modification instance is retained whenever an IdWithMotif (e.g. "Oxidation on M", "Phospho on S") exists in more than one source. Previously the Unimod-sourced instance won; now the MetaMorpheus-sourced instance wins. The unchanged comment a few lines below ("Protein mods taking precedence") documents only protein-vs-RNA precedence, so this new intra-protein precedence is undocumented.

Impact:
Every consumer keyed off AllKnownProteinModsDictionary, AllProteinModsList, AllModsKnownDictionary, and AllKnownMods (the latter feeds GetModification(id, ModificationNamingConvention.Mixed) via FirstOrDefault) now resolves a different Modification object for collided IDs — potentially with different mass, database references, or a missing UnimodId. This directly intersects the PR's own goal: if a retained MetaMorpheus-sourced instance lacks the UnimodId that the Unimod-sourced duplicate carried, the new FilterByUnimodId resolution path can fail to resolve a UNIMOD token it previously would have matched. The change is silent and app-wide.

Suggested fix:
Confirm the intended intra-protein precedence and make it explicit. If MetaMorpheus-first is intentional, document it in the comment and ensure the new UNIMOD resolution path does not depend on UnimodId being present on the retained instance (e.g. merge UnimodId/metadata across duplicates rather than discarding the loser). If Unimod metadata must remain reachable for FilterByUnimodId, either preserve Unimod-first ordering for the dedup or carry UnimodId onto the winning entry. Add a regression test pinning the precedence.

Affected test scenarios:

AllKnownProteinModsDictionary - IdWithMotif present in both Unimod and MetaMorpheus sources - asserts which instance (and its UnimodId/mass) is retained

FilterByUnimodId resolution - sequence PEPTI[UNIMOD:4]DE where the matching IdWithMotif also exists as a MetaMorpheus mod without UnimodId - still resolves to the correct modification

GetModification(id, ModificationNamingConvention.Mixed) - id colliding across sources - returns the documented-precedence instance

AllProteinModsList ordering - duplicate IdWithMotif across all three sources - deterministic, documented winner
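
To make the precedence flip concrete, here is a minimal, self-contained sketch. `Mod` is a hypothetical stand-in for mzLib's Modification, and the data (including the UnimodId value) is illustrative only. It relies on `DistinctBy` yielding the first element seen per key, which is how current .NET implementations behave and what this finding assumes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for mzLib's Modification; only the fields this sketch needs.
public sealed record Mod(string IdWithMotif, string Source, int? UnimodId);

public static class DedupPrecedenceDemo
{
    // Concatenates the sources in the order given, then keeps one entry per IdWithMotif.
    // DistinctBy yields the first element seen for each key, so source order decides the winner.
    public static Dictionary<string, Mod> Build(params IEnumerable<Mod>[] sources) =>
        sources.SelectMany(s => s)
               .DistinctBy(m => m.IdWithMotif)
               .ToDictionary(m => m.IdWithMotif);

    public static void Main()
    {
        // Illustrative data only; not taken from the real modification databases.
        var unimod = new[] { new Mod("Oxidation on M", "Unimod", 35) };
        var metaMorpheus = new[] { new Mod("Oxidation on M", "MetaMorpheus", null) };

        var unimodFirst = Build(unimod, metaMorpheus);
        var metaMorpheusFirst = Build(metaMorpheus, unimod);

        Console.WriteLine(unimodFirst["Oxidation on M"].Source);
        Console.WriteLine(metaMorpheusFirst["Oxidation on M"].Source);
        Console.WriteLine(metaMorpheusFirst["Oxidation on M"].UnimodId.HasValue);
    }
}
```

A regression test pinning the precedence could take the same shape: build the dictionary from colliding entries and assert which source's instance, and which UnimodId, survives.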

trishorts · 2026-05-26T13:45:20Z

        if (hasParentheses && !hasSquareBrackets)
-            return false;
+            return IsCrosslinkAnnotatedSequence(input);



Major - Broadened CanParse claims parenthetical crosslink sequences without demonstrated round-trip

Problem:
CanParse previously returned false for inputs that have parentheses but no square brackets; the changed branch now returns IsCrosslinkAnnotatedSequence(input), so the mzLib parser now claims sequences like "EKVLTSSAR(2)". This is a behavioral broadening: it both (a) changes which registered parser is selected for parenthetical inputs in SequenceConversionService.Default, and (b) requires that the downstream parse path actually consume the (\d+) annotations rather than treating (, ), and digits as base-sequence content. ParseModificationString only handles bracket content and is unchanged, so nothing in the visible diff demonstrates the parenthetical annotation is parsed/stripped.

Impact:
If the parse path does not strip the (\d+) annotations, accepting them in CanParse converts a previously-clean rejection into a silent mis-parse (annotation digits/parentheses embedded in the base sequence) on the centralized pipeline used by the PeptideWithSetModifications/OligoWithSetMods constructors. Even if parsing is correct, parser-selection precedence for crosslink strings has changed and is unverified.

Suggested fix:
Add explicit round-trip tests that take a crosslink-annotated string through the full centralized path (parse → ToOneIsNterminusModificationDictionary / re-serialize) and assert the base sequence excludes the parenthetical annotation; confirm SequenceConversionService.Default still routes the string to the intended parser.

Affected test scenarios:

ParseToOneIsNterminusModificationDictionary - "EKVLTSSAR(2)" - base sequence is EKVLTSSAR, no stray (2) characters

SequenceConversionService.Default selection - "EKVLTSSAR(2)" - resolves to the intended parser (not hijacked/regressed)

CanParse - "PEP[Oxidation on M]TIDEK(3)" (brackets + parens) - still handled via balanced-bracket path
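
The round-trip check suggested above might look like the following sketch. It assumes the crosslink annotation is a parenthesized integer such as `(2)`; `LooksCrosslinkAnnotated` and `StripAnnotations` are hypothetical helpers written for illustration and are not the PR's IsCrosslinkAnnotatedSequence or its parse path.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrosslinkAnnotationDemo
{
    // Assumed annotation shape: a parenthesized integer such as "(2)".
    private static readonly Regex Annotation = new Regex(@"\(\d+\)", RegexOptions.Compiled);

    // Hypothetical analogue of a crosslink-annotation check.
    public static bool LooksCrosslinkAnnotated(string input) => Annotation.IsMatch(input);

    // Removes every "(digits)" annotation, leaving only residue characters behind.
    public static string StripAnnotations(string input) => Annotation.Replace(input, string.Empty);

    public static void Main()
    {
        const string input = "EKVLTSSAR(2)";
        var baseSequence = StripAnnotations(input);

        // The property the review asks to pin: no annotation characters leak into the base sequence.
        if (baseSequence.Contains('(') || baseSequence.Contains(')'))
            throw new InvalidOperationException("Annotation leaked into the base sequence.");

        Console.WriteLine(baseSequence);
    }
}
```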

trishorts · 2026-05-26T13:45:20Z

+            {
+                var canonical = SequenceConversionService.Default.ParseAutoDetect(fullSequence);
+                return MzLibSequenceSerializer.Instance.ToOneIsNterminusModificationDictionary(canonical!.Value, allModsKnown);
+            }


Major - canonical!.Value can throw an exception that escapes the catch and breaks the documented contract

Problem:
var canonical = SequenceConversionService.Default.ParseAutoDetect(fullSequence) is dereferenced with canonical!.Value. The null-forgiving operator ! is only needed because ParseAutoDetect has a nullable return type — i.e. the API can return null (a CanonicalSequence? that is empty) when auto-detection finds no matching parser. In that case .Value throws InvalidOperationException ("Nullable object must have a value"), which is not a SequenceConversionException and therefore is not caught by the catch (SequenceConversionException e) block on line 114. The method's XML doc explicitly promises MzLibException "When a full sequence is not in the correct format," so an unparseable input must surface as MzLibException, not a raw InvalidOperationException.

Impact:
A malformed or unrecognized-format full sequence (the exact failure case the old manual parser handled by throwing MzLibException) now throws an uncaught InvalidOperationException/NullReferenceException instead. Callers in the Koina/prediction and digestion pipelines that catch MzLibException will no longer catch it, turning a recoverable parse error into an unhandled crash and silently regressing backward compatibility.

Suggested fix:
Either guarantee ParseAutoDetect throws SequenceConversionException on failure (and drop the !), or explicitly handle a null/valueless result here by throwing the wrapped MzLibException with the "Could not find modification while reading string" message, so every unparseable input exits through the documented exception type.

Affected test scenarios:

GetModificationDictionaryFromFullSequence - input is gibberish/unrecognized format that yields a null ParseAutoDetect result - expect MzLibException (not InvalidOperationException)

GetModificationDictionaryFromFullSequence - input has an unterminated bracket like PEPTI[UNIMOD:4DE - expect MzLibException

GetModificationDictionaryFromFullSequence - valid UNIMOD sequence PEPTI[UNIMOD:4]DE - expect non-null dictionary with the mod at the correct OneIsNterminus index
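
The escape path can be reproduced in isolation. In this sketch the two exception classes and `ParseAutoDetect` are hypothetical stand-ins (the stand-in parser returns an `int?` and signals "no parser matched" on unbalanced brackets); only the control flow is meant to mirror the flagged code.

```csharp
#nullable enable
using System;

// Hypothetical stand-ins for the mzLib exception types named in the finding.
public sealed class SequenceConversionException : Exception
{
    public SequenceConversionException(string message) : base(message) { }
}

public sealed class MzLibException : Exception
{
    public MzLibException(string message, Exception? inner = null) : base(message, inner) { }
}

public static class NullableGuardDemo
{
    // Stand-in for ParseAutoDetect: returns null when no parser claims the input.
    // "No parser" is simulated here by an unbalanced bracket count.
    private static int? ParseAutoDetect(string s)
    {
        int open = 0, close = 0;
        foreach (var c in s)
        {
            if (c == '[') open++;
            else if (c == ']') close++;
        }
        return open == close ? s.Length : (int?)null;
    }

    // Mirrors the flagged pattern: .Value on an empty nullable throws InvalidOperationException,
    // which the SequenceConversionException handler does not catch.
    public static int Unguarded(string s)
    {
        try { return ParseAutoDetect(s)!.Value; }
        catch (SequenceConversionException e) { throw new MzLibException(e.Message, e); }
    }

    // One possible fix: check for the empty result and throw the documented exception type.
    public static int Guarded(string s)
    {
        try
        {
            var canonical = ParseAutoDetect(s);
            if (!canonical.HasValue)
                throw new MzLibException("Could not find modification while reading string: " + s);
            return canonical.Value;
        }
        catch (SequenceConversionException e) { throw new MzLibException(e.Message, e); }
    }
}
```

With the unterminated input `PEPTI[UNIMOD:4DE`, `Unguarded` surfaces an InvalidOperationException while `Guarded` surfaces MzLibException, which is the contract the XML doc promises.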

trishorts · 2026-05-26T13:45:20Z

+    {
+        Assert.That(_parser.CanParse("PEPTIDE"), Is.False);
+    }
+


Major - Ground-truth parse tests named "SetsCorrectUnimodId" never assert any UnimodId

Problem:
Parse_GroundTruthUpperCase_SetsCorrectUnimodId, Parse_GroundTruthCamelCase_SetsCorrectUnimodId, Parse_GroundTruthLowerCase_SetsCorrectUnimodId, and Parse_GroundTruthNoLabel_SetsCorrectUnimodId claim, by name, to verify that the parser assigns the correct UnimodId. None of them ever reads modification.UnimodId. The upper-case variant only asserts BaseSequence and Modifications.Length; the camelCase/lowerCase/noLabel variants assert only BaseSequence. Setting UnimodId is the entire purpose of UnimodSequenceParser (it is what enables the FilterByUnimodId resolution path the PR depends on), yet that value is never checked.

Impact:
A regression where the parser extracts the base sequence and the right number of bracket tokens but assigns a wrong or null UnimodId (e.g., off-by-one digit parsing, dropping the integer, swapping camelCase handling) would pass all four data-driven suites. The headline functionality is effectively unverified by these tests.

Suggested fix:
For each parsed modification in these tests, assert the expected UnimodId against the ground-truth case (add an expected-id list/field to the test data, or at minimum assert the id at a known position equals the value embedded in the format string).

Affected test scenarios:

Parse_GroundTruthUpperCase - PEPC[UNIMOD:4]IDE - modification at C has UnimodId == 4

Parse_GroundTruthLowerCase - PEPC[unimod:4]IDE - modification at C has UnimodId == 4

Parse_GroundTruthNoLabel - PEPC[4]IDE - modification at C has UnimodId == 4
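
As an illustration of what the missing assertions would check, the sketch below parses the three token spellings listed above and exposes the extracted id. `UnimodTokenDemo.Parse` is a hypothetical, simplified parser written for this example; it is not UnimodSequenceParser and handles only numeric UNIMOD tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class UnimodTokenDemo
{
    // Accepts "UNIMOD:4", "unimod:4", "Unimod:4", or a bare "4" as bracket content.
    private static readonly Regex Token =
        new Regex(@"^(?:UNIMOD:)?(\d+)$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the base sequence plus, for each bracket token, the 1-based index of the
    // residue it follows (0 for a token before the first residue) and the parsed id.
    public static (string BaseSequence, List<(int Position, int UnimodId)> Mods) Parse(string input)
    {
        var sb = new StringBuilder();
        var mods = new List<(int Position, int UnimodId)>();
        int i = 0;
        while (i < input.Length)
        {
            if (input[i] != '[')
            {
                sb.Append(input[i]);
                i++;
                continue;
            }

            int close = input.IndexOf(']', i);
            if (close < 0) throw new FormatException("Unterminated bracket in: " + input);

            var match = Token.Match(input.Substring(i + 1, close - i - 1));
            if (!match.Success) throw new FormatException("Not a UNIMOD token in: " + input);

            mods.Add((sb.Length, int.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture)));
            i = close + 1;
        }
        return (sb.ToString(), mods);
    }
}
```

A test named "SetsCorrectUnimodId" would then assert both the position and the id, for example that `Parse("PEPC[unimod:4]IDE").Mods[0]` has Position 4 and UnimodId 4, rather than stopping at the base sequence.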

trishorts · 2026-05-26T13:45:20Z

+        var oxMod = Mods.AllKnownProteinModsDictionary["Oxidation on M"];
+        var sequence = new CanonicalSequenceBuilder("PEPTMIDE")
+            .AddResidueModification(1, "UnknownMod")
+            .AddResidueModification(4, "Oxidation", mzLibModification: oxMod)


Major - Resolution-priority test does not create contention, so precedence is never proven

Problem:
ResolveModificationsForProjection_FallbackResolvesBeforePrimaryLookup claims to verify that caller knownMods win over the serializer's lookup (the PR's stated priority change). The setup makes this impossible to verify: the serializer's SelectiveResolveLookup is keyed on "Oxidation", but the modification's OriginalRepresentation is "Oxidation on M". SelectiveResolveLookup.TryResolve does an exact TryGetValue on OriginalRepresentation, so the primary lookup returns null and cannot resolve this mod at all. Only knownMods can resolve it, so the test passes identically whether knownMods are consulted first or last — it proves nothing about precedence. The naming is also inverted versus the PR ("knownMods first" is labeled the "fallback").

Impact:
The PR's core resolution-priority change ("prevents custom mods like Metal:Cation:Fe[III] on X from matching wrong global entries") is exactly the contention case — both lookups can resolve the same token to different mods — and no test exercises it. A regression flipping the order would not be caught.

Suggested fix:
Construct a true conflict: have both the serializer lookup and knownMods resolve the same OriginalRepresentation to two different Modification instances, then assert the resolved result is the one from knownMods. Realign the test/parameter names with the PR's "knownMods first" terminology.

Affected test scenarios:

ResolveModificationsForProjection - same token resolvable by both lookups to different mods - knownMods result wins

ResolveModificationsForProjection - knownMods empty, serializer lookup resolves - serializer result used
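
A true contention case could be set up along these lines. This is a sketch with plain string dictionaries standing in for the two lookups; `Resolve` is a hypothetical helper, not the PR's ResolveModificationsForProjection, and the resolved values are made-up labels.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class ResolutionPriorityDemo
{
    // Consults the caller-provided mods first and falls back to the global lookup
    // only when the caller's dictionary has no entry for the token.
    public static string? Resolve(
        string token,
        IReadOnlyDictionary<string, string> knownMods,
        IReadOnlyDictionary<string, string> globalLookup)
    {
        if (knownMods.TryGetValue(token, out var fromCaller)) return fromCaller;
        return globalLookup.TryGetValue(token, out var fromGlobal) ? fromGlobal : null;
    }

    public static void Main()
    {
        // Both lookups can resolve the same token, but to different (illustrative) mods.
        var knownMods = new Dictionary<string, string> { ["Oxidation on M"] = "caller-mod" };
        var globalLookup = new Dictionary<string, string> { ["Oxidation on M"] = "global-mod" };
        var empty = new Dictionary<string, string>();

        Console.WriteLine(Resolve("Oxidation on M", knownMods, globalLookup));
        Console.WriteLine(Resolve("Oxidation on M", empty, globalLookup));
    }
}
```

Because both dictionaries hold the key, swapping the lookup order inside `Resolve` changes the first result, which is exactly the property the current test cannot observe.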

trishorts · 2026-05-26T13:45:20Z

+    #region Resolution by Formula Tests
+
+    [Test]
+    public void TryResolve_ByFormula_ResolvesFromDictionary()


Major - No regression test for FilterByUnimodId substring collision (4 matching 14/24/44)

Problem:
The PR's headline FilterByUnimodId fix changes a Contains match to an exact match so that id 4 no longer matches 14, 24, 34, 44.... The UNIMOD resolution tests only cover positive resolution (TryResolve_ByUnimodId_ResolvesFromDictionary, …ByUnimodIdInAccession…) and a not-present id (TryResolve_UnimodIdNotInDictionary_ReturnsNull with id 999). There is no test that places a mod with a superstring UNIMOD id (e.g., UNIMOD:44) in the dictionary and asserts that resolving unimodId: 4 does not match it.

Impact:
The exact bug the PR fixes — a short id substring-matching a longer registered id — has no guard. A future revert to substring matching, or a regex/StartsWith reintroduction, would silently mis-resolve and these tests would stay green.

Suggested fix:
Add a test with a dictionary containing a mod whose UNIMOD database reference/accession is 44 (and ideally 14), resolve a canonical mod with unimodId: 4, and assert the result is null (or resolves only to the genuine id-4 entry when both are present).

Affected test scenarios:

TryResolve_ByUnimodId - dictionary holds only UNIMOD:44, query unimodId 4 - returns null

TryResolve_ByUnimodId - dictionary holds both UNIMOD:4 and UNIMOD:44, query 4 - resolves to the id-4 mod
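
The collision itself is easy to demonstrate outside the library. The sketch below contrasts substring matching with exact matching after stripping a `UNIMOD:` prefix; both helpers are hypothetical illustrations of the two behaviors described in the PR, not the actual FilterByUnimodId code.

```csharp
using System;
using System.Globalization;

public static class UnimodIdMatchDemo
{
    // The old behavior as described: substring matching, so id 4 also matches "UNIMOD:44".
    public static bool ContainsMatch(string accession, int id) =>
        accession.Contains(id.ToString(CultureInfo.InvariantCulture));

    // The fixed behavior as described: strip an optional "UNIMOD:" prefix, then compare exactly.
    public static bool ExactMatch(string accession, int id)
    {
        const string prefix = "UNIMOD:";
        var trimmed = accession.Trim();
        if (trimmed.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            trimmed = trimmed.Substring(prefix.Length);

        // NumberStyles.None accepts digits only, so signs and whitespace are rejected.
        return int.TryParse(trimmed, NumberStyles.None, CultureInfo.InvariantCulture, out var parsed)
               && parsed == id;
    }

    public static void Main()
    {
        Console.WriteLine(ContainsMatch("UNIMOD:44", 4));
        Console.WriteLine(ExactMatch("UNIMOD:44", 4));
        Console.WriteLine(ExactMatch("UNIMOD:4", 4));
    }
}
```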

nbollis added 14 commits April 16, 2026 15:41

fix(dedup annotations)

2133c49

fix(redundancy): Removed redundant base sequence stripping

752ad13

fix(sequence-conversion): support UNIMOD parsing and mod resolution

962cf50

refactor(sequence-conversion): centralize OneIsNterminus mod mapping

c7cfa26

Mod Ordering fix

b5ef3c6

refactor(sequence-conversion): move mod-dictionary projection into serializers

ec9b953

refactor(sequence-conversion): route with-set-mod conversion through serializer projection

6dc17cb

refactor(sequence-conversion): simplify serializer projection resolution

80116f1

fix(koina tests): Put base stripper back in

ba8cf36

Fix some broken tests and add unimod filter in mzlib parser

09b69f3

fix(sequence-conversion): exact UNIMOD ID matching, reject UNIMOD as mzLib type, empty-string fast path

53f6cd1

refactor(sequence-conversion): replace serializer helpers with KnownModsLookup

2b04684

PR Cleanup 1

e71cfc8

small fixes

069a50b

nbollis changed the title ~~Koina update~~ support UNIMOD format parsing and centralized mod-dictionary projection Apr 17, 2026

nbollis added the enhancement label Apr 17, 2026

nbollis added 7 commits April 17, 2026 11:57

Tests

9798e07

fix(sequence-conversion): tolerate legacy dangling '-' in mzLib parser

0fd8d45

Merge branch 'master' into koinaUpdate

ef18575

Support crosslink sequences

0861bc3

Merge branch 'koinaUpdate' of https://github.com/nbollis/mzLib into koinaUpdate

8b3ed1b

Database aliasing

db15a19

one more catch

9d53216

nbollis marked this pull request as ready for review April 22, 2026 22:38

nbollis added the ready for review and Refactor labels Apr 22, 2026

pcruzparri previously approved these changes Apr 23, 2026


merge in master

9660782

nbollis dismissed pcruzparri’s stale review via 9660782 May 1, 2026 17:21

trishorts requested changes May 26, 2026


Merge branch 'master' into koinaUpdate

3370691

nbollis wants to merge 23 commits into smith-chem-wisc:master from nbollis:koinaUpdate

3 participants
