
Internal ions squared #1022

Open
trishorts wants to merge 45 commits into smith-chem-wisc:master from trishorts:intenenalIonsSquared


trishorts (Contributor) commented Feb 20, 2026

Summary

Adds internal fragment ion prediction and analysis to mzLib. Internal fragments arise from two simultaneous backbone cleavages, producing an ion that contains neither terminus of the peptide (annotated here as bIb[start-end]+charge, e.g. bIb[3-7]+1 spans residues 3–7). Standard prediction models (Prosit, ms2pip, …) only model the terminal b/y series; this PR contributes the internal series so a predicted spectral library can include both.
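
To make the annotation format concrete, here is a minimal sketch that formats and parses strings such as bIb[3-7]+1. The `InternalIonAnnotation` type and its `TryParse` helper are hypothetical names invented for illustration; they are not mzLib API, and mzLib's own parsing (in SpectralLibrary.ReadFragmentIon) may behave differently. The sketch assumes 1-based, inclusive residue indices, as the example in the summary suggests.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper type for illustration only; not part of mzLib.
public readonly record struct InternalIonAnnotation(int Start, int End, int Charge)
{
    // Matches strings such as "bIb[3-7]+1".
    private static readonly Regex Pattern =
        new(@"^bIb\[([0-9]+)-([0-9]+)\]\+([0-9]+)$", RegexOptions.Compiled);

    // Number of residues spanned, assuming inclusive indices.
    public int Length => End - Start + 1;

    public override string ToString() => $"bIb[{Start}-{End}]+{Charge}";

    public static bool TryParse(string text, out InternalIonAnnotation ion)
    {
        ion = default;
        var m = Pattern.Match(text ?? string.Empty);

        // Reject non-matching text, numeric overflow, reversed spans, and zero charge.
        if (!m.Success
            || !int.TryParse(m.Groups[1].Value, NumberStyles.None, CultureInfo.InvariantCulture, out int start)
            || !int.TryParse(m.Groups[2].Value, NumberStyles.None, CultureInfo.InvariantCulture, out int end)
            || !int.TryParse(m.Groups[3].Value, NumberStyles.None, CultureInfo.InvariantCulture, out int charge)
            || start < 1 || end < start || charge < 1)
        {
            return false;
        }

        ion = new InternalIonAnnotation(start, end, charge);
        return true;
    }
}
```

Under these assumptions, parsing "bIb[3-7]+1" gives Start = 3, End = 7, Charge = 1 and a Length of 5 residues, while a terminal annotation such as "b3+1" is rejected.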

The work has three independent pieces:

  1. InternalFragmentIntensityModel — a local ONNX model that predicts internal-ion intensities.
  2. MixedModels framework — merges a primary (Koina) model's b/y ions with the internal model's ions into one LibrarySpectrum per peptide.
  3. InternalIons analysis pipeline — post-search extraction of observed internal ions from PSMs, with feature engineering and TSV output (used to train/validate the model).

1. InternalFragmentIntensityModel (local ONNX)

PredictionClients/LocalModels/InternalFragmentIntensityModel.cs + shipped model InternalFragmentScorer_v3_AllProteases.onnx.

Predicts TIC-normalized intensity (TicNI) for internal-fragment candidates of length 3–9. It is protease-agnostic (trained on LysC, Trypsin, ArgC, GluC, AspN, Chymotrypsin) and NCE-agnostic (collision energy is not an explicit input). The model scores 18 engineered features per candidate (fragment length, basic-residue counts, terminal-ion support, proline/aspartate flags, hydrophobicity, etc.), keeps the top N per peptide (default 20), and normalizes intensities to max = 1.0. Each surviving ion becomes a Product with SecondaryProductType/SecondaryFragmentNumber set, which is how mzLib marks an ion as internal (MatchedFragmentIon.IsInternalFragment == true).

using System.Collections.Generic;       // List<T>
using PredictionClients.LocalModels;

var peptides = new List<string> { "PEPTIDEK", "ELVISLIVESK" };
var charges  = new List<int>    { 2, 2 };
var rts      = new List<double?>{ 30.1, 42.5 };

// onnxModelPath is resolved automatically next to the assembly if omitted
var model = new InternalFragmentIntensityModel(
    peptides, charges, rts, out var warnings,
    minIntensityFilter: 0.002,
    maxInternalIonsPerPeptide: 20);

await model.RunInferenceAsync();          // single-shot; disposes itself afterward
var spectra = model.PredictedSpectra;     // List<LibrarySpectrum>, internal ions only

Pass spectralLibrarySavePath: to also write the library to an MSP file.
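
The post-processing described above (drop weak candidates, keep the top N, rescale so the strongest ion is 1.0) can be sketched as follows. This is an illustration under stated assumptions, not the model's actual code: the `InternalIonPostProcessing` class and `KeepTopAndNormalize` method are hypothetical names, and the order of operations (filter, then top N, then normalize) and the point at which the minimum-intensity filter is applied are assumed rather than taken from the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of mzLib.
public static class InternalIonPostProcessing
{
    public static List<(string Annotation, double Intensity)> KeepTopAndNormalize(
        IEnumerable<(string Annotation, double Intensity)> candidates,
        double minIntensity = 0.002,   // mirrors minIntensityFilter in the example above
        int maxIons = 20)              // mirrors maxInternalIonsPerPeptide
    {
        // Assumed order: filter on the raw predicted value, then rank, then truncate.
        var kept = candidates
            .Where(c => c.Intensity >= minIntensity)
            .OrderByDescending(c => c.Intensity)
            .Take(maxIons)
            .ToList();

        // Nothing survived, or nothing positive to scale by: return as-is.
        if (kept.Count == 0 || kept[0].Intensity <= 0) return kept;

        // Rescale so the most intense surviving ion has intensity 1.0.
        double max = kept[0].Intensity;
        return kept
            .Select(c => (c.Annotation, Intensity: c.Intensity / max))
            .ToList();
    }
}
```

For example, candidates with intensities 0.5, 0.001, and 0.25 and maxIons = 2 yield two ions with intensities 1.0 and 0.5; the 0.001 candidate falls below the filter.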


2. MixedModels — combine primary + internal ions

PredictionClients/MixedModels/ (CombinedLibraryModel, PrimaryIntensityComponent / InternalIntensityComponent, LibrarySpectrumMerger, MixedModelResult, IMixedModelComponent).

CombinedLibraryModel runs each component in parallel and unions their fragment ions per peptide (keyed by sequence/charge), so the merged spectrum carries both b/y and internal ions. A failing component (e.g. a Koina outage) is captured rather than aborting the run, so you still get an internal-only library.

using PredictionClients.LocalModels;   // InternalFragmentIntensityModel
using PredictionClients.MixedModels;
using PredictionClients.Koina.SupportedModels.FragmentIntensityModels;

var internalModel = new InternalFragmentIntensityModel(peptides, charges, rts, out _);
var primaryModel  = new Prosit2020IntensityHCD();   // any Koina FragmentIntensityModel

var combined = CombinedLibraryModel.WithPrimaryAndInternalFragments(
    primaryModel, internalModel,
    collisionEnergy: 30,                        // passed to the primary model
    spectralLibrarySavePath: "combined.msp");   // optional

var warning  = await combined.RunAsync();
var spectra  = combined.PredictedSpectra;       // primary b/y + internal ions, merged

The factory derives the primary model's inputs and aligned retention times from the internal model (both operate on the same peptide set). For full control, construct the component list directly via the general CombinedLibraryModel(IReadOnlyList<IMixedModelComponent>, …) constructor.
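
The run-in-parallel, capture-failures, union-per-peptide behavior described for CombinedLibraryModel can be sketched in isolation. Everything below is a simplified stand-in: `MergeSketch`, `Capture`, and `RunAllAsync` are hypothetical names, fragment ions are reduced to annotation strings, and the "sequence/charge" key format is an assumption. The real classes (IMixedModelComponent, MixedModelResult, LibrarySpectrumMerger) are not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical sketch for illustration only; not the mzLib implementation.
public static class MergeSketch
{
    // Runs one component and converts any exception into a recorded error message.
    private static async Task<(Dictionary<string, List<string>> Ions, string Error)> Capture(
        Func<Task<Dictionary<string, List<string>>>> component)
    {
        try
        {
            return (await component(), "");
        }
        catch (Exception ex)
        {
            // The error text is always non-empty, so "" reliably means success.
            return (new Dictionary<string, List<string>>(), ex.GetType().Name + ": " + ex.Message);
        }
    }

    public static async Task<(Dictionary<string, List<string>> Merged, List<string> Errors)> RunAllAsync(
        IReadOnlyList<Func<Task<Dictionary<string, List<string>>>>> components)
    {
        // Start every component, then wait for all of them; none can fault the batch.
        var outcomes = await Task.WhenAll(components.Select(c => Capture(c)));

        var merged = new Dictionary<string, List<string>>();
        var errors = new List<string>();
        foreach (var (ions, error) in outcomes)
        {
            if (error.Length > 0) { errors.Add(error); continue; }

            // Union the fragment ions under the shared "sequence/charge" key.
            foreach (var (key, fragments) in ions)
            {
                if (!merged.TryGetValue(key, out var list))
                    merged[key] = list = new List<string>();
                list.AddRange(fragments);
            }
        }
        return (merged, errors);
    }
}
```

With one component returning b/y ions, one returning internal ions, and one that throws, the merged dictionary still contains both sets of ions for the shared key and the error list holds a single entry.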


3. InternalIons analysis pipeline (observed ions)

Readers/InternalIons/ (InternalFragmentIon, InternalFragmentFeatureExtractor, InternalFragmentAnalysisRunner, InternalFragmentTsvWriter). Independent of the Koina/prediction code.

Given a search result and the raw files, it extracts the internal fragments that were actually matched, computes the same 18 features the model uses, and writes a TSV — the data behind training/validating the ONNX model.

using Readers.InternalIons;

var ions = InternalFragmentAnalysisRunner.Run(
    psmTsvPath:       @"…\AllPSMs.psmtsv",
    rawFileFolder:    @"…\rawfiles",
    outputDirectory:  @"…\out");          // writes InternalFragmentIons.tsv

Notes

  • This branch was merged up to current master and the prediction/MixedModels code was re-ported onto the rewritten Koina FragmentIntensityModel API (Predict(inputs) + GenerateLibrarySpectraFromPredictions(alignedRetentionTimes, …)); behavior is preserved.
  • The internal-ion MSP round-trip is supported by the additions to LibrarySpectrum and SpectralLibrary.ReadFragmentIon (parses/writes bIb[start-end] annotations).
  • Live-Koina and machine-specific evaluation tests are tagged with a category or [Explicit] and are excluded from normal runs.

codecov Bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 39.23648% with 573 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.39%. Comparing base (92be818) to head (b3bf111).

⚠️ Current head b3bf111 differs from pull request most recent head f03ec80

Please upload reports for the commit f03ec80 to get more accurate results.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...s/InternalIons/InternalFragmentFeatureExtractor.cs | 0.00% | 277 Missing ⚠️ |
| mzLib/Readers/InternalIons/InternalIon.cs | 0.00% | 118 Missing ⚠️ |
| .../Readers/InternalIons/InternalFragmentTsvWriter.cs | 0.00% | 79 Missing ⚠️ |
| ...ers/InternalIons/InternalFragmentAnalysisRunner.cs | 0.00% | 58 Missing ⚠️ |
| ...nts/LocalModels/InternalFragmentIntensityModel .cs | 90.36% | 26 Missing and 6 partials ⚠️ |
| mzLib/Readers/SpectralLibrary/SpectralLibrary.cs | 84.31% | 5 Missing and 3 partials ⚠️ |
| mzLib/Omics/SpectralLibrary/LibrarySpectrum.cs | 96.42% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1022      +/-   ##
==========================================
- Coverage   81.63%   80.39%   -1.25%     
==========================================
  Files         369      302      -67     
  Lines       47605    42640    -4965     
  Branches     5649     4824     -825     
==========================================
- Hits        38863    34281    -4582     
+ Misses       7643     7512     -131     
+ Partials     1099      847     -252     

| Files with missing lines | Coverage Δ |
|---|---|
| mzLib/Omics/SpectralLibrary/LibrarySpectrum.cs | 95.55% <96.42%> (+1.49%) ⬆️ |
| mzLib/Readers/SpectralLibrary/SpectralLibrary.cs | 85.49% <84.31%> (+0.11%) ⬆️ |
| ...nts/LocalModels/InternalFragmentIntensityModel .cs | 90.36% <90.36%> (ø) |
| ...ers/InternalIons/InternalFragmentAnalysisRunner.cs | 0.00% <0.00%> (ø) |
| .../Readers/InternalIons/InternalFragmentTsvWriter.cs | 0.00% <0.00%> (ø) |
| mzLib/Readers/InternalIons/InternalIon.cs | 0.00% <0.00%> (ø) |
| ...s/InternalIons/InternalFragmentFeatureExtractor.cs | 0.00% <0.00%> (ø) |

... and 150 files with indirect coverage changes

trishorts and others added 10 commits June 3, 2026 06:35
# Conflicts:
#	mzLib/PredictionClients/Koina/AbstractClasses/FragmentIntensityModel.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/SpectralLibraryData/librarySpectrumInternalIons.msp
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/SpectralLibraryReaderTest.cs
#	mzLib/Test/KoinaTests/RetentionTimePrediction/Prosit2019iRTTests.cs
#	mzLib/Test/MassSpectrometryTests/TestSpectralSimilarity.cs
#	mzLib/Test/Test.csproj
#	mzLib/Test/TestLibrarySpectrum.cs
#	mzLib/mzLib.nuspec
Upstream master replaced the FragmentIntensityModel/KoinaModelBase abstraction
(constructor-fed inputs + RunInferenceAsync + virtual
GenerateLibrarySpectraFromPredictions(out)) with a
Predict(inputs)/GenerateLibrarySpectraFromPredictions(alignedRetentionTimes, ...)
flow. Re-port the internal-ions feature onto it so the merged branch compiles and
the feature behaves as before:

- InternalFragmentIntensityModel: implement the new abstract members
  (batching/throttling props, ToBatchedRequests(validInputs), mapping/handling
  modes), pass an ISequenceConverter to base, keep its own input/output state as
  plain (non-override) properties, reuse the base Predictions list, and rename its
  internal-ion spectra builder to GenerateInternalLibrarySpectraFromPredictions
  (base method is no longer virtual). Provide a local IDisposable + inline MSP save
  (SavePredictedSpectralLibrary no longer exists on the base). Update
  PeptideFragmentIntensityPrediction construction for the added
  ValidatedFullSequence field.
- PrimaryIntensityComponent: drive the wrapped Koina model via Predict +
  GenerateLibrarySpectraFromPredictions(alignedRetentionTimes) instead of the
  removed RunInferenceAsync/PredictedSpectra.
- CombinedLibraryModel.WithPrimaryAndInternalFragments: derive the primary model's
  inputs and aligned RTs from the internal model (same peptide set); add optional
  collisionEnergy.
- Tests: construct Prosit2020IntensityHCD/Prosit2019iRT via the new API; pass
  collisionEnergy through the factory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…minalAcidic

Neither flag was an input to the trained ONNX model (ComputeFeatures emits a
fixed 18-feature vector containing neither) and neither had recorded predictive
value. HasMetalOnTerminalAcidic was additionally dead: its EndsWith("on D"/"on E")
test ran against strings already suffixed with " at position N", so it was
hard-wired false. Remove both properties, their TSV header/value columns, the
feature-extractor population, and the Step11 correlation references.

Addresses 1 finding from the PR smith-chem-wisc#1022 review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ctra

- InternalFragmentIntensityModel: candidate enumeration anchored spans at the
  N-/C-terminus (i==0 / j==n-1), emitting terminal b/y ions mislabeled as
  internal bIb[...]. Constrain to i>=1 and j<=n-2 so both ends are real
  internal cleavages.
- MixedModelResult.FromSpectra: ToDictionary(s => s.Name) threw on the first
  duplicate Sequence/Charge; since it runs inside each component's try/catch it
  silently became FromError and dropped all of that component's spectra.
  Collapse duplicates last-wins instead.
- PrimaryIntensityComponent: correct the class doc that claimed the component
  disposes the model; the rewritten Predict path does not and the base model is
  not IDisposable.
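
The enumeration constraint from the first bullet (i >= 1 and j <= n-2) can be sketched as below. This is an illustration, not the code in InternalFragmentIntensityModel: the `InternalSpanEnumerator` class is a hypothetical name, indices are assumed to be 0-based and inclusive (matching the i==0 / j==n-1 wording), the 3 to 9 length window is taken from the PR summary, and the input is assumed to be a bare residue string with no modification notation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch for illustration only; not the mzLib implementation.
public static class InternalSpanEnumerator
{
    // Yields 0-based inclusive (Start, End) residue spans that touch neither terminus.
    public static IEnumerable<(int Start, int End)> Enumerate(
        string sequence, int minLength = 3, int maxLength = 9)
    {
        int n = sequence.Length;

        // i >= 1 excludes spans anchored at the N-terminus (those are b ions).
        for (int i = 1; i <= n - 2; i++)
        {
            // j <= n - 2 excludes spans anchored at the C-terminus (those are y ions).
            for (int j = i + minLength - 1; j <= n - 2 && j - i + 1 <= maxLength; j++)
            {
                yield return (i, j);
            }
        }
    }
}
```

For the 8-residue peptide PEPTIDEK this yields 10 spans, none of which starts at index 0 or ends at index 7; a 4-residue peptide yields none, because no span of length 3 fits strictly inside it.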

Addresses 3 findings from the PR smith-chem-wisc#1022 review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
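
The duplicate-key failure fixed in MixedModelResult.FromSpectra in the commit above is easy to reproduce in isolation. The sketch below contrasts Enumerable.ToDictionary, which throws ArgumentException on a repeated key, with an indexer-based loop that collapses duplicates last-wins. The `Spectrum` record and the `DuplicateKeySketch` class are hypothetical stand-ins, not the mzLib types, and the actual fix may be written differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch for illustration only; not the mzLib implementation.
public static class DuplicateKeySketch
{
    // Stand-in for a library spectrum; only the key (Name) matters here.
    public sealed record Spectrum(string Name, int PeakCount);

    // Throws ArgumentException as soon as two spectra share a Name.
    public static Dictionary<string, Spectrum> Strict(IEnumerable<Spectrum> spectra) =>
        spectra.ToDictionary(s => s.Name);

    // Last-wins: the indexer overwrites an existing entry instead of throwing.
    public static Dictionary<string, Spectrum> LastWins(IEnumerable<Spectrum> spectra)
    {
        var byName = new Dictionary<string, Spectrum>();
        foreach (var s in spectra)
        {
            byName[s.Name] = s;
        }
        return byName;
    }
}
```

Last-wins is only one possible policy; keeping the first entry or merging the peaks would be equally valid designs, depending on which duplicate should be trusted.
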
… fixture

- SpectralLibraryReaderTest: the two internal-ion tests pointed at the pre-merge
  SpectralLibrary\SpectralLibraryData\ path; the master merge relocated the data
  to FileReadingTests\SpectralLibraryTests\SpectralLibraryData\, so they threw
  FileNotFoundException. Repoint to the new path.
- MixedModelEvaluationTests: OneTimeSetUp opened a StreamWriter under a
  developer-local F:\ tree before the Assert.Ignore guard, erroring the whole
  fixture off-machine. Check Directory.Exists(BaseDir) first and ignore cleanly;
  flag the fixture for removal (exploratory troubleshooting only).

Addresses 2 findings from the PR smith-chem-wisc#1022 review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- HasModifiedResidue now keys off ModificationsInInternalFragment (non-empty)
  instead of only the four counted categories, so a Common Fixed / UniProt mod
  no longer reports HasModifiedResidue=false while the mod string is populated.
- ReadFromTsv rethrows a row's parse failure with line + file context instead
  of swallowing it in an empty catch (and guards genuinely blank lines).
- ExtractSingleInternalFragment uses product?.NeutralMass ?? NaN, matching the
  null-tolerant contract of ParseInternalFragmentFromProduct rather than
  dereferencing a possibly-null product.

Addresses 3 minor findings from the PR smith-chem-wisc#1022 review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…op dead using

- GenerateInternalLibrarySpectraFromPredictions resets PredictedSpectra to a
  fresh list so repeated calls no longer accumulate duplicate spectra.
- AllowedPrecursorCharges returns a single static readonly HashSet instead of
  allocating a new set on every access in the constructor validation loop.
- Remove the unused System.ComponentModel using from IMixedModelComponent.

Addresses 3 minor findings from the PR smith-chem-wisc#1022 review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hange

SpectralLibrary.cs carries the substantive internal-ion ReadFragmentIon work;
revert the unrelated stylistic noise so the diff stays focused: restore the
deleted using block (System, System.Collections.Generic, System.IO, System.Linq,
ThermoFisher.CommonCore.Data.Business) and the trailing end-of-file newline.

Addresses 2 cosmetic findings from the PR smith-chem-wisc#1022 review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
InternalFragmentStep11AnalysisTests is an exploratory analysis harness, not a
validating test: its [Test] methods emit a formatted report with no assertions
and read/write a developer-local F:\ tree whose input filename has drifted, so
it self-ignores even on the author's machine. Mark it TODO(remove).

Addresses 2 minor findings from the PR smith-chem-wisc#1022 review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add InternalFragmentPipelineTests covering the four previously-uncovered
Readers/InternalIons files, lifting each from 0%:
- InternalFragmentIon -> 100% (header/value arity, computed properties,
  mass-accuracy filter true/false paths)
- InternalFragmentTsvWriter -> 94.5% (write/read round-trip, missing-file guard)
- InternalFragmentFeatureExtractor -> 91.3% (ExtractFromPsms over the committed
  internalIons.psmtsv via the null-tolerant MsDataFile path; empty-input)
- InternalFragmentAnalysisRunner (Run end-to-end with a renamed small mzML/mgf,
  missing-PSM and missing-raw-folder guards, warn-and-continue on an unreadable
  file). Remaining gap is the Thermo .raw loader arm and an unreachable switch
  default.

No raw spectra committed: reuses internalIons.psmtsv and existing DataFiles
fixtures (sliced_ethcd.mzML, tester.mgf).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>