
Add MslDbBuilder: FASTA→.msl spectral-library builder for ManySearchTask #1076

Open
trishorts wants to merge 92 commits into smith-chem-wisc:master from trishorts:msldbbuilder

@trishorts

Contributor

What this PR adds

MslDbBuilder — a standalone .NET 8 console tool that turns protein FASTA databases into mzLib .msl spectral libraries. It is the offline "library compiler" for MetaMorpheus's ManySearchTask: instead of digesting thousands of FASTAs at search time, the search reads pre-built .msl files (precursor mass + charge + Chronologer iRT + sequence, optionally with stored fragments).

It is added as a new project under mzLib/MslDbBuilder/ referencing UsefulProteomicsDatabases and Readers, and it uses the .msl writer/reader API introduced in PR #1036. This PR is therefore stacked on #1036 (mzlib_speclib) — until that merges, the diff here also shows the .msl format work; it collapses to just the builder once #1036 lands. Please review/merge #1036 first.

Plain-language explanation (what it does and why)

A spectral-library search needs a peptide source: the list of candidate peptides (with their fragment masses) to compare against each spectrum. Normally MetaMorpheus generates that on the fly by digesting a FASTA every run. For ManySearchTask we want to search one sample against thousands of organisms at once (e.g. all bacteria), and re-digesting tens of thousands of FASTAs on every search is wasteful and slow.

MslDbBuilder does that digestion once, offline, and writes the result as compact indexed .msl files:

  • Reads each FASTA and digests it (trypsin, 0 missed cleavages, length ≥ 7, variable init-Met, Carbamidomethyl on C / variable Oxidation on M, max 2 mods), generating target and reversed-decoy peptides.
  • Predicts each peptide's retention time once with Chronologer (an iRT model) and stores it in the index, so the search can use an RT pre-filter.
  • Writes a lean library by default: precursor mass + charge + iRT + sequence, no stored fragments (the search re-fragments the handful of candidates that survive the precursor+RT filter — far cheaper than storing fragments for billions of peptides).
  • Can emit one .msl per organism, merge many organisms into a single indexed file (each entry tagged db|accession so the search can still report per-organism hits), or build directly into size-balanced shards for production scale.
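The digestion step in the first bullet can be sketched roughly as follows. This is a minimal Python illustration, not the tool's C# code: the function names are hypothetical, cleavage is assumed to happen after every K or R with no proline exception, the decoy is assumed to be the plainly reversed protein, and the variable init-Met and modification handling are left out.

```python
def tryptic_peptides(protein: str, min_len: int = 7) -> list[str]:
    """Fully tryptic peptides with 0 missed cleavages: cut after every K or R."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return [p for p in peptides if len(p) >= min_len]


def targets_and_decoys(protein: str) -> tuple[list[str], list[str]]:
    # Assumption: decoys come from digesting the reversed protein sequence.
    return tryptic_peptides(protein), tryptic_peptides(protein[::-1])


# Made-up protein, built piecewise so the peptide lengths are easy to see.
protein = "M" + "A" * 7 + "K" + "C" * 7 + "R" + "DDDK"
targets, decoys = targets_and_decoys(protein)
```

The trailing "DDDK" peptide is dropped in both directions because it is shorter than the length ≥ 7 cutoff.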

The payoff at search time: thousands of "databases" load as a few big indexed files instead of thousands of FASTA opens, and most candidates are eliminated by a learned precursor-mass + RT gate before any fragmentation happens.
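A precursor-mass plus RT gate of this kind could look like the sketch below. It is only an illustration in Python: the row layout, the `gate` function, and all values are invented for the example, and the real index is the binary .msl structure rather than a list of tuples.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index rows: (precursor_mz, irt, sequence), sorted by precursor_mz.
index = sorted([
    (500.27, 35.0, "PEPTIDEK"),
    (500.28, 80.0, "DIFFERENTK"),
    (612.33, 36.0, "ANOTHERPEPK"),
])
mz_keys = [row[0] for row in index]


def gate(mz_lo, mz_hi, irt_lo, irt_hi):
    """Binary-search the m/z window, then keep rows whose iRT is in range."""
    lo = bisect_left(mz_keys, mz_lo)
    hi = bisect_right(mz_keys, mz_hi)
    return [row[2] for row in index[lo:hi] if irt_lo <= row[1] <= irt_hi]


# Two rows share the m/z window, but only one also passes the iRT window.
survivors = gate(500.20, 500.30, 30.0, 40.0)
```

Only the survivors would then be fragmented and scored, which is where the saving comes from.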

Modes

  • MslDbBuilder [--verify] [--lean] <outDir> <fasta...> — FASTA → one .msl per database.
  • --shardbuild <outDir> <numShards> <fasta...|@listfile> — build N lean, size-balanced merged shards directly from FASTAs, streaming, each entry tagged db|accession, with Chronologer iRT (the production path).
  • --merge <out.msl> <msl...> — combine many .msl into one merged file.
  • --read / --probe / --checkrt / --rtcalib / --streamtest — round-trip, inspection, RT-calibration and writer-test utilities.

How we arrived at the production design (160 shards) — the sizing journey

The goal was a single searchable library covering all ~17,300 bacterial proteomes (~30 GB of input FASTA). Getting there was not obvious; here is everything we tried and why we ended where we did.

1. "Just build one giant merged .msl." — Impossible.
The .msl format stores its entry/precursor count as a 32-bit int (cap ≈ 2.147 B, i.e. 2³¹). The full bacterial set is billions of target+decoy peptide entries — well past that cap. A single merged file simply cannot represent it. → We must shard.

2. "A handful of big shards (~4–6)." — The right instinct, wrong first cut.
Sizing said ~108 bytes/entry, ~11–12× the FASTA size, ~338–377 GB total. A few shards each under the 2.1 B cap looked sufficient on paper, so the first prototype tried a small shard count.
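As a back-of-the-envelope check of why a small shard count looked sufficient, the quoted figures can be plugged into the entry cap. This Python sketch assumes 1 GB = 10^9 bytes and uses a hypothetical helper name; it considers only the 32-bit entry cap, not memory or load time.

```python
import math

INT32_MAX = 2**31 - 1      # 2,147,483,647 entries per .msl file
BYTES_PER_ENTRY = 108      # sizing estimate quoted above


def min_shards_by_entry_cap(total_bytes: float) -> int:
    """Smallest shard count that keeps every shard under the entry cap."""
    entries = total_bytes / BYTES_PER_ENTRY
    return math.ceil(entries / INT32_MAX)


low = min_shards_by_entry_cap(338e9)    # lower end of the size estimate
high = min_shards_by_entry_cap(377e9)   # upper end of the size estimate
```

Both ends of the estimate give a minimum of 2 shards, so the entry cap alone does not force a large shard count.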

3. First prototype: 150 databases → 3 shards. — 2 of 3 shards crashed.
Two distinct problems surfaced:

  • Chronologer file contention: each shard built its own Chronologer predictor in parallel, and they all read the same TorchSharp weights temp file → IOException (file in use). Per-shard parallel predictor construction is not safe.
  • Memory blow-up: peak 71 GB for just 150 databases across 3 parallel shards. That does not scale to 17,300 (it would need ~800 GB+).

4. The real wall — the global iRT cache crash (the saga).
iRT is a pure function of the peptide sequence, not the organism, and the same sequence can appear in thousands of proteomes. The fix for both speed and the contention above was a global iRT cache: predict each unique sequence once, reuse everywhere. We built it as a single ConcurrentDictionary<string,double?>. It corrupted, throwing NullReferenceException from inside its own TryAddInternal/TryGetValue, once it grew past ~900 M entries, even with hundreds of GB of RAM free. We misdiagnosed it twice: first as a TorchSharp/native crash, then as memory pressure. The true cause is a .NET internal scale limit: a single dictionary's backing bucket array can't resize cleanly at that magnitude. Single-threaded inserts crashed too, so it was not a concurrency misuse.

  • Fix: shard the cache into 128 hash-routed sub-dictionaries (ShardedSeqCache, key routed by hash & 127). Each sub-dictionary holds only tens of millions, so every resize is a small allocation. This held 2.30 B unique sequences without issue, and made parallel preload safe (cache reload dropped from ~68 min to ~7 min).
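The hash-routed layout can be sketched as below. This is a Python illustration of the idea only: the class reuses the ShardedSeqCache name from the text, but its methods are hypothetical, CRC32 stands in for whatever hash the real code uses, and the sketch is not thread-safe.

```python
import zlib


class ShardedSeqCache:
    """Cache split into 128 sub-dictionaries, routed by hash & 127."""

    SHARDS = 128  # power of two, so `hash & (SHARDS - 1)` selects a sub-dictionary

    def __init__(self):
        self._shards = [dict() for _ in range(self.SHARDS)]

    def _shard_for(self, sequence: str) -> dict:
        # Assumption: CRC32 as the routing hash; it is stable across runs,
        # unlike Python's salted built-in hash().
        return self._shards[zlib.crc32(sequence.encode("ascii")) & (self.SHARDS - 1)]

    def get_or_add(self, sequence, predict):
        shard = self._shard_for(sequence)
        if sequence not in shard:       # membership test, so None values are cached too
            shard[sequence] = predict(sequence)
        return shard[sequence]

    def __len__(self):
        return sum(len(s) for s in self._shards)


calls = []


def fake_predict(seq):
    """Stand-in for the iRT predictor; records each call it receives."""
    calls.append(seq)
    return float(len(seq))


cache = ShardedSeqCache()
for seq in ["PEPTIDEK", "PEPTIDEK", "ANOTHERK"]:
    cache.get_or_add(seq, fake_predict)
```

Each sub-dictionary grows and resizes independently, so no single backing array has to hold the whole key set.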

5. Architecture pivot off the back of that.

  • Serial prediction with one shared, locked Chronologer — one predictor already saturates every core; N predictors only oversubscribe and re-introduce the temp-file contention.
  • Global iRT cache persisted to irt_cache.tsv (kept after the build — it makes re-builds and library extensions nearly free, since the expensive iRT is never re-paid).
  • Wave-based shard finalization with checkpointing — shards finalize in waves; each completed .msl is a durable checkpoint, and shard concurrency is decoupled from shard count (MSLDB_SHARD_CONCURRENCY), bounding peak RAM. An interrupt loses at most the in-flight wave's digest+write (iRT is already banked).
  • Direct lean-entry build — write the lean POCO directly instead of constructing a LibrarySpectrum/MzSpectrum for fragments that a lean entry doesn't carry (byte-identical output, far fewer allocations).
  • Also fixed a real mzLib data bug found here: peptides with an ambiguous residue (X → NaN m/z) poisoned the index range query and returned empty for the whole library; those peptides are now skipped at build time.
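The wave-based finalization with checkpointing could be organized along these lines. The Python sketch is an assumption-laden illustration: the file naming, the `.tmp` rename step, and the function names are invented here, and a real build would run each wave's shards in parallel with the wave size taken from MSLDB_SHARD_CONCURRENCY.

```python
import os
import tempfile


def finalize_in_waves(shard_ids, out_dir, build_shard, concurrency):
    """Finalize shards wave by wave, skipping shards whose .msl already exists."""
    def final_path(shard):
        return os.path.join(out_dir, f"shard_{shard:03d}.msl")  # hypothetical naming

    pending = [s for s in shard_ids if not os.path.exists(final_path(s))]
    waves = [pending[i:i + concurrency] for i in range(0, len(pending), concurrency)]
    for wave in waves:
        for shard in wave:  # shown serially; at most `concurrency` shards per wave
            tmp = final_path(shard) + ".tmp"
            build_shard(shard, tmp)
            os.replace(tmp, final_path(shard))  # file appears only once complete
    return waves


def fake_build(shard, path):
    """Stand-in for the digest+write step of one shard."""
    with open(path, "w") as handle:
        handle.write(f"shard {shard}\n")


with tempfile.TemporaryDirectory() as out_dir:
    first_run = finalize_in_waves(range(5), out_dir, fake_build, concurrency=2)
    second_run = finalize_in_waves(range(5), out_dir, fake_build, concurrency=2)
```

The second call finds every shard already on disk and schedules nothing, which is the resume behaviour described above.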

6. Choosing the final shard count: 160.
Three constraints drove it:

  • (a) Each shard must stay under the 2³¹ entry cap.
  • (b) Keep every database whole in exactly one shard. This is the key design choice: if no database is split across shards, the per-database union across shards is clean and the search never has to merge partial results — it just searches each shard and attributes hits per db|accession.
  • (c) Balance load and bound per-shard RAM. Smaller shards mean a smaller working set per shard load and smoother parallelism.

Greedy size-balanced packing of 17,300 whole databases into shards of ~2 GB lands at 160 shards: ~18 M entries and ~2.25 GB per shard, ~108 databases per shard, 358 GB total. (Two shards would technically fit under the cap, but each would be ~180 GB to load — unworkable; 160 keeps each shard cheap to load and resumable.)
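One common way to do greedy size-balanced packing of whole items is largest-first assignment to the currently lightest bin. The Python sketch below shows that heuristic under the assumption that it resembles what the builder does; the function name and the toy sizes are invented.

```python
import heapq


def pack_whole_databases(db_sizes: dict[str, int], num_shards: int) -> list[list[str]]:
    """Largest-first greedy: place each database in the currently lightest shard."""
    heap = [(0, shard) for shard in range(num_shards)]  # (total size, shard id)
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]
    by_size = sorted(db_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        total, shard = heapq.heappop(heap)
        shards[shard].append(name)      # the database stays whole in one shard
        heapq.heappush(heap, (total + size, shard))
    return shards


sizes = {"a": 9, "b": 7, "c": 6, "d": 5, "e": 3}  # toy sizes, arbitrary units
layout = pack_whole_databases(sizes, 2)
```

Here the two shards end up with totals of 14 and 16, and no database is split. The quoted production figures are consistent with each other at the same level: 17,300 / 160 ≈ 108 databases per shard and 358 GB / 160 ≈ 2.24 GB per shard.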

7. Final build & search numbers.

  • Build: 160 shards in 7.2 h, 0 restarts, peak 473 GB (server-GC-capped, no OOM), 2.30 B unique Chronologer iRT predictions cached (irt_cache.tsv, 127 GB, kept).
  • Search (160 shards × 3 raw files): large growing heaps need server GC — under workstation GC the per-shard load degraded from ~22 s to ~106 s; server GC keeps it flat. Full search ran in ~1.9 h, peak 325 GB (server-GC-capped, no OOM), human base 1,769 PSMs @ 1% FDR, all 17,300 organisms summarized.

Root-cause lessons (carried forward)

  • A single ConcurrentDictionary corrupts past ~900 M entries regardless of free RAM — shard large dictionaries (this cost a multi-day misdiagnosis: TorchSharp → memory pressure → the real .NET scale limit).
  • Large, growing heaps need server GC for both build and search; workstation GC causes serial full-GC stalls at this scale.

Notes for reviewers

🤖 Generated with Claude Code

MICHAEL SHORTREED and others added 30 commits November 18, 2021 12:30
MICHAEL SHORTREED and others added 24 commits March 21, 2026 18:11
# Conflicts:
#	.gitignore
#	mzLib/Readers/SpectralLibrary/SpectralLibrary.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslConverter.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslCoverage.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslCoverageGaps.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslEdgeCases.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslFinalValidation.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslFoundation.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslIndex.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslIntegration.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslLibrary.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt10SinglePass.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt11VersionManagement.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt12FlushOrder.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt13GetAllLibrarySpectraFix.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt14CustomLossResiduePosition.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt15GetAllEntriesAllocation.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt16PeptideLengthFallback.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt2LoadIndexOnly.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt3SpectralAngle.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt4NeutralLoss.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt5CoverageGaps.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt6KoinaPipeline.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt7Merger.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt9AsyncDeadlock.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslProteoformIndex.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslReader.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslReaderCoverage.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslRegressionMsp.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslStructsCoverage.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslUpdateAndSave.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslWriter.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestSpectralLibraryMslCoverage.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslcompression.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslcustomneutralloss.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslinternalionsroundtrip.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslmerger.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslprompt1roundtrip.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslproteoformscoring.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/SpectralLibraryData/syntheticTest2_PredictedLibrary.msp
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/SpectralLibraryReaderTest.cs
#	mzLib/Test/FileReadingTests/SpectralLibraryTests/TestMslPrompt8Streaming.cs
#	mzLib/Test/KoinaTests/FragmentIntensityPrediction/Prosit2020IntensityHCDTests.cs
#	mzLib/Test/Test.csproj
…ragment I/O

MslIndex.Build eagerly populated the sequence/charge dictionary, and because a
peptide's modified sequence is not in the compact index struct it fetched each
one through the entry loader. In index-only mode that loader reads the entry's
fragment block from disk — so building the dictionary read EVERY fragment in
the library, defeating the point of LoadIndexOnly (a mass/RT window workflow
that never does a sequence lookup paid a full-library fragment read up front).

Make _bySeqCharge nullable and build it lazily: MslIndex's ctor/Build take a
deferSeqChargeIndex flag, LoadIndexOnly passes it, and TryGetBySequenceCharge
builds the dictionary on first use (EnsureSeqChargeIndex, locked, once). The
elution-group map is unaffected — it needs only the compact struct. Full-load
mode is unchanged (its loader is in-memory, so eager build stays cheap).
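The deferred build described in this commit message follows the usual "build once, on first use, under a lock" pattern. A minimal Python sketch of that pattern, with hypothetical class and parameter names rather than the MslIndex API:

```python
import threading


class LazySeqChargeIndex:
    def __init__(self, entries, load_sequence):
        self._entries = entries              # compact (entry_id, charge) records
        self._load_sequence = load_sequence  # expensive per-entry lookup
        self._by_seq_charge = None           # deferred: built on first lookup
        self._lock = threading.Lock()

    def _ensure(self):
        if self._by_seq_charge is None:
            with self._lock:
                if self._by_seq_charge is None:  # re-check under the lock
                    self._by_seq_charge = {
                        (self._load_sequence(eid), z): eid
                        for eid, z in self._entries
                    }
        return self._by_seq_charge

    def try_get(self, sequence, charge):
        """Return the entry id, or None when the pair is absent."""
        return self._ensure().get((sequence, charge))


loads = []


def fake_loader(entry_id):
    """Stand-in for the loader that would touch the fragment block on disk."""
    loads.append(entry_id)
    return ["PEPTIDEK", "ANOTHERK"][entry_id]


idx = LazySeqChargeIndex([(0, 2), (1, 3)], fake_loader)
loads_before_lookup = len(loads)
hit = idx.try_get("ANOTHERK", 3)
miss = idx.try_get("ANOTHERK", 2)
```

Constructing the index triggers no loads; the first lookup pays for the whole dictionary once, and later lookups reuse it.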

On an 11-database parallel search this cut per-database index-only load from
~0.7-4.0 s to ~0.2-1.6 s and removed the reader's dominant cost.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-peptide prediction

The base PredictRetentionTimeEquivalents fans out over PredictCore, but every
PredictCore call runs the model on a batch of one and serializes on the model
lock — so predicting N peptides was N locked batch-1 forward passes, the bulk
of .msl library build time.

Override it in ChronologerRetentionTimePredictor: format/encode all peptides in
parallel (pure CPU, no model access), then run the model in large batched
forward passes (one lock acquisition per chunk). The model is in eval mode, so
BatchNorm uses running statistics and each peptide's prediction is independent
of the batch — results are byte-identical to the per-peptide path.

Building a ~9,400-entry library dropped from ~46 s to ~3.7 s (~12x), with
identical stored iRT (verified min/max/non-null match). This makes large
library builds (1000s of databases) practical.
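The shape of the batched path, encode everything first and then take the model lock once per chunk, can be sketched as follows. This is a Python illustration with stand-in callables; the function name and the default chunk size are assumptions, and the actual override works on Torch tensors.

```python
import threading

model_lock = threading.Lock()


def predict_batched(peptides, encode, run_model, chunk_size=2048):
    """Encode all peptides, then run the model on chunks, one lock per chunk."""
    encoded = [encode(p) for p in peptides]  # pure CPU work, safe to parallelize
    results = []
    for start in range(0, len(encoded), chunk_size):
        chunk = encoded[start:start + chunk_size]
        with model_lock:                     # one acquisition per chunk
            results.extend(run_model(chunk))
    return results


model_calls = []


def fake_model(batch):
    """Stand-in model: records the batch size and returns one value per item."""
    model_calls.append(len(batch))
    return [float(x) for x in batch]


predictions = predict_batched(["AK", "AAK", "AAAK", "AAAAK", "AAAAAK"],
                              encode=len, run_model=fake_model, chunk_size=2)
```

Five peptides with a chunk size of 2 result in three model calls instead of five.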

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…zy build)

Adds deferred-vs-eager parity tests for MslIndex: a deferred index (deferSeqChargeIndex: true,
as LoadIndexOnly uses) returns the same TryGetBySequenceCharge results as an eagerly-built one
once the lazy EnsureSeqChargeIndex fires, reports misses for absent sequence/charge, and is
consistent across repeated lookups. Locks the behaviour of the defer-the-seq/charge-index change
that lets LoadIndexOnly open a library without fragment I/O.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
float.CompareTo orders NaN before every finite value, so Array.Sort
places NaN-m/z entries at index 0. QueryMzRange started its upward
scan there and stopped at the first PrecursorMz <= mzHigh failure
(NaN <= x is false), so a single NaN-m/z precursor returned an empty
result for the ENTIRE library -- silent total data loss for any
library containing e.g. a peptide with the ambiguous residue X
(undefined mass -> NaN m/z).

Record _firstFiniteMz after sorting and start the lower-bound binary
search there, so m/z-window queries run on the monotonic finite range
and simply exclude the unmatchable NaN-m/z entries.

Add regression tests for a leading-NaN entry and an all-NaN index.
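The fix can be illustrated with a small Python sketch that mimics the NaN-first ordering and restricts the binary search to the finite tail. The function names are hypothetical and the sort key only imitates the ordering described for float.CompareTo; it is not the MslIndex code.

```python
import math
from bisect import bisect_left, bisect_right


def build_index(mzs):
    """Sort with NaN before every finite value and record where finite values start."""
    ordered = sorted(mzs, key=lambda v: (not math.isnan(v), 0.0 if math.isnan(v) else v))
    first_finite = sum(1 for v in ordered if math.isnan(v))
    return ordered, first_finite


def query_mz_range(ordered, first_finite, mz_lo, mz_hi):
    """Return entries in [mz_lo, mz_hi], searching only the finite range."""
    lo = bisect_left(ordered, mz_lo, first_finite)
    hi = bisect_right(ordered, mz_hi, first_finite)
    return ordered[lo:hi]


ordered, first_finite = build_index([500.0, float("nan"), 400.0, 600.0])
window = query_mz_range(ordered, first_finite, 450.0, 650.0)

all_nan, all_nan_start = build_index([float("nan"), float("nan")])
empty = query_mz_range(all_nan, all_nan_start, 0.0, 1000.0)
```

The leading NaN entry no longer affects the window query, and an all-NaN index returns an empty result.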

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These six generated benchmark reports (csv/html/github-md) were committed
before the **/BenchmarkDotNet.Artifacts/ .gitignore rule, so they stayed
tracked and produce noisy diffs/merge conflicts on every benchmark run.
Remove them from source control; the existing ignore rule keeps them out.

Addresses PR smith-chem-wisc#1036 review comment (committed benchmark artifacts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mentIon

ReadFragmentIon_InternalIon_StartAndEndResidue_Correct only checked the
start residue (FragmentNumber). Add the secondary product type and end
residue (SecondaryProductType / SecondaryFragmentNumber) asserts the name
already promised, so the text-parse path is proven to preserve the full
internal-ion annotation — not just the binary round-trip test.

Addresses PR smith-chem-wisc#1036 review comment (internal-ion annotation completeness).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MslProteoformScorer.MatchedFragments intentionally carries the library
peak m/z/intensity, not the matched experimental peak — it is a
library-vs-experiment scorer whose result identifies which library
fragments matched. Experimental intensities feed only the spectral-angle
score. Document this at the construction site to settle the ambiguity.

Addresses PR smith-chem-wisc#1036 review comment (library vs experimental match values).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SpectralLibrary.cs had been reindented spaces -> tabs across the whole
file, which inflated the PR diff by ~730 whitespace-only lines and buried
the real .msl routing / internal-ion changes. Repo style is 4-space
(no .editorconfig; master uses spaces). Convert leading tabs back to
spaces; the file is byte-identical ignoring whitespace, so behavior is
unchanged. PR diff for this file drops from 1010/758 to ~252/24.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TestAveragingSpectraWriteFile.cs (a spectra-averaging teardown rewrite)
and TestDatabaseLoaders.cs (removal of an unused helper) are unrelated to
the .msl spectral-library format and only widened this PR's surface.
Revert both to master to keep the PR focused; the averaging-teardown
improvement can land in its own PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Strip development-process residue that leaked into source:
- Mslindex.cs: drop the "// New property/method to ADD alongside it"
  instruction comments and re-indent Count/GetEntryAt with tabs to match
  the file (they were pasted in space-indented); add real XML summaries.
- MslLibrary.cs / MslReader.cs: reword four doc comments that referenced
  an internal "Prompt N / Prompt N Handoff" process, keeping the technical
  content. No behavior change.

Addresses nbollis's smith-chem-wisc#1036 review (agent residue / mechanical assembly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… residue

The MSL test suite encoded the agent's prompt-development sequence into 16
fixture names (TestMslPromptN...) and ~50 doc/category/string mentions of
"Prompt N". Rename every fixture to describe what it tests (e.g.
TestMslPrompt13GetAllLibrarySpectraFix -> TestMslGetAllLibrarySpectra;
collisions resolved as TestMslMergerCoverage / TestMslCoverageGapsExtended),
update all cross-references, drop the [Category("PromptN")] tags, de-prompt
temp-dir names, and reword the doc/assert text. Also fixes one stale source
reference in MslSpectralLibrary.cs to the renamed version-management test.
Test logic and assertions are unchanged.

Addresses nbollis's smith-chem-wisc#1036 review (agent residue / mechanical assembly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The PR had bumped the nuspec <version> to a dev/local-feed value
(9.9.901) used while building local packages for the MetaMorpheus
consumer PRs. Revert to master's 1.0.574 so the package version isn't
inflated on merge; release versioning is handled separately.

Addresses nbollis smith-chem-wisc#1036 review comment ("Change Nuspec back").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s override

The override ran batched Torch forward passes for speed, but its results
are not bit-identical to per-peptide prediction (~1e-7 drift from batched
BatchNorm/matmul), which broke two pre-existing exact-equality tests
(PredictRetentionTimeEquivalents_MatchesSinglePredictions and _LargeBatch)
and contradicted its own "results identical to PredictCore" doc.

It is not needed here: the method is declared on IRetentionTimePredictor
and implemented in the base RetentionTimePredictor, so Chronologer now
inherits the base implementation. The consumers (MetaMorpheus parallel
search RT calibration) call the interface method and keep working; only a
one-time calibration step loses the batched speedup. This file now matches
master. Batched inference can return as its own PR with tolerance tests.

Addresses nbollis smith-chem-wisc#1036 review ("Is this method needed? ... use the existing method").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… corruption

Building spectral libraries calls Chronologer for millions of peptides. The ~30
tensor intermediates allocated inside each forward() were relying on the GC
finalizer thread for disposal, which races libtorch's native allocator and
corrupts the heap (0xC0000374) once enough inference has run. Wrap each Predict
chunk (and the single-peptide path) in NewDisposeScope() so those tensors are
freed deterministically on the calling thread.

Also adds an opt-in CUDA path (MZLIB_CHRONOLOGER_CUDA=1; default CPU, numerically
equivalent) behind a Device property; inert unless the env var is set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The batched PredictRetentionTimeEquivalents override runs the model at
batch size m, while PredictRetentionTimeEquivalent runs it at batch size
1. libtorch selects different conv/matmul kernels for the two paths, so
predictions agree only to float32 precision (~1e-7 relative), not
bit-exactly. Two assertions used Is.EqualTo with no tolerance and failed
on that drift. Add .Within(1e-4) - orders of magnitude above the noise
floor yet tight enough to still catch a genuine batch/single mismatch.
Also soften the override docstring, which claimed the two paths give
identical results.
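The effect of the tolerance can be shown with a small absolute-difference check, which is the idea behind `.Within(1e-4)`. The values in this Python sketch are made up to represent a float32-level relative drift of about 1e-7; they are not measured predictions.

```python
def within(actual: float, expected: float, tolerance: float) -> bool:
    """Absolute-tolerance comparison: |actual - expected| <= tolerance."""
    return abs(actual - expected) <= tolerance


single = 42.1234567               # illustrative single-peptide prediction
batched = single * (1 + 1e-7)     # illustrative batched result with tiny drift

exact_match = (single == batched)
tolerant_match = within(batched, single, 1e-4)
real_mismatch = within(single + 1e-3, single, 1e-4)
```

Exact equality fails on the drifted value while the tolerant check passes, and a difference of 1e-3 is still rejected.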

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 15, 2026


Codecov Report

❌ Patch coverage is 93.17872% with 200 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.33%. Comparing base (92be818) to head (ddf7b22).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| mzLib/Readers/SpectralLibrary/MslConverter.cs | 0.00% | 44 Missing ⚠️ |
| mzLib/Readers/SpectralLibrary/MslWriter.cs | 95.41% | 15 Missing and 26 partials ⚠️ |
| mzLib/Omics/SpectralLibrary/MslFragmentIon.cs | 89.23% | 13 Missing and 8 partials ⚠️ |
| mzLib/Readers/SpectralLibrary/MslReader.cs | 95.37% | 6 Missing and 14 partials ⚠️ |
| mzLib/Omics/SpectralMatch/Mslindex.cs | 95.20% | 3 Missing and 10 partials ⚠️ |
| ...zLib/Readers/SpectralLibrary/MslProteoformIndex.cs | 91.93% | 3 Missing and 7 partials ⚠️ |
| ...n/Chronologer/ChronologerRetentionTimePredictor.cs | 85.48% | 4 Missing and 5 partials ⚠️ |
| mzLib/Readers/SpectralLibrary/MslLibrary.cs | 97.42% | 4 Missing and 5 partials ⚠️ |
| ...Lib/Readers/SpectralLibrary/MslProteoformScorer.cs | 93.43% | 5 Missing and 4 partials ⚠️ |
| mzLib/Readers/SpectralLibrary/MslMerger.cs | 95.93% | 2 Missing and 5 partials ⚠️ |

... and 4 more
Additional details and impacted files


@@            Coverage Diff             @@
##           master    #1076      +/-   ##
==========================================
+ Coverage   81.63%   82.33%   +0.69%     
==========================================
  Files         369      382      +13     
  Lines       47605    50509    +2904     
  Branches     5649     6102     +453     
==========================================
+ Hits        38863    41586    +2723     
- Misses       7643     7734      +91     
- Partials     1099     1189      +90     
| Files with missing lines | Coverage Δ |
|---|---|
| mzLib/Omics/SpectralLibrary/MslSpectralLibrary.cs | `100.00% <100.00%> (ø)` |
| ...SpectralMatch/IHasSequenceCoverageFromFragments.cs | `100.00% <ø> (ø)` |
| ...zLib/Readers/SpectralLibrary/Mslfiletypehandler.cs | `100.00% <100.00%> (ø)` |
| ...ts/Koina/AbstractClasses/FragmentIntensityModel.cs | `89.66% <89.65%> (+0.39%)` ⬆️ |
| mzLib/Readers/SpectralLibrary/SpectralLibrary.cs | `89.03% <96.47%> (+3.65%)` ⬆️ |
| mzLib/Omics/SpectralLibrary/MslStructs.cs | `66.66% <66.66%> (ø)` |
| mzLib/Readers/SpectralLibrary/MslLibraryData.cs | `89.28% <89.28%> (ø)` |
| mzLib/Readers/SpectralLibrary/MslMerger.cs | `95.93% <95.93%> (ø)` |
| ...n/Chronologer/ChronologerRetentionTimePredictor.cs | `83.58% <85.48%> (+0.75%)` ⬆️ |
| mzLib/Readers/SpectralLibrary/MslLibrary.cs | `97.42% <97.42%> (ø)` |

... and 7 more

... and 1 file with indirect coverage changes


…ictor

Remove the MZLIB_CHRONOLOGER_CUDA opt-in and the Device plumbing: the model
always runs on CPU and the batched override no longer moves tensors to a device.
An env-gated execution mode is the wrong surface, the GPU path was never used in
the production library build, and if GPU inference is wanted later it should be an
explicit constructor parameter with benchmarks. Keeps the batched-CPU override
and the NewDisposeScope() native-heap fix. Mirrors smith-chem-wisc#1077.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
trishorts and others added 2 commits June 15, 2026 16:02
…guard

Document why MslFragmentIon is a standalone persistence/working model rather than
a subclass of or interface over the MatchedFragmentIon domain type (different
layer, mutable float32 storage record, no eager Product / neutral mass, carries
storage-only ExcludeFromQuant; the two are bridged only by explicit conversion).
Also document why the two annotation renderings intentionally differ.

Add MslFragmentIonInteropTests: round-trips a diverse fragment set (terminal,
neutral-loss, internal, diagnostic) through ToLibrarySpectrum/FromLibrarySpectrum
asserting every mapped field survives, so the two models can't silently drift; and
pins that ExcludeFromQuant is dropped by the domain projection (it round-trips only
through the binary .msl format).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Search

Offline library compiler for MetaMorpheus's ManySearchTask: digests protein
FASTAs once and writes compact indexed .msl libraries (precursor mass + charge +
Chronologer iRT + sequence, lean / fragment-less by default) so the parallel
search reads pre-built libraries instead of re-digesting thousands of FASTAs at
search time. Supports per-database output, a merged db|accession-tagged index,
and size-balanced sharded production builds with a global persisted iRT cache.

Added as a new project under mzLib/MslDbBuilder referencing UsefulProteomics-
Databases and Readers. Depends on the .msl writer/reader API from smith-chem-wisc#1036, so its
base is mzlib_speclib.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>