Add MslDbBuilder: FASTA→.msl spectral-library builder for ManySearchTask by trishorts · Pull Request #1076 · smith-chem-wisc/mzLib

trishorts · 2026-06-15T19:00:20Z

What this PR adds

MslDbBuilder — a standalone .NET 8 console tool that turns protein FASTA databases into mzLib .msl spectral libraries. It is the offline "library compiler" for MetaMorpheus's ManySearchTask: instead of digesting thousands of FASTAs at search time, the search reads pre-built .msl files (precursor mass + charge + Chronologer iRT + sequence, optionally with stored fragments).

It is added as a new project under mzLib/MslDbBuilder/ referencing UsefulProteomicsDatabases and Readers, and it uses the .msl writer/reader API introduced in PR #1036. This PR is therefore stacked on #1036 (mzlib_speclib) — until that merges, the diff here also shows the .msl format work; it collapses to just the builder once #1036 lands. Please review/merge #1036 first.

Plain-language explanation (what it does and why)

A spectral-library search needs a peptide source: the list of candidate peptides (with their fragment masses) to compare against each spectrum. Normally MetaMorpheus generates that on the fly by digesting a FASTA every run. For ManySearchTask we want to search one sample against thousands of organisms at once (e.g. all bacteria), and re-digesting tens of thousands of FASTAs on every search is wasteful and slow.

MslDbBuilder does that digestion once, offline, and writes the result as compact indexed .msl files:

Reads each FASTA and digests it (trypsin, 0 missed cleavages, length ≥ 7, variable init-Met, Carbamidomethyl on C / variable Oxidation on M, max 2 mods), generating target and reversed-decoy peptides.
Predicts each peptide's retention time once with Chronologer (an iRT model) and stores it in the index, so the search can use an RT pre-filter.
Writes a lean library by default: precursor mass + charge + iRT + sequence, no stored fragments (the search re-fragments the handful of candidates that survive the precursor+RT filter — far cheaper than storing fragments for billions of peptides).
Can emit one .msl per organism, merge many organisms into a single indexed file (each entry tagged db|accession so the search can still report per-organism hits), or build directly into size-balanced shards for production scale.

The payoff at search time: thousands of "databases" load as a few big indexed files instead of thousands of FASTA opens, and most candidates are eliminated by a learned precursor-mass + RT gate before any fragmentation happens.
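To make the gate concrete, here is a minimal Python sketch of a precursor-mass window followed by an iRT window over a mass-sorted lean library. It illustrates the idea only and is not the MetaMorpheus implementation: `LeanEntry`, `gate`, the tolerances (`ppm_tol`, `irt_tol`), and the example masses are all hypothetical, and the real gate is described as learned rather than fixed-width.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class LeanEntry(NamedTuple):
    mass: float      # neutral precursor mass in Da (made-up values below)
    irt: float       # predicted iRT, arbitrary units
    sequence: str


def gate(library, masses, obs_mass, obs_irt, ppm_tol=10.0, irt_tol=5.0):
    """Keep entries inside the mass window, then inside the iRT window.

    `library` must be sorted by mass and `masses` must be its mass column.
    Both tolerances are illustrative assumptions.
    """
    delta = obs_mass * ppm_tol * 1e-6
    lo = bisect_left(masses, obs_mass - delta)
    hi = bisect_right(masses, obs_mass + delta)
    # Only the few entries in the mass window are ever compared on iRT.
    return [e for e in library[lo:hi] if abs(e.irt - obs_irt) <= irt_tol]


library = sorted(
    [
        LeanEntry(1000.000, 20.0, "PEPTIDEK"),
        LeanEntry(1000.005, 60.0, "EDITPEPK"),
        LeanEntry(1000.500, 21.0, "KPEPTIDE"),
    ],
    key=lambda e: e.mass,
)
masses = [e.mass for e in library]
survivors = gate(library, masses, obs_mass=1000.002, obs_irt=22.0)
```

In this toy library two entries pass the mass window and one of those passes the iRT window, so only that single candidate would go on to fragmentation.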

Modes

MslDbBuilder [--verify] [--lean] <outDir> <fasta...> — FASTA → one .msl per database.
--shardbuild <outDir> <numShards> <fasta...|@listfile> — build N lean, size-balanced merged shards directly from FASTAs, streaming, each entry tagged db|accession, with Chronologer iRT (the production path).
--merge <out.msl> <msl...> — combine many .msl into one merged file.
--read / --probe / --checkrt / --rtcalib / --streamtest — round-trip, inspection, RT-calibration and writer-test utilities.

How we arrived at the production design (160 shards) — the sizing journey

The goal was a single searchable library covering all ~17,300 bacterial proteomes (~30 GB of input FASTA). Getting there was not obvious; here is everything we tried and why we ended where we did.

1. "Just build one giant merged .msl." — Impossible.
The .msl format stores its entry/precursor count as a 32-bit int (cap ≈ 2.147 B, i.e. 2³¹ − 1). The full bacterial set is billions of target+decoy peptide entries, well past that cap. A single merged file simply cannot represent it. → We must shard.
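The arithmetic behind the cap can be sketched in a few lines of Python. The total used here, 160 × ~18 M entries, is taken from the per-shard figures quoted in step 6 and is only approximate; `min_shards` is a hypothetical helper, not part of the tool.

```python
INT32_MAX = 2**31 - 1  # largest count a signed 32-bit int can hold


def min_shards(total_entries, cap=INT32_MAX):
    """Smallest number of files such that each can stay at or under `cap`."""
    return -(-total_entries // cap)  # integer ceiling division


# Approximate library size implied by 160 shards of ~18 M entries each.
approx_total = 160 * 18_000_000
lower_bound = min_shards(approx_total)
```

This is only a lower bound from the entry cap; the shard count actually chosen is driven by the other constraints discussed later.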

2. "A handful of big shards (~4–6)." — The right instinct, wrong first cut.
Sizing said ~108 bytes/entry, ~11–12× the FASTA size, ~338–377 GB total. A few shards each under the 2.1 B cap looked sufficient on paper, so the first prototype tried a small shard count.

3. First prototype: 150 databases → 3 shards. — 2 of 3 shards crashed.
Two distinct problems surfaced:

Chronologer file contention: each shard built its own Chronologer predictor in parallel, and they all read the same TorchSharp weights temp file → IOException (file in use). Per-shard parallel predictor construction is not safe.
Memory blow-up: peak 71 GB for just 150 databases across 3 parallel shards. That does not scale to 17,300 (it would need ~800 GB+).

4. The real wall — the global iRT cache crash (the saga).
iRT is a pure function of the peptide sequence, not the organism, and the same sequence appears in thousands of proteomes. The fix for both speed and the contention above was a global iRT cache: predict each unique sequence once, reuse everywhere. We built it as a single ConcurrentDictionary<string,double?>. It corrupted, throwing NullReferenceException from inside its own TryAddInternal/TryGetValue, once it grew past ~900 M entries, even with hundreds of GB of RAM free. We misdiagnosed it twice: first as a TorchSharp/native crash, then as memory pressure. The cause we settled on is a .NET internal scale limit: a single dictionary's backing bucket array can't resize cleanly at that magnitude. Single-threaded inserts crashed too, so it was not a concurrency misuse.

Fix: shard the cache into 128 hash-routed sub-dictionaries (ShardedSeqCache, key routed by hash & 127). Each sub-dictionary holds only tens of millions, so every resize is a small allocation. This held 2.30 B unique sequences without issue, and made parallel preload safe (cache reload dropped from ~68 min to ~7 min).
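The routing idea can be sketched in Python as follows. This is an illustration of hash-routed sub-dictionaries, not the C# ShardedSeqCache itself: the class body, `get_or_add`, and the use of `zlib.crc32` as the hash are assumptions made for the example (CRC32 is used because Python's built-in string hash is salted per process).

```python
import zlib


class ShardedSeqCache:
    """Sequence -> iRT cache split across a power-of-two number of dicts."""

    def __init__(self, shard_count=128):
        if shard_count < 1 or shard_count & (shard_count - 1):
            raise ValueError("shard_count must be a power of two")
        self._mask = shard_count - 1          # 127 when shard_count is 128
        self._shards = [dict() for _ in range(shard_count)]

    def shard_index(self, sequence):
        # Route by hash & mask so each sub-dictionary stays small.
        return zlib.crc32(sequence.encode("utf-8")) & self._mask

    def get_or_add(self, sequence, predict):
        shard = self._shards[self.shard_index(sequence)]
        if sequence not in shard:             # a cached None still counts as cached
            shard[sequence] = predict(sequence)
        return shard[sequence]

    def __len__(self):
        return sum(len(s) for s in self._shards)


calls = []


def fake_predict(sequence):
    """Stand-in for the iRT predictor; records how often it is invoked."""
    calls.append(sequence)
    return float(len(sequence))


cache = ShardedSeqCache()
first = cache.get_or_add("PEPTIDEK", fake_predict)
second = cache.get_or_add("PEPTIDEK", fake_predict)
```

The second lookup is served from the cache, so the stand-in predictor runs once per unique sequence regardless of how many proteomes contain it.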

5. Architecture pivot off the back of that.

Serial prediction with one shared, locked Chronologer — one predictor already saturates every core; N predictors only oversubscribe and re-introduce the temp-file contention.
Global iRT cache persisted to irt_cache.tsv (kept after the build — it makes re-builds and library extensions nearly free, since the expensive iRT is never re-paid).
Wave-based shard finalization with checkpointing — shards finalize in waves; each completed .msl is a durable checkpoint, and shard concurrency is decoupled from shard count (MSLDB_SHARD_CONCURRENCY), bounding peak RAM. An interrupt loses at most the in-flight wave's digest+write (iRT is already banked).
Direct lean-entry build — write the lean POCO directly instead of constructing a LibrarySpectrum/MzSpectrum for fragments that a lean entry doesn't carry (byte-identical output, far fewer allocations).
Also fixed a real mzLib data bug found here: peptides with an ambiguous residue (X → NaN m/z) poisoned the index range query and returned empty for the whole library; those peptides are now skipped at build time.
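The wave-based finalization with checkpointing described above can be sketched like this in Python. It is a simplified illustration under stated assumptions: the `shard_NNN.msl` file naming, the write-to-temp-then-rename step, the default concurrency of 4, and the `build_shard` callback are all hypothetical; only the MSLDB_SHARD_CONCURRENCY variable name comes from the description.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def _finalize_one(out_dir, shard_id, build_shard):
    final = out_dir / f"shard_{shard_id:03d}.msl"   # hypothetical naming
    tmp = final.with_suffix(".msl.tmp")
    tmp.write_bytes(build_shard(shard_id))
    os.replace(tmp, final)  # the finished file appears only when complete


def finalize_in_waves(out_dir, num_shards, build_shard, concurrency=None):
    """Build missing shards in waves; return the shard ids built this run."""
    if concurrency is None:
        concurrency = int(os.environ.get("MSLDB_SHARD_CONCURRENCY", "4"))
    out_dir = Path(out_dir)
    # A completed .msl on disk is the checkpoint: skip it on resume.
    pending = [i for i in range(num_shards)
               if not (out_dir / f"shard_{i:03d}.msl").exists()]
    built = []
    for start in range(0, len(pending), concurrency):
        wave = pending[start:start + concurrency]
        with ThreadPoolExecutor(max_workers=len(wave)) as pool:
            list(pool.map(lambda i: _finalize_one(out_dir, i, build_shard), wave))
        built.extend(wave)
    return built


with tempfile.TemporaryDirectory() as d:
    first_run = finalize_in_waves(d, 5, lambda i: b"lean entries", concurrency=2)
    os.remove(os.path.join(d, "shard_003.msl"))     # simulate one lost shard
    second_run = finalize_in_waves(d, 5, lambda i: b"lean entries", concurrency=2)
```

Because concurrency is a separate parameter from the shard count, peak memory is bounded by the wave size, and a rerun rebuilds only the shards whose output is missing.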

6. Choosing the final shard count: 160.
Three constraints drove it:

(a) Each shard must stay under the 2³¹ entry cap.
(b) Keep every database whole in exactly one shard. This is the key design choice: if no database is split across shards, the per-database union across shards is clean and the search never has to merge partial results — it just searches each shard and attributes hits per db|accession.
(c) Balance load and bound per-shard RAM. Smaller shards mean a smaller working set per shard load and smoother parallelism.

Greedy size-balanced packing of 17,300 whole databases into shards of ~2 GB lands at 160 shards: ~18 M entries and ~2.25 GB per shard, ~108 databases per shard, 358 GB total. (Two shards would technically fit under the cap, but each would be ~180 GB to load — unworkable; 160 keeps each shard cheap to load and resumable.)
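One common way to do greedy size-balanced packing of whole items is largest-first assignment to the currently lightest bin. The Python sketch below shows that technique; whether MslDbBuilder uses exactly this ordering is an assumption, and `pack_databases` and the toy sizes are hypothetical.

```python
import heapq


def pack_databases(db_sizes, num_shards):
    """Assign each whole database to the currently lightest shard.

    Databases are taken largest first, so no database is ever split and
    shard loads end up close to balanced.
    """
    heap = [(0, i) for i in range(num_shards)]      # (current load, shard id)
    shards = [[] for _ in range(num_shards)]
    for name, size in sorted(db_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(heap)               # lightest shard so far
        shards[i].append(name)
        heapq.heappush(heap, (load + size, i))
    loads = [0] * num_shards
    for load, i in heap:
        loads[i] = load
    return shards, loads


toy_sizes = {"a": 5, "b": 4, "c": 3, "d": 3, "e": 1}
shards, loads = pack_databases(toy_sizes, num_shards=2)
```

Keeping each database in exactly one shard is what lets the search attribute hits per db|accession without merging partial results across shards.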

7. Final build & search numbers.

Build: 160 shards in 7.2 h, 0 restarts, peak 473 GB (server-GC-capped, no OOM), 2.30 B unique Chronologer iRT predictions cached (irt_cache.tsv, 127 GB, kept).
Search (160 shards × 3 raw files): large growing heaps need server GC — under workstation GC the per-shard load degraded from ~22 s to ~106 s; server GC keeps it flat. Full search ran in ~1.9 h, peak 325 GB (server-GC-capped, no OOM), human base 1,769 PSMs @ 1% FDR, all 17,300 organisms summarized.

Root-cause lessons (carried forward)

A single ConcurrentDictionary corrupts past ~900 M entries regardless of free RAM — shard large dictionaries (this cost a multi-day misdiagnosis: TorchSharp → memory pressure → the real .NET scale limit).
Large, growing heaps need server GC for both build and search; workstation GC causes serial full-GC stalls at this scale.

Notes for reviewers

This is a console tool, added under mzLib/MslDbBuilder/. If the team would prefer it live elsewhere (e.g. a tools/ folder or a separate repo), say the word.
It depends on the .msl API from #1036 (MSL mzLib Spectrum Library indexed binary format), so its base is mzlib_speclib; it will not build against master until #1036 merges.

🤖 Generated with Claude Code


…e shadows type)

…ragment I/O MslIndex.Build eagerly populated the sequence/charge dictionary, and because a peptide's modified sequence is not in the compact index struct it fetched each one through the entry loader. In index-only mode that loader reads the entry's fragment block from disk — so building the dictionary read EVERY fragment in the library, defeating the point of LoadIndexOnly (a mass/RT window workflow that never does a sequence lookup paid a full-library fragment read up front). Make _bySeqCharge nullable and build it lazily: MslIndex's ctor/Build take a deferSeqChargeIndex flag, LoadIndexOnly passes it, and TryGetBySequenceCharge builds the dictionary on first use (EnsureSeqChargeIndex, locked, once). The elution-group map is unaffected — it needs only the compact struct. Full-load mode is unchanged (its loader is in-memory, so eager build stays cheap). On an 11-database parallel search this cut per-database index-only load from ~0.7-4.0 s to ~0.2-1.6 s and removed the reader's dominant cost. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
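The deferred, build-once pattern in this commit message can be sketched in Python. This is an illustration of lazy initialization behind a lock, not the MslIndex code: `LazySeqChargeIndex`, `try_get`, and the loader callback are hypothetical stand-ins for the C# members named above.

```python
import threading


class LazySeqChargeIndex:
    """Builds the (sequence, charge) -> entry-index map on first lookup only."""

    def __init__(self, entry_count, load_key):
        self._entry_count = entry_count
        self._load_key = load_key        # expensive: stands in for the entry loader
        self._by_seq_charge = None       # stays None until a lookup needs it
        self._lock = threading.Lock()

    def _ensure_index(self):
        if self._by_seq_charge is None:
            with self._lock:
                if self._by_seq_charge is None:   # re-check inside the lock
                    built = {}
                    for i in range(self._entry_count):
                        built[self._load_key(i)] = i
                    self._by_seq_charge = built
        return self._by_seq_charge

    def try_get(self, sequence, charge):
        """Return the entry index, or None when the pair is absent."""
        return self._ensure_index().get((sequence, charge))


entries = [("PEPTIDEK", 2), ("PEPTIDEK", 3), ("ANOTHERK", 2)]
load_calls = []


def counting_loader(i):
    load_calls.append(i)
    return entries[i]


index = LazySeqChargeIndex(len(entries), counting_loader)
loads_before_lookup = len(load_calls)
hit = index.try_get("PEPTIDEK", 3)
miss = index.try_get("MISSINGK", 2)
```

A workflow that only queries by mass and RT never calls `try_get`, so it never pays for the loader at all.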

…-peptide prediction The base PredictRetentionTimeEquivalents fans out over PredictCore, but every PredictCore call runs the model on a batch of one and serializes on the model lock — so predicting N peptides was N locked batch-1 forward passes, the bulk of .msl library build time. Override it in ChronologerRetentionTimePredictor: format/encode all peptides in parallel (pure CPU, no model access), then run the model in large batched forward passes (one lock acquisition per chunk). The model is in eval mode, so BatchNorm uses running statistics and each peptide's prediction is independent of the batch — results are byte-identical to the per-peptide path. Building a ~9,400-entry library dropped from ~46 s to ~3.7 s (~12x), with identical stored iRT (verified min/max/non-null match). This makes large library builds (1000s of databases) practical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…zy build) Adds deferred-vs-eager parity tests for MslIndex: a deferred index (deferSeqChargeIndex: true, as LoadIndexOnly uses) returns the same TryGetBySequenceCharge results as an eagerly-built one once the lazy EnsureSeqChargeIndex fires, reports misses for absent sequence/charge, and is consistent across repeated lookups. Locks the behaviour of the defer-the-seq/charge-index change that lets LoadIndexOnly open a library without fragment I/O. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

float.CompareTo orders NaN before every finite value, so Array.Sort places NaN-m/z entries at index 0. QueryMzRange started its upward scan there and stopped at the first PrecursorMz <= mzHigh failure (NaN <= x is false), so a single NaN-m/z precursor returned an empty result for the ENTIRE library -- silent total data loss for any library containing e.g. a peptide with the ambiguous residue X (undefined mass -> NaN m/z). Record _firstFiniteMz after sorting and start the lower-bound binary search there, so m/z-window queries run on the monotonic finite range and simply exclude the unmatchable NaN-m/z entries. Add regression tests for a leading-NaN entry and an all-NaN index. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
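The fix can be illustrated with a small Python sketch. The sort key below reproduces the NaN-first ordering described for .NET, and `build_index` / `query_mz_range` are hypothetical stand-ins for the C# sort and QueryMzRange, not the mzLib code.

```python
import math
from bisect import bisect_left


def build_index(mzs):
    """Sort with NaN values first and record where finite values start."""
    ordered = sorted(
        mzs, key=lambda v: (not math.isnan(v), 0.0 if math.isnan(v) else v)
    )
    first_finite = sum(1 for v in ordered if math.isnan(v))
    return ordered, first_finite


def query_mz_range(ordered, first_finite, mz_low, mz_high):
    """Return finite m/z values in [mz_low, mz_high]."""
    # Start the lower-bound search past the NaN block, so the scan only
    # ever sees the monotonic finite range.
    start = bisect_left(ordered, mz_low, lo=first_finite)
    hits = []
    for v in ordered[start:]:
        if not v <= mz_high:
            break
        hits.append(v)
    return hits


ordered, first_finite = build_index([500.2, float("nan"), 400.1, 600.3])
hits = query_mz_range(ordered, first_finite, 450.0, 650.0)

all_nan, all_nan_start = build_index([float("nan"), float("nan")])
no_hits = query_mz_range(all_nan, all_nan_start, 450.0, 650.0)
```

With the search starting at `first_finite`, the leading NaN entry is simply excluded, and an all-NaN index returns an empty result without scanning.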

These six generated benchmark reports (csv/html/github-md) were committed before the **/BenchmarkDotNet.Artifacts/ .gitignore rule, so they stayed tracked and produce noisy diffs/merge conflicts on every benchmark run. Remove them from source control; the existing ignore rule keeps them out. Addresses PR smith-chem-wisc#1036 review comment (committed benchmark artifacts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…mentIon ReadFragmentIon_InternalIon_StartAndEndResidue_Correct only checked the start residue (FragmentNumber). Add the secondary product type and end residue (SecondaryProductType / SecondaryFragmentNumber) asserts the name already promised, so the text-parse path is proven to preserve the full internal-ion annotation — not just the binary round-trip test. Addresses PR smith-chem-wisc#1036 review comment (internal-ion annotation completeness). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MslProteoformScorer.MatchedFragments intentionally carries the library peak m/z/intensity, not the matched experimental peak — it is a library-vs-experiment scorer whose result identifies which library fragments matched. Experimental intensities feed only the spectral-angle score. Document this at the construction site to settle the ambiguity. Addresses PR smith-chem-wisc#1036 review comment (library vs experimental match values). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

SpectralLibrary.cs had been reindented spaces -> tabs across the whole file, which inflated the PR diff by ~730 whitespace-only lines and buried the real .msl routing / internal-ion changes. Repo style is 4-space (no .editorconfig; master uses spaces). Convert leading tabs back to spaces; the file is byte-identical ignoring whitespace, so behavior is unchanged. PR diff for this file drops from 1010/758 to ~252/24. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TestAveragingSpectraWriteFile.cs (a spectra-averaging teardown rewrite) and TestDatabaseLoaders.cs (removal of an unused helper) are unrelated to the .msl spectral-library format and only widened this PR's surface. Revert both to master to keep the PR focused; the averaging-teardown improvement can land in its own PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Strip development-process residue that leaked into source: - Mslindex.cs: drop the "// New property/method to ADD alongside it" instruction comments and re-indent Count/GetEntryAt with tabs to match the file (they were pasted in space-indented); add real XML summaries. - MslLibrary.cs / MslReader.cs: reword four doc comments that referenced an internal "Prompt N / Prompt N Handoff" process, keeping the technical content. No behavior change. Addresses nbollis's smith-chem-wisc#1036 review (agent residue / mechanical assembly). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… residue The MSL test suite encoded the agent's prompt-development sequence into 16 fixture names (TestMslPromptN...) and ~50 doc/category/string mentions of "Prompt N". Rename every fixture to describe what it tests (e.g. TestMslPrompt13GetAllLibrarySpectraFix -> TestMslGetAllLibrarySpectra; collisions resolved as TestMslMergerCoverage / TestMslCoverageGapsExtended), update all cross-references, drop the [Category("PromptN")] tags, de-prompt temp-dir names, and reword the doc/assert text. Also fixes one stale source reference in MslSpectralLibrary.cs to the renamed version-management test. Test logic and assertions are unchanged. Addresses nbollis's smith-chem-wisc#1036 review (agent residue / mechanical assembly). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The PR had bumped the nuspec <version> to a dev/local-feed value (9.9.901) used while building local packages for the MetaMorpheus consumer PRs. Revert to master's 1.0.574 so the package version isn't inflated on merge; release versioning is handled separately. Addresses nbollis smith-chem-wisc#1036 review comment ("Change Nuspec back"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…s override The override ran batched Torch forward passes for speed, but its results are not bit-identical to per-peptide prediction (~1e-7 drift from batched BatchNorm/matmul), which broke two pre-existing exact-equality tests (PredictRetentionTimeEquivalents_MatchesSinglePredictions and _LargeBatch) and contradicted its own "results identical to PredictCore" doc. It is not needed here: the method is declared on IRetentionTimePredictor and implemented in the base RetentionTimePredictor, so Chronologer now inherits the base implementation. The consumers (MetaMorpheus parallel search RT calibration) call the interface method and keep working; only a one-time calibration step loses the batched speedup. This file now matches master. Batched inference can return as its own PR with tolerance tests. Addresses nbollis smith-chem-wisc#1036 review ("Is this method needed? ... use the existing method"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… corruption Building spectral libraries calls Chronologer for millions of peptides. The ~30 tensor intermediates allocated inside each forward() were relying on the GC finalizer thread for disposal, which races libtorch's native allocator and corrupts the heap (0xC0000374) once enough inference has run. Wrap each Predict chunk (and the single-peptide path) in NewDisposeScope() so those tensors are freed deterministically on the calling thread. Also adds an opt-in CUDA path (MZLIB_CHRONOLOGER_CUDA=1; default CPU, numerically equivalent) behind a Device property; inert unless the env var is set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The batched PredictRetentionTimeEquivalents override runs the model at batch size m, while PredictRetentionTimeEquivalent runs it at batch size 1. libtorch selects different conv/matmul kernels for the two paths, so predictions agree only to float32 precision (~1e-7 relative), not bit-exactly. Two assertions used Is.EqualTo with no tolerance and failed on that drift. Add .Within(1e-4) - orders of magnitude above the noise floor yet tight enough to still catch a genuine batch/single mismatch. Also soften the override docstring, which claimed the two paths give identical results. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-15T19:09:44Z

Codecov Report

❌ Patch coverage is 93.17872% with 200 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.33%. Comparing base (92be818) to head (ddf7b22).
⚠️ Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
mzLib/Readers/SpectralLibrary/MslConverter.cs	0.00%	44 Missing ⚠️
mzLib/Readers/SpectralLibrary/MslWriter.cs	95.41%	15 Missing and 26 partials ⚠️
mzLib/Omics/SpectralLibrary/MslFragmentIon.cs	89.23%	13 Missing and 8 partials ⚠️
mzLib/Readers/SpectralLibrary/MslReader.cs	95.37%	6 Missing and 14 partials ⚠️
mzLib/Omics/SpectralMatch/Mslindex.cs	95.20%	3 Missing and 10 partials ⚠️
...zLib/Readers/SpectralLibrary/MslProteoformIndex.cs	91.93%	3 Missing and 7 partials ⚠️
...n/Chronologer/ChronologerRetentionTimePredictor.cs	85.48%	4 Missing and 5 partials ⚠️
mzLib/Readers/SpectralLibrary/MslLibrary.cs	97.42%	4 Missing and 5 partials ⚠️
...Lib/Readers/SpectralLibrary/MslProteoformScorer.cs	93.43%	5 Missing and 4 partials ⚠️
mzLib/Readers/SpectralLibrary/MslMerger.cs	95.93%	2 Missing and 5 partials ⚠️
... and 4 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1076      +/-   ##
==========================================
+ Coverage   81.63%   82.33%   +0.69%     
==========================================
  Files         369      382      +13     
  Lines       47605    50509    +2904     
  Branches     5649     6102     +453     
==========================================
+ Hits        38863    41586    +2723     
- Misses       7643     7734      +91     
- Partials     1099     1189      +90

Files with missing lines	Coverage Δ
mzLib/Omics/SpectralLibrary/MslSpectralLibrary.cs	`100.00% <100.00%> (ø)`
...SpectralMatch/IHasSequenceCoverageFromFragments.cs	`100.00% <ø> (ø)`
...zLib/Readers/SpectralLibrary/Mslfiletypehandler.cs	`100.00% <100.00%> (ø)`
...ts/Koina/AbstractClasses/FragmentIntensityModel.cs	`89.66% <89.65%> (+0.39%)`	⬆️
mzLib/Readers/SpectralLibrary/SpectralLibrary.cs	`89.03% <96.47%> (+3.65%)`	⬆️
mzLib/Omics/SpectralLibrary/MslStructs.cs	`66.66% <66.66%> (ø)`
mzLib/Readers/SpectralLibrary/MslLibraryData.cs	`89.28% <89.28%> (ø)`
mzLib/Readers/SpectralLibrary/MslMerger.cs	`95.93% <95.93%> (ø)`
...n/Chronologer/ChronologerRetentionTimePredictor.cs	`83.58% <85.48%> (+0.75%)`	⬆️
mzLib/Readers/SpectralLibrary/MslLibrary.cs	`97.42% <97.42%> (ø)`
... and 7 more

... and 1 file with indirect coverage changes


…ictor Remove the MZLIB_CHRONOLOGER_CUDA opt-in and the Device plumbing: the model always runs on CPU and the batched override no longer moves tensors to a device. An env-gated execution mode is the wrong surface, the GPU path was never used in the production library build, and if GPU inference is wanted later it should be an explicit constructor parameter with benchmarks. Keeps the batched-CPU override and the NewDisposeScope() native-heap fix. Mirrors smith-chem-wisc#1077. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…guard Document why MslFragmentIon is a standalone persistence/working model rather than a subclass of or interface over the MatchedFragmentIon domain type (different layer, mutable float32 storage record, no eager Product / neutral mass, carries storage-only ExcludeFromQuant; the two are bridged only by explicit conversion). Also document why the two annotation renderings intentionally differ. Add MslFragmentIonInteropTests: round-trips a diverse fragment set (terminal, neutral-loss, internal, diagnostic) through ToLibrarySpectrum/FromLibrarySpectrum asserting every mapped field survives, so the two models can't silently drift; and pins that ExcludeFromQuant is dropped by the domain projection (it round-trips only through the binary .msl format). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Search Offline library compiler for MetaMorpheus's ManySearchTask: digests protein FASTAs once and writes compact indexed .msl libraries (precursor mass + charge + Chronologer iRT + sequence, lean / fragment-less by default) so the parallel search reads pre-built libraries instead of re-digesting thousands of FASTAs at search time. Supports per-database output, a merged db|accession-tagged index, and size-balanced sharded production builds with a global persisted iRT cache. Added as a new project under mzLib/MslDbBuilder referencing UsefulProteomicsDatabases and Readers. Depends on the .msl writer/reader API from smith-chem-wisc#1036, so its base is mzlib_speclib. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MICHAEL SHORTREED and others added 30 commits November 18, 2021 12:30

correct Within calculation

a6b1639

update unit tests

fa4da8b

conflicts resolved back to upstream

3246567

Merge remote-tracking branch 'upstream/master'

a018d4d

Merge remote-tracking branch 'upstream/master'

15a37d0

this is the spot

892fa45

Merge remote-tracking branch 'upstream/master'

211013c

Merge branch 'master' of https://github.com/trishorts/mzLib

68104ee

Merge remote-tracking branch 'upstream/master'

d715a08

Merge remote-tracking branch 'upstream/master'

3565522

Merge remote-tracking branch 'upstream/master'

72e7b53

Merge remote-tracking branch 'upstream/master'

593872a

Merge branch 'master' of https://github.com/trishorts/mzLib

42dd034

Merge remote-tracking branch 'upstream/master'

fbeaec0

Merge remote-tracking branch 'upstream/master'

614ded7

Merge branch 'master' of https://github.com/trishorts/mzLib

47307c8

Merge remote-tracking branch 'upstream/master'

28e05ae

Merge remote-tracking branch 'upstream/master'

0a7c609

Merge remote-tracking branch 'upstream/master'

630d8c7

Merge branch 'master' of https://github.com/trishorts/mzLib

f6a386b

Merge remote-tracking branch 'upstream/master'

d673800

Merge branch 'master' of https://github.com/trishorts/mzLib

675a0ae

Merge remote-tracking branch 'upstream/master'

15d4baf

Merge remote-tracking branch 'upstream/master'

03ca9f7

Merge remote-tracking branch 'upstream/master'

d0a4c79

Merge remote-tracking branch 'upstream/master'

894b998

Merge remote-tracking branch 'upstream/master'

dd0c8c9

Merge remote-tracking branch 'upstream/master'

c6310ae

Merge remote-tracking branch 'upstream/master' into mzlib_speclib

34063e7

msl foundation

f403791

MICHAEL SHORTREED and others added 24 commits March 21, 2026 18:11

be gentle

b80f7aa

normalize retention time of incoming spectra to existing library

46c6424

variable naming consistency

8542188

nics review respnse

88244a1

internal review

9bd1e0b

merge upstream

c383e74

review fixes

e909066

fix: qualify SpectralLibrary type in merged Prosit2020 test (namespac…

0f8caa6

…e shadows type)

trishorts force-pushed the msldbbuilder branch from 42b2484 to affd0ab Compare June 15, 2026 20:25

trishorts and others added 2 commits June 15, 2026 16:02

trishorts force-pushed the msldbbuilder branch from affd0ab to ddf7b22 Compare June 15, 2026 21:04

trishorts wants to merge 92 commits into smith-chem-wisc:master from trishorts:msldbbuilder
