Add MslDbBuilder: FASTA→.msl spectral-library builder for ManySearchTask #1076
Open
trishorts wants to merge 92 commits into
# Conflicts: # .gitignore # mzLib/Readers/SpectralLibrary/SpectralLibrary.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslConverter.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslCoverage.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslCoverageGaps.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslEdgeCases.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslFinalValidation.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslFoundation.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslIndex.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslIntegration.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslLibrary.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt10SinglePass.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt11VersionManagement.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt12FlushOrder.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt13GetAllLibrarySpectraFix.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt14CustomLossResiduePosition.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt15GetAllEntriesAllocation.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt16PeptideLengthFallback.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt2LoadIndexOnly.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt3SpectralAngle.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt4NeutralLoss.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt5CoverageGaps.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt6KoinaPipeline.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt7Merger.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslPrompt9AsyncDeadlock.cs # 
mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslProteoformIndex.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslReader.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslReaderCoverage.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslRegressionMsp.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslStructsCoverage.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslUpdateAndSave.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestMslWriter.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/TestSpectralLibraryMslCoverage.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslcompression.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslcustomneutralloss.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslinternalionsroundtrip.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslmerger.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslprompt1roundtrip.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/MSL/Testmslproteoformscoring.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/SpectralLibraryData/syntheticTest2_PredictedLibrary.msp # mzLib/Test/FileReadingTests/SpectralLibraryTests/SpectralLibraryReaderTest.cs # mzLib/Test/FileReadingTests/SpectralLibraryTests/TestMslPrompt8Streaming.cs # mzLib/Test/KoinaTests/FragmentIntensityPrediction/Prosit2020IntensityHCDTests.cs # mzLib/Test/Test.csproj
…ragment I/O MslIndex.Build eagerly populated the sequence/charge dictionary, and because a peptide's modified sequence is not in the compact index struct it fetched each one through the entry loader. In index-only mode that loader reads the entry's fragment block from disk — so building the dictionary read EVERY fragment in the library, defeating the point of LoadIndexOnly (a mass/RT window workflow that never does a sequence lookup paid a full-library fragment read up front). Make _bySeqCharge nullable and build it lazily: MslIndex's ctor/Build take a deferSeqChargeIndex flag, LoadIndexOnly passes it, and TryGetBySequenceCharge builds the dictionary on first use (EnsureSeqChargeIndex, locked, once). The elution-group map is unaffected — it needs only the compact struct. Full-load mode is unchanged (its loader is in-memory, so eager build stays cheap). On an 11-database parallel search this cut per-database index-only load from ~0.7-4.0 s to ~0.2-1.6 s and removed the reader's dominant cost. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
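The deferred build described in this commit can be sketched roughly as follows. This is a minimal Python illustration, not the mzLib C# code: the class LazySeqChargeIndex, its load_sequence callback, and the (m/z, charge) tuple layout are hypothetical stand-ins for MslIndex, its entry loader, and the compact index struct.

```python
import threading
from typing import Callable, Dict, List, Optional, Tuple


class LazySeqChargeIndex:
    """Hypothetical stand-in for an index whose sequence/charge map is deferred."""

    def __init__(self, entries: List[Tuple[float, int]],
                 load_sequence: Callable[[int], str],
                 defer_seq_charge_index: bool = False) -> None:
        # entries[i] = (precursor m/z, charge): the compact data that is
        # always in memory. The modified sequence is NOT stored here.
        self._entries = entries
        # Assumed to be expensive in index-only mode (it reads from disk).
        self._load_sequence = load_sequence
        self._by_seq_charge: Optional[Dict[Tuple[str, int], int]] = None
        self._lock = threading.Lock()
        if not defer_seq_charge_index:
            self._ensure_seq_charge_index()  # eager path (full-load mode)

    def _ensure_seq_charge_index(self) -> Dict[Tuple[str, int], int]:
        # Double-checked locking: build at most once, even under contention.
        index = self._by_seq_charge
        if index is None:
            with self._lock:
                index = self._by_seq_charge
                if index is None:
                    index = {
                        (self._load_sequence(i), charge): i
                        for i, (_, charge) in enumerate(self._entries)
                    }
                    self._by_seq_charge = index
        return index

    def try_get_by_sequence_charge(self, sequence: str, charge: int) -> Optional[int]:
        # The first call on a deferred index pays the build cost; later calls do not.
        return self._ensure_seq_charge_index().get((sequence, charge))

    def count_in_mz_window(self, mz_low: float, mz_high: float) -> int:
        # Needs only the compact entries, so it never triggers the loader.
        return sum(1 for mz, _ in self._entries if mz_low <= mz <= mz_high)


loader_calls: List[int] = []


def fake_loader(i: int) -> str:
    loader_calls.append(i)  # record every simulated disk read
    return ["PEPTIDE", "PEPTIDE", "SEQUENCE"][i]


deferred = LazySeqChargeIndex([(400.2, 2), (267.1, 3), (500.7, 2)],
                              fake_loader, defer_seq_charge_index=True)
window_hits = deferred.count_in_mz_window(390.0, 510.0)
calls_before_lookup = len(loader_calls)
hit = deferred.try_get_by_sequence_charge("PEPTIDE", 3)
miss = deferred.try_get_by_sequence_charge("PEPTIDE", 4)
calls_after_lookup = len(loader_calls)
```

In this sketch the m/z-window query never touches the loader, while the first sequence/charge lookup triggers one pass over all entries and later lookups reuse the built dictionary.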
…-peptide prediction The base PredictRetentionTimeEquivalents fans out over PredictCore, but every PredictCore call runs the model on a batch of one and serializes on the model lock — so predicting N peptides was N locked batch-1 forward passes, the bulk of .msl library build time. Override it in ChronologerRetentionTimePredictor: format/encode all peptides in parallel (pure CPU, no model access), then run the model in large batched forward passes (one lock acquisition per chunk). The model is in eval mode, so BatchNorm uses running statistics and each peptide's prediction is independent of the batch — results are byte-identical to the per-peptide path. Building a ~9,400-entry library dropped from ~46 s to ~3.7 s (~12x), with identical stored iRT (verified min/max/non-null match). This makes large library builds (1000s of databases) practical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
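The shape of the batched path (encode everything in parallel, then run the model in chunks with one lock acquisition per chunk) can be sketched as below. Everything here is a hypothetical stand-in written in Python: FakeModel, encode, and the chunk size are assumptions, not Chronologer or TorchSharp APIs. The fake model is exact arithmetic, so both paths agree exactly here; a later commit in this PR notes that the real model's batched and single paths agree only to float32 precision.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


def encode(peptide: str) -> List[int]:
    # Stand-in for formatting/encoding: pure CPU work, no model access.
    return [ord(c) for c in peptide]


class FakeModel:
    """Hypothetical model that scores each encoded peptide independently."""

    def __init__(self) -> None:
        self.forward_calls = 0

    def forward(self, batch: List[List[int]]) -> List[float]:
        self.forward_calls += 1
        return [sum(row) / 100.0 for row in batch]


def predict_batched(peptides: List[str], model: FakeModel,
                    model_lock: threading.Lock, chunk_size: int = 4) -> List[float]:
    # Step 1: encode all peptides in parallel (pool.map preserves order).
    with ThreadPoolExecutor() as pool:
        encoded = list(pool.map(encode, peptides))
    # Step 2: run the model on large chunks, taking the lock once per chunk.
    predictions: List[float] = []
    for start in range(0, len(encoded), chunk_size):
        chunk = encoded[start:start + chunk_size]
        with model_lock:
            predictions.extend(model.forward(chunk))
    return predictions


def predict_one_at_a_time(peptides: List[str], model: FakeModel,
                          model_lock: threading.Lock) -> List[float]:
    # The slow path: one locked batch-of-one forward pass per peptide.
    out: List[float] = []
    for peptide in peptides:
        with model_lock:
            out.extend(model.forward([encode(peptide)]))
    return out


peptides = ["PEPTIDEK", "ELVISK", "LIVESK", "SAMPLER", "ANOTHERK",
            "TESTK", "PROTEINR", "DIGESTK", "TRYPSINR", "LIBRARYK"]
lock = threading.Lock()
batched_model, single_model = FakeModel(), FakeModel()
batched = predict_batched(peptides, batched_model, lock, chunk_size=4)
single = predict_one_at_a_time(peptides, single_model, lock)
```

With ten peptides and a chunk size of four, the batched path makes three forward calls where the per-peptide path makes ten.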
…zy build) Adds deferred-vs-eager parity tests for MslIndex: a deferred index (deferSeqChargeIndex: true, as LoadIndexOnly uses) returns the same TryGetBySequenceCharge results as an eagerly-built one once the lazy EnsureSeqChargeIndex fires, reports misses for absent sequence/charge, and is consistent across repeated lookups. Locks the behaviour of the defer-the-seq/charge-index change that lets LoadIndexOnly open a library without fragment I/O. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
float.CompareTo orders NaN before every finite value, so Array.Sort places NaN-m/z entries at index 0. QueryMzRange started its upward scan there and stopped at the first PrecursorMz <= mzHigh failure (NaN <= x is false), so a single NaN-m/z precursor returned an empty result for the ENTIRE library -- silent total data loss for any library containing e.g. a peptide with the ambiguous residue X (undefined mass -> NaN m/z). Record _firstFiniteMz after sorting and start the lower-bound binary search there, so m/z-window queries run on the monotonic finite range and simply exclude the unmatchable NaN-m/z entries. Add regression tests for a leading-NaN entry and an all-NaN index. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
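A small Python sketch of the failure and the fix described in this commit, under stated assumptions: sort_nan_first imitates the NaN-first ordering the commit attributes to float.CompareTo, naive_query is a simplified version of the scan that stops at the first failed comparison, and the function names are hypothetical rather than the mzLib API.

```python
import bisect
import math
from typing import List


def sort_nan_first(mzs: List[float]) -> List[float]:
    # Imitate the ordering the commit describes for .NET: NaN sorts first.
    # Python's default sort is not well-defined for NaN, so use an explicit key.
    return sorted(mzs, key=lambda mz: (0, 0.0) if math.isnan(mz) else (1, mz))


def first_finite_index(sorted_mzs: List[float]) -> int:
    # Index of the first non-NaN entry; len(sorted_mzs) if every entry is NaN.
    i = 0
    while i < len(sorted_mzs) and math.isnan(sorted_mzs[i]):
        i += 1
    return i


def naive_query(sorted_mzs: List[float], mz_low: float, mz_high: float) -> List[float]:
    # Simplified failure mode: scan upward from index 0 and stop at the first
    # entry that fails `mz <= mz_high`. NaN fails every comparison, so a
    # single leading NaN ends the scan before any real entry is seen.
    out: List[float] = []
    for mz in sorted_mzs:
        if not (mz <= mz_high):
            break
        if mz >= mz_low:
            out.append(mz)
    return out


def query_mz_range(sorted_mzs: List[float], first_finite: int,
                   mz_low: float, mz_high: float) -> List[float]:
    # Fix: binary-search only the monotonic finite range [first_finite, n).
    n = len(sorted_mzs)
    lo = bisect.bisect_left(sorted_mzs, mz_low, first_finite, n)
    hi = bisect.bisect_right(sorted_mzs, mz_high, first_finite, n)
    return sorted_mzs[lo:hi]


index = sort_nan_first([500.0, float("nan"), 300.0, 400.0])
start = first_finite_index(index)
broken = naive_query(index, 350.0, 550.0)
fixed = query_mz_range(index, start, 350.0, 550.0)

all_nan = sort_nan_first([float("nan"), float("nan")])
all_nan_result = query_mz_range(all_nan, first_finite_index(all_nan), 0.0, 1000.0)
```

The two cases at the bottom mirror the regression tests the commit mentions: a leading-NaN entry and an all-NaN index.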
These six generated benchmark reports (csv/html/github-md) were committed before the **/BenchmarkDotNet.Artifacts/ .gitignore rule, so they stayed tracked and produce noisy diffs/merge conflicts on every benchmark run. Remove them from source control; the existing ignore rule keeps them out. Addresses PR smith-chem-wisc#1036 review comment (committed benchmark artifacts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mentIon ReadFragmentIon_InternalIon_StartAndEndResidue_Correct only checked the start residue (FragmentNumber). Add the secondary product type and end residue (SecondaryProductType / SecondaryFragmentNumber) asserts the name already promised, so the text-parse path is proven to preserve the full internal-ion annotation — not just the binary round-trip test. Addresses PR smith-chem-wisc#1036 review comment (internal-ion annotation completeness). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MslProteoformScorer.MatchedFragments intentionally carries the library peak m/z/intensity, not the matched experimental peak — it is a library-vs-experiment scorer whose result identifies which library fragments matched. Experimental intensities feed only the spectral-angle score. Document this at the construction site to settle the ambiguity. Addresses PR smith-chem-wisc#1036 review comment (library vs experimental match values). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SpectralLibrary.cs had been reindented spaces -> tabs across the whole file, which inflated the PR diff by ~730 whitespace-only lines and buried the real .msl routing / internal-ion changes. Repo style is 4-space (no .editorconfig; master uses spaces). Convert leading tabs back to spaces; the file is byte-identical ignoring whitespace, so behavior is unchanged. PR diff for this file drops from 1010/758 to ~252/24. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TestAveragingSpectraWriteFile.cs (a spectra-averaging teardown rewrite) and TestDatabaseLoaders.cs (removal of an unused helper) are unrelated to the .msl spectral-library format and only widened this PR's surface. Revert both to master to keep the PR focused; the averaging-teardown improvement can land in its own PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Strip development-process residue that leaked into source:
- Mslindex.cs: drop the "// New property/method to ADD alongside it" instruction comments and re-indent Count/GetEntryAt with tabs to match the file (they were pasted in space-indented); add real XML summaries.
- MslLibrary.cs / MslReader.cs: reword four doc comments that referenced an internal "Prompt N / Prompt N Handoff" process, keeping the technical content.
No behavior change. Addresses nbollis's smith-chem-wisc#1036 review (agent residue / mechanical assembly). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… residue
The MSL test suite encoded the agent's prompt-development sequence into 16
fixture names (TestMslPromptN...) and ~50 doc/category/string mentions of
"Prompt N". Rename every fixture to describe what it tests (e.g.
TestMslPrompt13GetAllLibrarySpectraFix -> TestMslGetAllLibrarySpectra;
collisions resolved as TestMslMergerCoverage / TestMslCoverageGapsExtended),
update all cross-references, drop the [Category("PromptN")] tags, de-prompt
temp-dir names, and reword the doc/assert text. Also fixes one stale source
reference in MslSpectralLibrary.cs to the renamed version-management test.
Test logic and assertions are unchanged.
Addresses nbollis's smith-chem-wisc#1036 review (agent residue / mechanical assembly).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The PR had bumped the nuspec <version> to a dev/local-feed value (9.9.901) used while building local packages for the MetaMorpheus consumer PRs. Revert to master's 1.0.574 so the package version isn't inflated on merge; release versioning is handled separately. Addresses nbollis smith-chem-wisc#1036 review comment ("Change Nuspec back"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s override The override ran batched Torch forward passes for speed, but its results are not bit-identical to per-peptide prediction (~1e-7 drift from batched BatchNorm/matmul), which broke two pre-existing exact-equality tests (PredictRetentionTimeEquivalents_MatchesSinglePredictions and _LargeBatch) and contradicted its own "results identical to PredictCore" doc. It is not needed here: the method is declared on IRetentionTimePredictor and implemented in the base RetentionTimePredictor, so Chronologer now inherits the base implementation. The consumers (MetaMorpheus parallel search RT calibration) call the interface method and keep working; only a one-time calibration step loses the batched speedup. This file now matches master. Batched inference can return as its own PR with tolerance tests. Addresses nbollis smith-chem-wisc#1036 review ("Is this method needed? ... use the existing method"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… corruption Building spectral libraries calls Chronologer for millions of peptides. The ~30 tensor intermediates allocated inside each forward() were relying on the GC finalizer thread for disposal, which races libtorch's native allocator and corrupts the heap (0xC0000374) once enough inference has run. Wrap each Predict chunk (and the single-peptide path) in NewDisposeScope() so those tensors are freed deterministically on the calling thread. Also adds an opt-in CUDA path (MZLIB_CHRONOLOGER_CUDA=1; default CPU, numerically equivalent) behind a Device property; inert unless the env var is set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The batched PredictRetentionTimeEquivalents override runs the model at batch size m, while PredictRetentionTimeEquivalent runs it at batch size 1. libtorch selects different conv/matmul kernels for the two paths, so predictions agree only to float32 precision (~1e-7 relative), not bit-exactly. Two assertions used Is.EqualTo with no tolerance and failed on that drift. Add .Within(1e-4) - orders of magnitude above the noise floor yet tight enough to still catch a genuine batch/single mismatch. Also soften the override docstring, which claimed the two paths give identical results. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ictor Remove the MZLIB_CHRONOLOGER_CUDA opt-in and the Device plumbing: the model always runs on CPU and the batched override no longer moves tensors to a device. An env-gated execution mode is the wrong surface, the GPU path was never used in the production library build, and if GPU inference is wanted later it should be an explicit constructor parameter with benchmarks. Keeps the batched-CPU override and the NewDisposeScope() native-heap fix. Mirrors smith-chem-wisc#1077. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…guard Document why MslFragmentIon is a standalone persistence/working model rather than a subclass of or interface over the MatchedFragmentIon domain type (different layer, mutable float32 storage record, no eager Product / neutral mass, carries storage-only ExcludeFromQuant; the two are bridged only by explicit conversion). Also document why the two annotation renderings intentionally differ. Add MslFragmentIonInteropTests: round-trips a diverse fragment set (terminal, neutral-loss, internal, diagnostic) through ToLibrarySpectrum/FromLibrarySpectrum asserting every mapped field survives, so the two models can't silently drift; and pins that ExcludeFromQuant is dropped by the domain projection (it round-trips only through the binary .msl format). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Search Offline library compiler for MetaMorpheus's ManySearchTask: digests protein FASTAs once and writes compact indexed .msl libraries (precursor mass + charge + Chronologer iRT + sequence, lean / fragment-less by default) so the parallel search reads pre-built libraries instead of re-digesting thousands of FASTAs at search time. Supports per-database output, a merged db|accession-tagged index, and size-balanced sharded production builds with a global persisted iRT cache. Added as a new project under mzLib/MslDbBuilder referencing UsefulProteomics- Databases and Readers. Depends on the .msl writer/reader API from smith-chem-wisc#1036, so its base is mzlib_speclib. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What this PR adds
MslDbBuilder is a standalone .NET 8 console tool that turns protein FASTA databases into mzLib .msl spectral libraries. It is the offline "library compiler" for MetaMorpheus's ManySearchTask: instead of digesting thousands of FASTAs at search time, the search reads pre-built .msl files (precursor mass + charge + Chronologer iRT + sequence, optionally with stored fragments).
It is added as a new project under mzLib/MslDbBuilder/ referencing UsefulProteomicsDatabases and Readers, and it uses the .msl writer/reader API introduced in PR #1036. This PR is therefore stacked on #1036 (mzlib_speclib): until that merges, the diff here also shows the .msl format work; it collapses to just the builder once #1036 lands. Please review/merge #1036 first.
Plain-language explanation (what it does and why)
A spectral-library search needs a peptide source: the list of candidate peptides (with their fragment masses) to compare against each spectrum. Normally MetaMorpheus generates that on the fly by digesting a FASTA every run. For ManySearchTask we want to search one sample against thousands of organisms at once (e.g. all bacteria), and re-digesting tens of thousands of FASTAs on every search is wasteful and slow.
MslDbBuilder does that digestion once, offline, and writes the result as compact indexed .msl files: one .msl per organism, many organisms merged into a single indexed file (each entry tagged db|accession so the search can still report per-organism hits), or size-balanced shards built directly for production scale.
The payoff at search time: thousands of "databases" load as a few big indexed files instead of thousands of FASTA opens, and most candidates are eliminated by a learned precursor-mass + RT gate before any fragmentation happens.
Modes
- MslDbBuilder [--verify] [--lean] <outDir> <fasta...>: FASTA → one .msl per database.
- --shardbuild <outDir> <numShards> <fasta...|@listfile>: build N lean, size-balanced merged shards directly from FASTAs, streaming, each entry tagged db|accession, with Chronologer iRT (the production path).
- --merge <out.msl> <msl...>: combine many .msl into one merged file.
- --read / --probe / --checkrt / --rtcalib / --streamtest: round-trip, inspection, RT-calibration and writer-test utilities.
How we arrived at the production design (160 shards): the sizing journey
The goal was a single searchable library covering all ~17,300 bacterial proteomes (~30 GB of input FASTA). Getting there was not obvious; here is everything we tried and why we ended where we did.
1. "Just build one giant merged .msl." Impossible.
The .msl format stores its entry/precursor count as a 32-bit int (cap ≈ 2.147 B, i.e. 2³¹ − 1). The full bacterial set is billions of target+decoy peptide entries, well past that cap. A single merged file simply cannot represent it. → We must shard.
2. "A handful of big shards (~4–6)." The right instinct, wrong first cut.
Sizing said ~108 bytes/entry, ~11–12× the FASTA size, ~338–377 GB total. A few shards each under the 2.1 B cap looked sufficient on paper, so the first prototype tried a small shard count.
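The arithmetic behind that first estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures quoted above; it assumes 1 GB = 10⁹ bytes and a signed 32-bit count, and the function names are illustrative.

```python
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit count can hold
BYTES_PER_ENTRY = 108   # approximate per-entry size quoted above


def estimated_entries(total_bytes: int) -> int:
    # Rough entry count implied by a total library size.
    return total_bytes // BYTES_PER_ENTRY


def min_shards_under_cap(total_bytes: int) -> int:
    # Fewest shards such that no shard exceeds the 32-bit entry count.
    entries = estimated_entries(total_bytes)
    return -(-entries // INT32_MAX)  # ceiling division


GB = 10**9  # assumption: decimal gigabytes
low_entries = estimated_entries(338 * GB)
high_entries = estimated_entries(377 * GB)
low_shards = min_shards_under_cap(338 * GB)
high_shards = min_shards_under_cap(377 * GB)
```

Under these assumptions both ends of the size range imply more entries than one file can count, yet the cap alone forces only a couple of shards, which is why a small shard count looked sufficient on paper.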
3. First prototype: 150 databases → 3 shards. 2 of 3 shards crashed.
Two distinct problems surfaced:
- … IOException (file in use). Per-shard parallel predictor construction is not safe.
4. The real wall: the global iRT cache crash (the saga).
iRT is a pure function of the peptide sequence, not the organism, and the same sequence appears in thousands of proteomes. The fix for both speed and the contention above was a global iRT cache: predict each unique sequence once, reuse everywhere. We built it as a single ConcurrentDictionary<string,double?>. It corrupted (throwing NullReferenceException from inside its own TryAddInternal/TryGetValue) once it grew past ~900 M entries, even with hundreds of GB of RAM free. We misdiagnosed it twice: first as a TorchSharp/native crash, then as memory pressure. The true cause is a .NET internal scale limit: a single dictionary's backing bucket array can't resize cleanly at that magnitude. Single-threaded inserts crashed too, so it was not a concurrency misuse.
- … (ShardedSeqCache, key routed by hash & 127). Each sub-dictionary holds only tens of millions of entries, so every resize is a small allocation. This held 2.30 B unique sequences without issue, and made parallel preload safe (cache reload dropped from ~68 min to ~7 min).
5. Architecture pivot off the back of that.
- … irt_cache.tsv (kept after the build; it makes re-builds and library extensions nearly free, since the expensive iRT is never re-paid).
- … .msl is a durable checkpoint, and shard concurrency is decoupled from shard count (MSLDB_SHARD_CONCURRENCY), bounding peak RAM. An interrupt loses at most the in-flight wave's digest+write (iRT is already banked).
- … LibrarySpectrum/MzSpectrum for fragments that a lean entry doesn't carry (byte-identical output, far fewer allocations).
- … X → NaN m/z) poisoned the index range query and returned empty for the whole library; those peptides are now skipped at build time.
6. Choosing the final shard count: 160.
Three constraints drove it:
- … db|accession.
Greedy size-balanced packing of 17,300 whole databases into shards of ~2 GB lands at 160 shards: ~18 M entries and ~2.25 GB per shard, ~108 databases per shard, 358 GB total. (Two shards would technically fit under the cap, but each would be ~180 GB to load, which is unworkable; 160 keeps each shard cheap to load and resumable.)
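One common way to do greedy size-balanced packing of whole databases is the "largest first, into the currently lightest shard" rule, sketched below in Python. The description does not say which greedy rule the builder actually uses, so treat this as an assumed variant; the function names and the toy sizes are illustrative only. The last lines check that the quoted total and per-shard size are consistent with 160 shards (assuming decimal GB); they are not how the count was chosen.

```python
import heapq
from typing import Dict, List, Tuple


def pack_into_shards(db_sizes: Dict[str, int], num_shards: int) -> List[List[str]]:
    # Min-heap of (current load, shard index): the lightest shard pops first.
    heap: List[Tuple[int, int]] = [(0, s) for s in range(num_shards)]
    heapq.heapify(heap)
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    # Place the largest databases first; a database is never split.
    for name, size in sorted(db_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, s = heapq.heappop(heap)
        shards[s].append(name)
        heapq.heappush(heap, (load + size, s))
    return shards


def shard_count_for_target(total_bytes: int, target_shard_bytes: int) -> int:
    # Ceiling division: how many shards of the target size cover the total.
    return max(1, -(-total_bytes // target_shard_bytes))


toy_sizes = {"a": 9, "b": 7, "c": 6, "d": 5, "e": 4, "f": 2}
toy_shards = pack_into_shards(toy_sizes, 3)
toy_loads = [sum(toy_sizes[name] for name in shard) for shard in toy_shards]

# Consistency check on the quoted figures: 358 GB total at ~2.25 GB per shard.
quoted_count = shard_count_for_target(358 * 10**9, 2250 * 10**6)
```

Keeping databases whole is what lets every entry in a shard keep its db|accession tag without a database straddling two files.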
7. Final build & search numbers.
- … irt_cache.tsv, 127 GB, kept).
Root-cause lessons (carried forward)
- ConcurrentDictionary corrupts past ~900 M entries regardless of free RAM: shard large dictionaries (this cost a multi-day misdiagnosis: TorchSharp → memory pressure → the real .NET scale limit).
Notes for reviewers
- … mzLib/MslDbBuilder/. If the team would prefer it live elsewhere (e.g. a tools/ folder or a separate repo), say the word.
- … .msl API from #1036 (MSL mzLib Spectrum Library indexed binary format), so its base is mzlib_speclib; it will not build against master until #1036 merges.
🤖 Generated with Claude Code
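For reviewers who want a concrete picture of the sharded cache from step 4 of the sizing journey, here is a minimal Python sketch of routing keys by hash & 127 into 128 sub-dictionaries. It is not the ShardedSeqCache source: the class name, the CRC32 hash, and the per-shard locks are assumptions made for illustration, and Python dictionaries do not share the .NET scale limit described above.

```python
import threading
import zlib
from typing import Dict, List, Optional, Tuple

NUM_SHARDS = 128             # power of two, so a bit mask selects the shard
SHARD_MASK = NUM_SHARDS - 1  # 127, the mask quoted in the description


class ShardedSeqCacheSketch:
    """Hypothetical sequence -> iRT cache split across many small dictionaries."""

    def __init__(self) -> None:
        self._shards: List[Dict[str, Optional[float]]] = [{} for _ in range(NUM_SHARDS)]
        # One lock per shard (an assumption) so writers to different shards
        # do not contend with each other.
        self._locks = [threading.Lock() for _ in range(NUM_SHARDS)]

    @staticmethod
    def shard_of(sequence: str) -> int:
        # CRC32 is used only because it is stable across runs; the hash the
        # real cache uses is not stated in the description.
        return zlib.crc32(sequence.encode("utf-8")) & SHARD_MASK

    def try_add(self, sequence: str, irt: Optional[float]) -> bool:
        s = self.shard_of(sequence)
        with self._locks[s]:
            if sequence in self._shards[s]:
                return False
            self._shards[s][sequence] = irt
            return True

    def try_get(self, sequence: str) -> Tuple[bool, Optional[float]]:
        s = self.shard_of(sequence)
        with self._locks[s]:
            if sequence in self._shards[s]:
                return True, self._shards[s][sequence]
            return False, None

    def __len__(self) -> int:
        return sum(len(shard) for shard in self._shards)


cache = ShardedSeqCacheSketch()
first_add = cache.try_add("PEPTIDEK", 12.5)
second_add = cache.try_add("PEPTIDEK", 99.0)   # already cached, value kept
null_add = cache.try_add("SEQWITHNOIRT", None)  # a null iRT is still a cached answer
found = cache.try_get("PEPTIDEK")
found_null = cache.try_get("SEQWITHNOIRT")
missing = cache.try_get("NOTCACHED")
```

Because each sub-dictionary holds only a fraction of the keys, any single resize stays small, which is the property the description relies on.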