universe: improve sync reliability and performance#2187
Conversation
Adds a SyncFixture that pairs two SQLite-backed universes (local and remote) with a SimpleSyncer wired to treat one side as remote, so a bench can drive SyncUniverse end-to-end without any network I/O. The direct-write registrar bypasses Archive verification because the seeded corpus is random proofs and the contention we want to observe lives in MultiverseStore.UpsertProofLeafBatch, not the verifier. SyncMetrics reports batches, leaves inserted, DB retry errors, and dependency-missing errors via b.ReportMetric so scenarios laid down in a follow-up commit surface each as its own benchstat column. Fraction is a bounded [0, 1] newtype used for LocalOverlap so a malformed seed spec fails at NewFraction rather than silently drifting the workload. SeedSpec keeps issuance and transfer as separate fields rather than a flat list tagged by proof type, so a caller cannot accidentally interleave the two.
Adds three benches that drive the sync fixture end to end: FreshLocal for the "first sync into an empty node" case, MostlySynced for the "resume after most leaves already present" case (which is the key one for demonstrating the SetDiff fix), and Mixed for a workload that interleaves issuance and transfer roots. The benches directly reproduce two of the three symptoms from issue lightninglabs#2026 with no code change: - FreshLocal/roots=50/leaves=200 crashes with "db tx retries exceeded: database is locked (5) (SQLITE_BUSY)" — the tx contention Phase 3 targets. - Every MostlySynced/* iteration inserts every remote leaf even when the local side already has 90% of them (e.g. 200 leaves fetched at leaves=200 when only 20 are new). This is the pointer-identity SetDiff bug Phase 1 targets. leaves_inserted reports the over-fetch each time. Baseline (Apple M4, sqlite, -benchtime=1x): MostlySynced/roots=10/leaves=50 2.64s 500 leaves (20 expected) MostlySynced/roots=10/leaves=200 22.1s 2000 leaves (200 expected) MostlySynced/roots=50/leaves=50 15.9s 2500 leaves (250 expected) MostlySynced/roots=50/leaves=200 138 s 10000 leaves (1000 expected) FreshLocal/roots=10/leaves=50 2.08s 500 leaves FreshLocal/roots=10/leaves=200 16.5s 2000 leaves FreshLocal/roots=50/leaves=50 13.0s 2500 leaves FreshLocal/roots=50/leaves=200 FAIL (db tx retries exceeded) Mixed/leaves=50 10.4s 2000 leaves Mixed/leaves=200 84.2s 8000 leaves bench/results/ is gitignored so the raw output lives in a scratch file locally; the numbers above are what the follow-up phases will compare against.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses reliability and performance issues in the universe sync pipeline. By optimizing how leaves are diffed, enforcing a logical order for root processing, and bounding concurrency to protect database stability, the changes significantly reduce redundant work and eliminate common failure modes observed at production scale. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request significantly improves the performance and reliability of universe sync by introducing bounded concurrency, partitioning roots to process issuance before transfer, and implementing content-based leaf key diffing to prevent redundant fetches. It also reduces the default sync batch size to mitigate database transaction contention. The review feedback suggests adding defensive nil checks in the new diffLeafKeys function to prevent potential nil pointer dereference panics when handling interface slices.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
fn.SetDiff keys its intermediate maps by the LeafKey interface value, which for the common BaseLeafKey implementation compares the embedded *asset.ScriptKey by pointer address, not by pubkey content. tapdb's mintingKeys allocates a fresh ScriptKey per row per call, so the remote and local sides of the syncer's diff never observe overlapping pointer identities — even when the underlying content is identical. The result: on a mostly-synced node, every remote leaf looks new, and the syncer re-fetches the entire tree on each pass. This is the root cause of the "instantaneous 100MB traffic spikes" reported in issue lightninglabs#2026. Replace the call at syncer.go:336 with a small diffLeafKeys helper keyed by UniverseKey() — the same 32-byte content hash the multiverse stores under. The helper preserves remote's relative order too, which the downstream error-attribution loop indexes into. Unit tests in universe/syncer_test.go pin the direct regression (pointer-invariance) plus subset and subsequence semantics. Rapid property tests in syncer_diff_property_test.go elevate the same invariants to universally-quantified form, including extensional agreement with a slower reference impl comparing (outpoint, script pubkey) directly. Bench (Apple M4, -benchtime=1x), MostlySynced with 90% overlap: roots=10/leaves=50 2.64s -> 0.39s 500 -> 50 leaves inserted roots=10/leaves=200 22.1s -> 2.59s 2000 -> 200 leaves inserted
The previous executeSync ran every remote root through a single fn.ParSlice, in whatever order fetchAllRoots returned them. A transfer root could race ahead of the issuance root it depends on; when the archive verifier fetched the previous asset snapshot it would surface "no universe proof found" and abort the sync, which in issue lightninglabs#2026 the reporter observed as an ongoing loop of partial syncs. Introduce SortedRoots as a typed partition — Issuance and Transfer are separate fields, structurally impossible to interleave — and run the syncRoots fan-out twice, issuance first. syncRoots itself is factored out for reuse in the follow-up concurrency commit, where the fan-out swaps in a bounded worker pool. TestExecuteSync_IssuanceBeforeTransfer wires a minimal DiffEngine that records RootNode calls and asserts every issuance-typed call precedes every transfer-typed call. Rapid property tests in syncer_partition_property_test.go pin the three invariants of SortedRoots: soundness (each bucket holds only its own proof type), totality (every recognised root lands in exactly one bucket), and order-preservation within each bucket. ProofTypeUnspecified is included in the generator to cover the drop case matching pre- partition uniIdSyncFilter behaviour.
Two changes wired together to relieve the DB tx contention issue lightninglabs#2026 reports as "db tx retries exceeded": - SimpleSyncCfg gains an internal SyncRootConcurrency knob and syncRoots swaps its unbounded fn.ParSlice for a bounded errgroup.SetLimit worker pool. NewSimpleSyncer clamps non-positive inputs to 1 so the internal fan-out can trust the "at least one worker" invariant without defensive checks. - defaultUniverseSyncBatchSize drops from 200 to 50, shortening per-tx write hold times. defaultUniverseSyncRootConcurrency lands at 2 — the value the issue reporter observed as retry-free in their environment. The knob is kept internal for now; no CLI flag. tapcfg wires production to (batch=50, concurrency=2). The bench fixture picks the same defaults so scenarios reflect the shipped workload. TestSyncRoots_HonoursConcurrencyCap spins up a probe with a live in-flight gauge and asserts peak concurrency never exceeds the cap across cap ∈ {1, 2, 4, 8}. TestNewSimpleSyncer_ClampsNonPositive Concurrency pins the constructor invariant. The ordering test introduced in the partition commit is upgraded here now that SyncRootConcurrency exists: it runs at concurrency 8 with a 1ms sleep in the recorder so a hypothetical single-pool refactor collapsing the two syncRoots calls would fail with observably interleaved orderings rather than passing by accident. Bench (Apple M4, -benchtime=1x): FreshLocal/roots=50/leaves=200 FAIL (db tx retries) -> 98.2s 0 retries
Resolves #2026.
This should substantially improve universe sync reliability and performance. In some cases the performance improvement/bandwidth reduction can be multiple orders of magnitude, and some DB contention/failure modes are eliminated entirely.
Codex's estimate of the reliability/performance wins:
The PR itself fixes three issues, each in a different layer of the universe sync pipeline (Opus's summary follows):