Fix/relaymon tra off hot path #929
Open
dskvr wants to merge 5 commits into
Remediation queue draining used the same deletion helper as ordinary NIP-09 publishes, but daemon startup drained that queue while deletion publishing was still suppressed for warmup. That made valid queued deletions return false before any kind:5 event was sent; remediation then logged the skipped local publish as NOT ACK'd and left the row queued.

Move the remediation drain to immediately after warmup finishes, where runWarmup's finally block has re-enabled deletion publishing. The queue still retries rows that relays reject or fail to ACK, but valid entries are no longer blocked by local startup suppression.

Constraint: Deletion publishing is intentionally suppressed during warmup for normal check processing.
Rejected: Bypass suppression inside deleteRelayCheckEvent | that would weaken the warmup guard for every deletion caller.
Confidence: high
Scope-risk: narrow
Directive: Keep remediation queue drain after runWarmup unless deletion.ts gains caller-specific suppression semantics.
Tested: deno test --no-check --unstable-sloppy-imports --allow-net --allow-env --allow-read --allow-write --allow-run tests/unit/deletion.test.ts tests/unit/nato-purge.test.ts tests/unit/nostrings-sweep.test.ts
Tested: git diff --check -- apps/relaymon/src/core/daemon.ts
Not-tested: live Fisherman deployment and relay ACK drain
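The ordering problem described above can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: only runWarmup and deleteRelayCheckEvent are names taken from the commit message, and their signatures here, along with deletionsSuppressed, drainRemediationQueue and startDaemon, are assumptions made for the example.

```typescript
// Hypothetical module-level guard; the real suppression mechanism may differ.
let deletionsSuppressed = false;
const sentDeletions: string[] = [];

// Stand-in for the shared deletion helper: returns false without publishing
// anything while the warmup guard is active.
function deleteRelayCheckEvent(relayUrl: string): boolean {
  if (deletionsSuppressed) return false;
  sentDeletions.push(relayUrl); // stands in for publishing a kind:5 event
  return true;
}

// Warmup suppresses deletions and re-enables them in finally, even on error.
async function runWarmup(work: () => Promise<void>): Promise<void> {
  deletionsSuppressed = true;
  try {
    await work();
  } finally {
    deletionsSuppressed = false;
  }
}

// Returns the rows that are still queued (not published) after one drain pass.
function drainRemediationQueue(queue: string[]): string[] {
  return queue.filter((relayUrl) => !deleteRelayCheckEvent(relayUrl));
}

// Drain only after warmup has finished, so the guard is already lifted.
async function startDaemon(queue: string[]): Promise<string[]> {
  await runWarmup(async () => {});
  return drainRemediationQueue(queue);
}
```

Draining while deletionsSuppressed is true leaves every row queued, which matches the failure described; draining after runWarmup returns lets the same rows go out without weakening the guard for other callers.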
Trusted Relay Assertion (kind 30385) processing ran inline on every relay check: recordTrustedRelayObservation + buildTrustedRelayAssertion do synchronous SQLite work, including a read-back of up to max_observations_per_relay history rows plus per-sample scoring. On Deno's single-threaded event loop (and a CPU-constrained host) those bursts starved in-flight WebSocket I/O and busted the tight check budgets (read/open timeouts), so healthy relays were falsely reported offline. This is what took Fisherman down once TRA was enabled there.

Decouple TRA from the check hot path:
- processRelay now calls enqueueTrustedRelayObservation(), a cheap guard + array push (no DB work). The queue is capped (drops oldest, warns) so it can't grow unbounded during warmup.
- A lazily-started background processor drains the queue one relay at a time and yields the event loop (delay) between relays, so the synchronous per-relay work never sustains a block long enough to starve checks. Start is deferred via setTimeout so the first item never runs inside the check's call stack.
- Throttle is configurable via trustedRelayAssertions.processing_throttle_ms (default 25ms).

publishTrustedRelayAssertion (the per-item worker) is unchanged, so its record/build/gate/publish behavior and existing coverage are preserved.

Tests: add coverage that enqueue defers work (nothing published synchronously; queue drains via processTrustedRelayQueueOnce), that it is a no-op when TRA is disabled, and that the queue caps memory by dropping oldest. Full unit suite: 369 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
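A capped queue with a lazily started, throttled drain could look roughly like the sketch below. The function names echo the commit message, but the bodies, the Observation type, MAX_QUEUE, and the way the enabled flag is passed are assumptions for illustration only; the real worker does SQLite and publish work that is stubbed out here.

```typescript
interface Observation {
  relay: string;
}

const MAX_QUEUE = 3; // assumed cap, chosen small for the example
const queue: Observation[] = [];
const published: string[] = [];
let processorRunning = false;

const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Stub for the per-item worker (record/build/gate/publish in the real code).
function publishTrustedRelayAssertion(obs: Observation): void {
  published.push(obs.relay);
}

// Processes at most one queued observation; returns false if the queue was empty.
function processTrustedRelayQueueOnce(): boolean {
  const next = queue.shift();
  if (next === undefined) return false;
  publishTrustedRelayAssertion(next);
  return true;
}

// Drains one relay at a time, yielding to the event loop between items.
async function runProcessor(throttleMs: number): Promise<void> {
  try {
    while (processTrustedRelayQueueOnce()) {
      await delay(throttleMs);
    }
  } finally {
    processorRunning = false;
  }
}

// Cheap guard + array push; no synchronous work beyond that.
function enqueueTrustedRelayObservation(
  obs: Observation,
  enabled = true,
  throttleMs = 25,
): void {
  if (!enabled) return;
  if (queue.length >= MAX_QUEUE) {
    const dropped = queue.shift(); // drop oldest to bound memory
    console.warn(`TRA queue full, dropped ${dropped?.relay}`);
  }
  queue.push(obs);
  if (!processorRunning) {
    processorRunning = true;
    // Deferred start: the first item never runs inside the caller's stack.
    setTimeout(() => void runProcessor(throttleMs), 0);
  }
}
```

The deferred setTimeout start is what keeps the first item off the check's call stack; the delay between items is what lets pending WebSocket callbacks run between bursts of per-relay work.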
Warmup intentionally works through the entire unchecked/expired backlog, which can take far longer than checkIdleMs (default 5m). The daemon only bumped the check-loop heartbeat once per batch enqueue, so during the long batch drain the heartbeat went stale while expiredRelaysCount > 0 -> checkCheckLoop returned "fail" -> health snapshot DOWN -> Kuma showed the live, actively-checking monitor as offline.

Make warmup count as healthy activity:
- daemon: bump heartbeatTracker.checkLoop after each warmup check (finally), so the heartbeat reflects real progress instead of going stale mid-batch.
- checkCheckLoop: take a warmupActive flag. During warmup the expected backlog is never a hard fail/DOWN: pass while the heartbeat shows progress (or warmup just started), warn only if genuinely stalled past checkIdleMs.
- buildHealthSnapshot: read queueManager.warmupActive and, while warming up, keep state UP unless the check loop is stalled (warn -> degraded) or a real critical failure occurs (db/signing -> down). Saturated check queue, publish backlog and elevated error rate are expected during warmup.

Tests: add health-warmup.test.ts covering the warmup vs non-warmup checkCheckLoop branches. Full unit suite: 373 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
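The warmup-aware branch of the check-loop health check might be sketched as below. This is a simplified reading of the commit message, not the real checkCheckLoop: the input shape, the CheckStatus type, and the treatment of a missing heartbeat are assumptions, and the actual function likely considers more state.

```typescript
type CheckStatus = "pass" | "warn" | "fail";

// Assumed input shape for illustration.
interface CheckLoopInput {
  now: number; // current time, ms
  lastHeartbeat: number | null; // null: no check has completed yet
  expiredRelaysCount: number;
  checkIdleMs: number;
  warmupActive: boolean;
}

function checkCheckLoop(input: CheckLoopInput): CheckStatus {
  const stalled =
    input.lastHeartbeat !== null &&
    input.now - input.lastHeartbeat > input.checkIdleMs;

  if (input.warmupActive) {
    // During warmup a backlog is expected: never a hard fail.
    // Pass while progress is visible (or warmup just started), warn if stalled.
    return stalled ? "warn" : "pass";
  }

  // Outside warmup, a stale heartbeat with pending work is a real failure.
  if (stalled && input.expiredRelaysCount > 0) return "fail";
  return "pass";
}
```

Because the daemon now bumps the heartbeat after each warmup check, the stalled condition only becomes true when checks actually stop completing, not merely because a batch is long.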
…t resilience, nocap finish race (#927)

* fix(relaymon): dynamic NIP-11 dedup, offline self-heal, kuma heartbeat & nocap finish race

Root causes (evidence-based, incl. live fisherman inspection):
1. Hostname dedup built its family from ONLINE relays only and fell back to a pure shortest-URL-wins "defensive-deny" + static override allowlists. When a hostname's root was offline (the production case: relay.29t.com root + dozens of NATO-word path-spam, all offline, all ignore=0), no canonical participated and spam was never deduped, or a legit distinct path (lang.relays.land/en) was wrongly ignored because only hardcoded hosts were rescued. It also swept getRelayInfo() over EVERY online relay per check: an O(N) synchronous SQLite sweep on Deno's single thread (the same event-loop-stall class as the TRA bug).
2. The remediation/re-eval that would heal mislabelled rows was disabled and, in any case, scoped to online+unignored rows, so the offline spam backlog never converged.
3. status.nostr.watch showed fisherman offline even though health computed UP: the KumaPusher push flapped on a transient container-IPv6 "connection refused", and backoff min(2^fails,16) stretched the heartbeat to ~32min (reset only on success), keeping the monitor DOWN long after recovery.
4. nocap logged "Ignoring open check because the promise was already fulfilled when finish() was called" on every offline relay: websocket_hard_fail() resolved the deferred directly, bypassing the timeout clear, so the open timeout fired later and called finish() a second time.

Fixes:
- hostnames.ts: rewrite relayHostnameDedup to be PURELY DYNAMIC (no static override lists). Family = ALL known siblings (online AND offline) via getRelaysByHostname; canonical = root (weighted to win), else shortest. A path survives iff its NIP-11 proves different functionality from the canonical; no-NIP-11 / identical-NIP-11 paths lose to the canonical. O(family): NIP-11 is read only for the canonical, never the whole online set. Retire the dedup-overrides modules + tests entirely.
- remediation.ts + db.ts: add rerunDedupAllUnignoredMigration, a one-time, pure-DB (no network), event-loop-yielding pass over ALL unignored rows (online+offline) so the existing backlog self-heals at boot and queues kind:5 deletions. Wired into initializeDB (supersedes the disabled online-only pass).
- kuma.ts + types.ts: cap heartbeat backoff (DEFAULT_MAX_BACKOFF_MULTIPLIER=4, configurable via health.kuma.maxBackoffMultiplier) so a transient push blip can't keep the monitor reported DOWN after recovery.
- nocap Base.ts: clear the pending check timeout in websocket_hard_fail() before resolving directly, and downgrade the now-rare ignore_result guard to debug.

Tests (RED-first): hostname-dedup-dynamic, dedup-remediation-offline, kuma-backoff, nocap timeout-lifecycle; existing hostname-dedup suite updated to the dynamic contract. Full relaymon unit suite: 359 passed. relaymon builds (deno task compile).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(relaymon): trim verbose comments to concise intent + caveats

Reduce the SDLC-narration comments added in this PR (phase/milestone refs, restated history, RED-test rationale) to durable one-liners, keeping only the genuine caveats. No behaviour change; 359 unit tests + nocap timeout test still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: sandwich <dskvr@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
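The effect of capping the heartbeat backoff can be illustrated with a small sketch. DEFAULT_MAX_BACKOFF_MULTIPLIER and its value come from the commit message; the function nextPushDelayMs, its parameters, and the assumption that the multiplier scales a base push interval are hypothetical and may not match how kuma.ts applies it.

```typescript
const DEFAULT_MAX_BACKOFF_MULTIPLIER = 4;

// Hypothetical helper: delay before the next push after a run of failures.
// The multiplier doubles per consecutive failure but never exceeds the cap.
function nextPushDelayMs(
  baseIntervalMs: number,
  consecutiveFailures: number,
  maxBackoffMultiplier: number = DEFAULT_MAX_BACKOFF_MULTIPLIER,
): number {
  const multiplier = Math.min(2 ** consecutiveFailures, maxBackoffMultiplier);
  return baseIntervalMs * multiplier;
}

// Example with an assumed 60s base interval:
const base = 60_000;
const healthy = nextPushDelayMs(base, 0); // 60_000
const capped = nextPushDelayMs(base, 5); // 240_000 (cap of 4)
const oldCap = nextPushDelayMs(base, 5, 16); // 960_000 (previous cap of 16)
```

With the lower cap, the longest gap between pushes after a transient failure is 4x the base interval rather than 16x, so a recovered monitor is reported UP again much sooner.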
next gained the TRA hot-path change (#926) and other work; the only conflict was whitespace/formatting in daemon.ts (seeder declaration + dynamic import wrap), with identical code on both sides. Kept the formatted version. relaymon unit tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No description provided.