Fix/relaymon tra off hot path #929

Open
dskvr wants to merge 5 commits into next from fix/relaymon-tra-off-hot-path
@dskvr dskvr commented Jun 16, 2026

Collaborator

No description provided.

dskvr and others added 5 commits June 14, 2026 01:30
Remediation queue draining used the same deletion helper as ordinary NIP-09 publishes, but daemon startup drained that queue while deletion publishing was still suppressed for warmup. That made valid queued deletions return false before any kind:5 event was sent, then remediation logged the skipped local publish as NOT ACK'd and left the row queued.

Move the remediation drain until immediately after warmup finishes, where runWarmup's finally block has re-enabled deletion publishing. The queue still retries rows that relays reject or fail to ACK, but valid entries are no longer blocked by local startup suppression.
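The ordering problem can be sketched as follows. This is a minimal illustration, not the daemon's code: `publishDeletion`, `drainRemediationQueue`, and the module-level flag are hypothetical stand-ins, and `runWarmup` is simplified to a synchronous function. Only the idea that `runWarmup` re-enables deletion publishing in a `finally` block comes from the description above.

```typescript
// Hypothetical stand-ins for the suppression flag and the publish helper.
let deletionSuppressed = false;
const sent: string[] = [];

// Returns false without sending anything while suppression is active.
function publishDeletion(relay: string): boolean {
  if (deletionSuppressed) return false;
  sent.push(relay);
  return true;
}

// Tries every queued row; rows that were not published stay queued.
function drainRemediationQueue(queue: string[]): string[] {
  return queue.filter((relay) => !publishDeletion(relay));
}

// Suppresses deletions for the duration of warmup and always re-enables them.
function runWarmup(duringWarmup: () => void): void {
  deletionSuppressed = true;
  try {
    duringWarmup();
  } finally {
    deletionSuppressed = false;
  }
}
```

Draining inside the `duringWarmup` callback leaves every row queued, while draining after `runWarmup` returns publishes them, which is the behaviour the move relies on.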

Constraint: Deletion publishing is intentionally suppressed during warmup for normal check processing.

Rejected: Bypass suppression inside deleteRelayCheckEvent | that would weaken the warmup guard for every deletion caller.

Confidence: high

Scope-risk: narrow

Directive: Keep remediation queue drain after runWarmup unless deletion.ts gains caller-specific suppression semantics.

Tested: deno test --no-check --unstable-sloppy-imports --allow-net --allow-env --allow-read --allow-write --allow-run tests/unit/deletion.test.ts tests/unit/nato-purge.test.ts tests/unit/nostrings-sweep.test.ts

Tested: git diff --check -- apps/relaymon/src/core/daemon.ts

Not-tested: live Fisherman deployment and relay ACK drain

Trusted Relay Assertion (kind 30385) processing ran inline on every relay
check: recordTrustedRelayObservation + buildTrustedRelayAssertion do
synchronous SQLite work, including a read-back of up to
max_observations_per_relay history rows plus per-sample scoring. On Deno's
single-threaded event loop (and a CPU-constrained host) those bursts
starved in-flight WebSocket I/O and exceeded the tight check budgets
(read/open timeouts), so healthy relays were falsely reported offline.
This is what took Fisherman down once TRA was enabled there.

Decouple TRA from the check hot path:
- processRelay now calls enqueueTrustedRelayObservation(), a cheap guard +
  array push (no DB work). The queue is capped (drops oldest, warns) so it
  can't grow unbounded during warmup.
- A lazily-started background processor drains the queue one relay at a
  time and yields the event loop (delay) between relays, so the synchronous
  per-relay work never sustains a block long enough to starve checks. Start
  is deferred via setTimeout so the first item never runs inside the check's
  call stack.
- Throttle is configurable via trustedRelayAssertions.processing_throttle_ms
  (default 25ms).

publishTrustedRelayAssertion (the per-item worker) is unchanged, so its
record/build/gate/publish behavior and existing coverage are preserved.
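A minimal sketch of the pattern described above: a capped queue whose enqueue is a guard plus an array push, drained one item per timer tick. The class and method names are hypothetical and the cap of 1000 is an assumption; only the 25 ms default throttle, the drop-oldest behaviour, and the deferred start come from this commit message.

```typescript
// Hypothetical sketch: class and method names are illustrative, not relaymon's.
class DeferredQueue<T> {
  private items: T[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  private worker: (item: T) => void;
  private maxSize: number;
  private throttleMs: number;

  constructor(worker: (item: T) => void, maxSize = 1000, throttleMs = 25) {
    this.worker = worker;
    this.maxSize = maxSize; // assumed cap; the real limit is not stated
    this.throttleMs = throttleMs; // 25 ms mirrors the documented default
  }

  get size(): number {
    return this.items.length;
  }

  // Cheap guard plus array push; the worker never runs on the caller's stack.
  enqueue(item: T, enabled = true): void {
    if (!enabled) return;
    if (this.items.length >= this.maxSize) {
      this.items.shift(); // drop the oldest entry to bound memory
      console.warn("deferred queue full: dropped oldest item");
    }
    this.items.push(item);
    this.schedule();
  }

  // Handles at most one item; returns false when the queue was empty.
  processOnce(): boolean {
    if (this.items.length === 0) return false;
    const next = this.items.shift() as T;
    this.worker(next);
    return true;
  }

  // Stops the background drain; queued items are kept.
  stop(): void {
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = null;
  }

  // Arms one timer at a time, so each item runs in its own macrotask.
  private schedule(): void {
    if (this.timer !== null) return;
    this.timer = setTimeout(() => {
      this.timer = null;
      try {
        this.processOnce();
      } catch (err) {
        console.error("deferred queue worker failed", err);
      }
      if (this.items.length > 0) this.schedule();
    }, this.throttleMs);
  }
}
```

Because every item is processed from a fresh `setTimeout` callback, pending I/O callbacks get a chance to run between relays even though the per-item work itself stays synchronous.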

Tests: add coverage that enqueue defers work (nothing published
synchronously; queue drains via processTrustedRelayQueueOnce), that it is a
no-op when TRA is disabled, and that the queue caps memory by dropping
oldest. Full unit suite: 369 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Warmup intentionally works through the entire unchecked/expired backlog,
which can take far longer than checkIdleMs (default 5m). The daemon only
bumped the check-loop heartbeat once per batch enqueue, so during the long
batch drain the heartbeat went stale while expiredRelaysCount > 0 ->
checkCheckLoop returned "fail" -> health snapshot DOWN -> Kuma showed the
live, actively-checking monitor as offline.

Make warmup count as healthy activity:
- daemon: bump heartbeatTracker.checkLoop after each warmup check (finally),
  so the heartbeat reflects real progress instead of going stale mid-batch.
- checkCheckLoop: take a warmupActive flag. During warmup the expected
  backlog is never a hard fail/DOWN — pass while the heartbeat shows progress
  (or warmup just started), warn only if genuinely stalled past checkIdleMs.
- buildHealthSnapshot: read queueManager.warmupActive and, while warming up,
  keep state UP unless the check loop is stalled (warn -> degraded) or a real
  critical failure occurs (db/signing -> down). Saturated check queue,
  publish backlog and elevated error rate are expected during warmup.
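The branch logic above can be sketched roughly as below. The function names, the input shape, and the treatment of a missing heartbeat as "warmup just started" are assumptions made for illustration; the real `checkCheckLoop` and `buildHealthSnapshot` signatures are not shown in this commit message.

```typescript
type CheckResult = "pass" | "warn" | "fail";
type HealthState = "up" | "degraded" | "down";

// Hypothetical input shape; field names are illustrative.
interface CheckLoopInput {
  nowMs: number;
  lastHeartbeatMs: number | null; // null: no check has completed yet
  expiredRelaysCount: number;
  checkIdleMs: number;
  warmupActive: boolean;
}

function checkCheckLoopSketch(input: CheckLoopInput): CheckResult {
  const stale = input.lastHeartbeatMs === null ||
    input.nowMs - input.lastHeartbeatMs > input.checkIdleMs;
  if (input.warmupActive) {
    // Assumption: no heartbeat yet during warmup means warmup just started.
    if (input.lastHeartbeatMs === null) return "pass";
    // A backlog is expected during warmup, so a stall is only a warning.
    return stale ? "warn" : "pass";
  }
  // Outside warmup, a stale heartbeat with a pending backlog is a hard fail.
  return stale && input.expiredRelaysCount > 0 ? "fail" : "pass";
}

// State while warming up: only a critical failure (db/signing) reports down.
function warmupStateSketch(checkLoop: CheckResult, criticalFailure: boolean): HealthState {
  if (criticalFailure) return "down";
  return checkLoop === "pass" ? "up" : "degraded";
}
```

With `checkIdleMs` at its 5-minute default, a heartbeat that is 6 minutes old and a non-empty backlog yields `"fail"` outside warmup but only `"warn"` during it.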

Tests: add health-warmup.test.ts covering the warmup vs non-warmup
checkCheckLoop branches. Full unit suite: 373 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…t resilience, nocap finish race (#927)

* fix(relaymon): dynamic NIP-11 dedup, offline self-heal, kuma heartbeat & nocap finish race

Root causes (evidence-based, incl. live Fisherman inspection):

1. Hostname dedup built its family from ONLINE relays only and fell back to a
   pure shortest-URL-wins "defensive-deny" + static override allowlists. When a
   hostname's root was offline (the production case: relay.29t.com root + dozens
   of NATO-word path-spam, all offline, all ignore=0), no canonical participated
   and spam was never deduped — or a legit distinct path (lang.relays.land/en)
   was wrongly ignored because only hardcoded hosts were rescued. It also swept
   getRelayInfo() over EVERY online relay per check — an O(N) synchronous SQLite
   sweep on Deno's single thread (the same event-loop-stall class as the TRA bug).

2. The remediation/re-eval that would heal mislabelled rows was disabled and, in
   any case, scoped to online+unignored rows — so the offline spam backlog never
   converged.

3. status.nostr.watch showed Fisherman offline even though health computed UP:
   the KumaPusher push flapped on a transient container-IPv6 "connection refused",
   and backoff min(2^fails,16) stretched the heartbeat to ~32min (reset only on
   success), keeping the monitor DOWN long after recovery.

4. nocap logged "Ignoring open check because the promise was already fulfilled
   when finish() was called" on every offline relay: websocket_hard_fail()
   resolved the deferred directly, bypassing the timeout clear, so the open
   timeout fired later and called finish() a second time.
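Root cause 4 can be illustrated with a small stand-in for an open check. `OpenCheckSketch` and its members are hypothetical and do not mirror nocap's Base.ts; the sketch only shows why a hard-fail path that resolves the deferred directly has to clear the pending timeout first.

```typescript
// Hypothetical stand-in for a nocap-style open check; not the real Base.ts.
class OpenCheckSketch {
  finishCalls = 0;
  readonly result: Promise<string>;
  private settled = false;
  private timer: ReturnType<typeof setTimeout> | null;
  private resolveResult: (outcome: string) => void;

  constructor(timeoutMs: number) {
    let resolveFn: (outcome: string) => void = () => {};
    this.result = new Promise<string>((resolve) => {
      resolveFn = resolve; // the executor runs synchronously
    });
    this.resolveResult = resolveFn;
    this.timer = setTimeout(() => this.finish("timeout"), timeoutMs);
  }

  get timeoutPending(): boolean {
    return this.timer !== null;
  }

  // Normal completion path: clears the timeout, then resolves once.
  finish(outcome: string): void {
    this.finishCalls++;
    if (this.settled) return; // a late second call would be ignored here
    this.settled = true;
    this.clearTimer();
    this.resolveResult(outcome);
  }

  // Hard-fail path: resolves directly, so it must clear the timeout itself.
  hardFail(): void {
    this.clearTimer(); // without this, the timer later calls finish() again
    this.settled = true;
    this.resolveResult("hard_fail");
  }

  private clearTimer(): void {
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = null;
  }
}
```

If `hardFail()` skipped `clearTimer()`, the timer would still fire after `timeoutMs` and reach the `settled` guard in `finish()`, which is the situation the "already fulfilled" log line reports.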

Fixes:

- hostnames.ts: rewrite relayHostnameDedup to be PURELY DYNAMIC (no static
  override lists). Family = ALL known siblings (online AND offline) via
  getRelaysByHostname; canonical = root (weighted to win), else shortest. A path
  survives iff its NIP-11 proves different functionality from the canonical;
  no-NIP-11 / identical-NIP-11 paths lose to the canonical. O(family): NIP-11 is
  read only for the canonical, never the whole online set. Retire the
  dedup-overrides modules + tests entirely.
- remediation.ts + db.ts: add rerunDedupAllUnignoredMigration — a one-time,
  pure-DB (no network), event-loop-yielding pass over ALL unignored rows
  (online+offline) so the existing backlog self-heals at boot and queues kind:5
  deletions. Wired into initializeDB (supersedes the disabled online-only pass).
- kuma.ts + types.ts: cap heartbeat backoff (DEFAULT_MAX_BACKOFF_MULTIPLIER=4,
  configurable via health.kuma.maxBackoffMultiplier) so a transient push blip
  can't keep the monitor reported DOWN after recovery.
- nocap Base.ts: clear the pending check timeout in websocket_hard_fail() before
  resolving directly, and downgrade the now-rare ignore_result guard to debug.
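The capped heartbeat backoff can be sketched as below. The function name and parameters are hypothetical; the `min(2^fails, cap)` form and the default cap of 4 are taken from the description above, while the exact formula in kuma.ts is an assumption.

```typescript
// Default cap named in the commit message.
const DEFAULT_MAX_BACKOFF_MULTIPLIER = 4;

// Hypothetical helper: delay before the next push after N consecutive failures.
function nextPushDelayMs(
  baseIntervalMs: number,
  consecutiveFailures: number,
  maxBackoffMultiplier: number = DEFAULT_MAX_BACKOFF_MULTIPLIER,
): number {
  const exponent = Math.max(0, consecutiveFailures);
  // Exponential growth, clamped so recovery is reported within a few intervals.
  const multiplier = Math.min(2 ** exponent, maxBackoffMultiplier);
  return baseIntervalMs * multiplier;
}
```

For a base interval of 60,000 ms, ten consecutive failures give a 240,000 ms delay with the default cap, compared with 960,000 ms under the previous cap of 16.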

Tests (RED-first): hostname-dedup-dynamic, dedup-remediation-offline,
kuma-backoff, nocap timeout-lifecycle; existing hostname-dedup suite updated to
the dynamic contract. Full relaymon unit suite: 359 passed. relaymon builds
(deno task compile).
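The dynamic dedup rule from the hostnames.ts fix can be sketched as follows. Every name here is hypothetical, and how a NIP-11 document "proves different functionality" is not specified in this commit message, so the sketch takes that comparison as a caller-supplied predicate rather than guessing at it.

```typescript
// Hypothetical shape for one relay URL on a shared hostname.
interface Sibling {
  url: string;
  nip11?: unknown; // parsed NIP-11 document, if one is known
}

// Assumption: the caller decides what counts as proof of different functionality.
type ProvesDifferent = (candidate: Sibling, canonical: Sibling) => boolean;

function isRootUrl(url: string): boolean {
  const path = new URL(url).pathname;
  return path === "" || path === "/";
}

// Canonical = root if the family has one, else the shortest URL.
function pickCanonical(family: Sibling[]): Sibling | undefined {
  const root = family.find((s) => isRootUrl(s.url));
  if (root !== undefined) return root;
  const byLength = [...family].sort(
    (a, b) => a.url.length - b.url.length || a.url.localeCompare(b.url),
  );
  return byLength[0];
}

// Splits a hostname family (online and offline siblings) into keep/ignore.
function dedupFamilySketch(
  family: Sibling[],
  provesDifferent: ProvesDifferent,
): { keep: string[]; ignore: string[] } {
  const canonical = pickCanonical(family);
  const keep: string[] = [];
  const ignore: string[] = [];
  if (canonical === undefined) return { keep, ignore };
  for (const sibling of family) {
    // A path without NIP-11, or with one that proves nothing, loses.
    const survives = sibling === canonical ||
      (sibling.nip11 !== undefined && provesDifferent(sibling, canonical));
    (survives ? keep : ignore).push(sibling.url);
  }
  return { keep, ignore };
}
```

The work is proportional to the size of one hostname family, which matches the O(family) claim; nothing in the sketch iterates over the full online relay set.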

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(relaymon): trim verbose comments to concise intent + caveats

Reduce the SDLC-narration comments added in this PR (phase/milestone refs,
restated history, RED-test rationale) to durable one-liners, keeping only the
genuine caveats. No behaviour change; 359 unit tests + nocap timeout test still
pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: sandwich <dskvr@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

next gained the TRA hot-path change (#926) and other work; the only conflict
was whitespace/formatting in daemon.ts (seeder declaration + dynamic import
wrap) — identical code. Kept the formatted version. relaymon unit tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>