fix(roast): persist jobs + SSE in Redis to stop "job not found" 404s#4

Merged
hp-8 merged 1 commit into main from fix/redis-job-store on Jun 17, 2026

hp-8 commented Jun 17, 2026

Problem

Roast job state + the SSE event log lived in process RAM. Any worker restart on Render free tier (idle spin-down, OOM during a swarm, redeploy) wiped the store, so GET /api/roast/<id>/stream and the poll endpoint returned 404 "job not found" mid-run.

Fix

State store, not a task queue — pipeline still runs in-process; Redis just makes state survive restarts.

  • services/job_store.py — pluggable store:
    • RedisJobStore (when REDIS_URL set): job state, a replayable event log, and the cancel flag, all with a TTL.
    • InMemoryJobStore (default): original behaviour → dev/test/CI unchanged.
    • make_store() falls back to in-memory if Redis is unreachable, so boot never fails.
  • Pipeline checkpoints state at each stage + completion. SSE replays the full event log from the start → reconnecting/late clients catch up.
  • Staleness guard: a non-terminal job whose pipeline thread died (worker crash) surfaces as failed instead of hanging forever.
  • Cancel uses a Redis flag + TTL expiry → reconnect sees cancelled, not 404.
  • Config: REDIS_URL, ROAST_JOB_TTL, ROAST_STALE_SECONDS. Wired in render.yaml + .env.example. redis>=5.0.0 added across requirements/-prod/pyproject/uv.lock.
  • Bump transitive form-data 4.0.5→4.0.6 (audit gate, GHSA-hmw2-7cc7-3qxx; pre-existing, unrelated).
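
A minimal sketch of the store shape and the factory fallback described above, for readers who have not opened the diff. It is not the code in services/job_store.py: the method names, the `roast:<id>:...` key layout, the 3600-second default TTL, and the 2-second connect timeout are all assumptions made for illustration. Only the class names, `make_store()`, and `REDIS_URL` come from this PR.

```python
import json
import os
from typing import Any, Optional


class InMemoryJobStore:
    """Process-local store: state and events vanish on worker restart."""

    def __init__(self) -> None:
        self._state: dict = {}
        self._events: dict = {}

    def save_state(self, job_id: str, state: dict) -> None:
        self._state[job_id] = dict(state)

    def load_state(self, job_id: str) -> Optional[dict]:
        state = self._state.get(job_id)
        return dict(state) if state is not None else None

    def append_event(self, job_id: str, event: dict) -> None:
        self._events.setdefault(job_id, []).append(dict(event))

    def events(self, job_id: str, start: int = 0) -> list:
        # Replay from `start`; 0 means the full log, so late clients catch up.
        return list(self._events.get(job_id, [])[start:])


class RedisJobStore:
    """Same interface, backed by Redis keys that expire after `ttl` seconds."""

    def __init__(self, client: Any, ttl: int) -> None:
        self._r = client
        self._ttl = ttl

    def save_state(self, job_id: str, state: dict) -> None:
        # Hypothetical key layout; SET with `ex` refreshes the TTL on each write.
        self._r.set(f"roast:{job_id}:state", json.dumps(state), ex=self._ttl)

    def load_state(self, job_id: str) -> Optional[dict]:
        raw = self._r.get(f"roast:{job_id}:state")
        return json.loads(raw) if raw is not None else None

    def append_event(self, job_id: str, event: dict) -> None:
        key = f"roast:{job_id}:events"
        self._r.rpush(key, json.dumps(event))
        self._r.expire(key, self._ttl)

    def events(self, job_id: str, start: int = 0) -> list:
        raw = self._r.lrange(f"roast:{job_id}:events", start, -1)
        return [json.loads(item) for item in raw]


def make_store(url: Optional[str] = None, ttl: int = 3600):
    """Return a Redis-backed store when reachable, else the in-memory one."""
    url = url if url is not None else os.environ.get("REDIS_URL")
    if not url:
        return InMemoryJobStore()
    try:
        import redis  # imported lazily so dev/test/CI work without the package

        client = redis.Redis.from_url(url, socket_connect_timeout=2)
        client.ping()  # fail here, at boot, rather than on the first request
        return RedisJobStore(client, ttl)
    except Exception:
        # Unreachable or misconfigured Redis must never stop the app booting.
        return InMemoryJobStore()
```

Because both classes expose the same methods, the pipeline and the SSE endpoint can call `store.append_event(...)` and `store.events(job_id)` without knowing which backend is active.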

Tests

  • backend/tests/test_job_store.py — both backends (Redis via in-process fake), serialization roundtrip, factory fallback.
  • Full ci-local.sh green: backend pytest (259 passed, 9 skipped) + ruff, frontend eslint/vitest/build/prod-audit.

Deploy step (required)

Create an Upstash Redis DB → set REDIS_URL=rediss://... in Render swarmie-backend env → redeploy. Without it, store stays in-memory and the 404 persists.

Scope note

Mid-run OOM still ends that live run (thread dies with the worker) — now shown as a clean "interrupted, start again" instead of a 404. Surviving OOM mid-run needs the deferred task-queue path.
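
One way the staleness guard behind that "interrupted" result could work, as a hedged sketch. It assumes the pipeline stamps a hypothetical `updated_at` timestamp at each checkpoint and that a job is considered dead once that stamp is older than ROAST_STALE_SECONDS. The field names, the `completed` status name, and the 120-second default are illustrative, not taken from the diff.

```python
import time
from typing import Optional

# "failed" and "cancelled" appear in this PR; "completed" is an assumed name.
TERMINAL = {"completed", "failed", "cancelled"}


def effective_state(
    state: dict, now: Optional[float] = None, stale_seconds: float = 120.0
) -> dict:
    """Return the state to show clients, marking abandoned jobs as failed."""
    now = time.time() if now is None else now
    if state["status"] in TERMINAL:
        return state  # finished jobs are never rewritten
    if now - state["updated_at"] <= stale_seconds:
        return state  # recent checkpoint: assume the pipeline is still alive
    # No checkpoint for too long: the worker most likely died mid-run.
    return {**state, "status": "failed", "error": "interrupted, start again"}
```

The guard returns a new dict rather than mutating the stored one, so a read never rewrites persisted state on its own.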

Roast job state and the SSE event log lived in process RAM, so any worker
restart (idle spin-down, OOM, redeploy on Render free tier) wiped the store
and made /stream and poll return 404 mid-run.

- New services/job_store.py: pluggable store. RedisJobStore (when REDIS_URL
  is set) persists job state, a replayable event log, and the cancel flag with
  a TTL; InMemoryJobStore keeps the original behaviour for dev/test. Factory
  falls back to in-memory if Redis is unreachable so boot never fails.
- Pipeline checkpoints state at each stage + on completion; SSE now replays
  the full event log from the start so reconnecting/late clients catch up.
- Staleness guard: a non-terminal job with a dead pipeline thread surfaces as
  failed instead of hanging forever.
- Cancel uses a Redis flag and lets state expire via TTL (reconnect sees
  cancelled, not 404).
- Add redis dep (requirements/-prod/pyproject/uv.lock), REDIS_URL +
  ROAST_JOB_TTL + ROAST_STALE_SECONDS config, render.yaml + .env.example.
- Tests: backend/tests/test_job_store.py (both backends, factory fallback).
- Bump transitive form-data 4.0.5 -> 4.0.6 (audit gate, GHSA-hmw2-7cc7-3qxx).

vercel Bot commented Jun 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| swarmie | Ready | Preview, Comment | Jun 17, 2026 4:41pm |

@hp-8 hp-8 merged commit f30f140 into main Jun 17, 2026
6 checks passed