Add Modal deploy guardrail tooling #145

Open
therealityreport wants to merge 4 commits into main from codex/publish/trr-backend/20260610-modal-deploy-guardrails

Conversation

@therealityreport

@therealityreport therealityreport commented Jun 10, 2026

Owner

Summary

  • recover the missing Modal deploy wrapper, API canary helper, and wrong-workspace cleanup script
  • add tests for deploy safety and lazy imports
  • make wrong-workspace Modal stop non-interactive with --yes

Verification

  • .venv/bin/python -m pytest tests/scripts/test_cleanup_wrong_workspace_modal.py tests/scripts/test_deploy_backend_modal.py tests/utils/test_lazy_imports.py tests/scripts/test_verify_modal_readiness.py
  • git diff --check

Orchestrator

  • target: therealityreport/trr-backend
  • action: publish-and-sync
  • mode: takeover
  • source branch: codex/recover-modal-deploy-guardrails

Summary by CodeRabbit

  • Documentation

    • Added an incident report detailing a backend API crash loop, root cause, remediation, verification, and deploy history.
  • New Features

    • Added an API cold-start /health canary run after deployments.
    • Added a CLI to detect and optionally stop mistaken workspace deployments.
    • Added a local loopback internal-admin proxy bypass for admin requests.
    • Added a lazy-import helper to defer optional module loading.
  • Chores

    • Improved deploy safety: workspace validation, image payload checks, readiness verification, and incident stamping.
  • Tests

    • Added unit tests covering deploy tooling, canary behavior, cleanup CLI, auth proxy behavior, and lazy-imports.

Copilot AI review requested due to automatic review settings June 10, 2026 14:42
@coderabbitai

coderabbitai Bot commented Jun 10, 2026


Review Change Stack

📝 Walkthrough

Walkthrough

Adds Modal deployment and observability tooling: lazy imports, API cold-start canary helpers, a deploy orchestration CLI that enforces workspace and stamps deploy history, a wrong-workspace cleanup CLI, local loopback admin-proxy support, unit tests, and an incident runbook.

Changes

Modal Deployment Tooling and Observability

| Layer / File(s) | Summary |
|---|---|
| Lazy module loading utility: trr_backend/utils/lazy_imports.py, tests/utils/test_lazy_imports.py | LazyModule defers importlib.import_module until first attribute access; test verifies import occurs only on first access. |
| API canary health check helpers: scripts/modal/api_canary.py | health_url() normalizes base URL and appends /health; run_api_cold_start_canary() retries /health with configurable attempts/timeout and returns structured results; skipped_api_canary() returns a not-ran payload. |
| Deploy backend orchestration and verification: scripts/modal/deploy_backend.py, tests/scripts/test_deploy_backend_modal.py | Orchestrator pins MODAL_PROFILE, enforces required workspace, builds/runs modal deploy, runs readiness probe and optional API cold-start canary, fetches deploy history, and stamps incident notes. Tests cover env pinning, workspace validation, command builders, readiness/canary behavior, history stamping, and note path resolution. |
| Wrong workspace deployment cleanup utility: scripts/modal/cleanup_wrong_workspace_deploy.py, tests/scripts/test_cleanup_wrong_workspace_modal.py | CLI detects and optionally stops Modal apps deployed to a wrong workspace by verifying authoritative readiness, validating profile/workspace resolution, checking app presence/history, and optionally issuing modal app stop. Tests validate readiness blocking, successful stop, and rejection when the wrong profile resolves to authoritative. |
| Local loopback internal-admin proxy: api/auth.py, tests/api/test_auth.py | Adds x-trr-local-admin-proxy handling plus host normalization and loopback detection; require_internal_admin can short-circuit to a constructed internal-admin identity when proxied from loopback. Tests verify acceptance from loopback and rejection of remote spoofing. |
| Incident report documentation: docs/observability/modal-v439-v440-serve-backend-api-crash-loop-2026-05-28.md | Runbook documenting Modal serve_backend_api crash loop (v439→v440), root cause (eager scripts imports + missing mount), remediations (lazy-loading, minimal mounts, payload validation, post-deploy canary), follow-up guard regression (v442) and fix (v443), verification, and deploy history stamp. |
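
To make the lazy-loading remediation concrete, here is a minimal sketch of a module proxy that defers `importlib.import_module` until the first attribute access. It is not the code in trr_backend/utils/lazy_imports.py, which is not shown on this page; the private attribute names and the `json` demo are assumptions for illustration.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Illustrative proxy: imports the named module on first attribute access."""

    def __init__(self, module_name: str) -> None:
        self._module_name = module_name
        self._module: Optional[ModuleType] = None  # nothing imported yet

    def _load(self) -> ModuleType:
        # Import once, then reuse the cached module object.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return self._module

    def __getattr__(self, name: str) -> Any:
        # __getattr__ only runs for names not found by normal lookup, so the
        # private attributes set in __init__ never trigger an import.
        return getattr(self._load(), name)


# Demo with a stdlib module standing in for an optional heavy dependency.
json_lazy = LazyModule("json")
loaded_before = json_lazy._module is not None
encoded = json_lazy.dumps({"ok": True})
loaded_after = json_lazy._module is not None
```

The point of the pattern is that creating the proxy at module import time costs nothing, so a missing mount only fails when the optional code path is actually used.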

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A lazy hop, imports delayed till need,

Canaries knock softly on /health indeed,
Deploys get stamped, wrong-workspace apps unmade,
A loopback whisper grants an admin’s aid,
The rabbit nibbles logs and hums with glee.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 2.74% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Add Modal deploy guardrail tooling' directly and accurately describes the main objective: adding Modal deploy safety infrastructure including deploy wrapper, API canary helpers, wrong-workspace cleanup script, and supporting tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@github-actions

Contributor

Codex Exhaustive Code Review

Findings

  1. High - deploy wrapper cannot complete post-deploy validation
    In scripts/modal/deploy_backend.py, build_readiness_command() passes --probe-api-canary and --api-canary-timeout-seconds to scripts/modal/verify_modal_readiness.py. That readiness CLI does not define either argument in its parser, so after modal deploy succeeds, verify_deployed_readiness() will raise from subprocess.run(..., check=True), main() will print Modal deploy canary failed, and the wrapper exits 1. Practical impact: the new “required deploy path” reports failed deploys even when the deploy succeeded, and never reaches the intended fallback canary or incident stamp. Smallest fix: either add those flags and api_canary output to verify_modal_readiness.py, or remove the unsupported flags and rely on the existing fallback run_api_cold_start_canary(readiness["api_web_url"]). Add a test that validates the built readiness command against the real readiness parser.
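
The suggested regression test could look roughly like the sketch below: parse the arguments the wrapper builds with the readiness CLI's own parser and fail if any flag is unrecognized. Both `build_readiness_parser()` and `build_readiness_command()` here are hypothetical stand-ins, as is the `--json` flag; the real parser and builder are not shown on this page.

```python
import argparse
from typing import List


def build_readiness_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the parser defined in verify_modal_readiness.py.
    parser = argparse.ArgumentParser(prog="verify_modal_readiness")
    parser.add_argument("--json", action="store_true")  # assumed flag
    return parser


def build_readiness_command() -> List[str]:
    # Hypothetical stand-in for the wrapper's command builder, including the
    # two flags the finding says the readiness CLI does not define.
    return [
        "python",
        "scripts/modal/verify_modal_readiness.py",
        "--json",
        "--probe-api-canary",
        "--api-canary-timeout-seconds",
        "30",
    ]


def unsupported_flags(parser: argparse.ArgumentParser, command: List[str]) -> List[str]:
    # Drop the interpreter and script path, then let argparse report leftovers
    # instead of exiting, so the test can assert on them.
    _, unknown = parser.parse_known_args(command[2:])
    return [token for token in unknown if token.startswith("--")]


missing = unsupported_flags(build_readiness_parser(), build_readiness_command())
```

In a real test the assertion would be that `missing` is empty; with these stand-ins it lists the two flags from the finding, which is exactly the mismatch such a test would catch before deploy time.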

No other blocking findings found in the PR diff.

Notes

The incident note says api.routers.admin_show_sync now lazy-loads scripts.sync.*, but admin_show_sync.py is unchanged in this PR range and still imports those modules at import time. I did not count that as a newly introduced regression because it is outside the changed diff, but if this PR is meant to deliver that fix, the implementation is missing from the reviewed changes.

Validation: git diff --check passed, and AST parsing of all added Python files passed. I could not run the targeted pytest files because this runner has no pytest installed (python -m pytest -> No module named pytest).

backend_surface: Modal deployment/cleanup scripts, canary helper, lazy import utility.
contracts_changed: No API or DB contract changes in the diff.
migrations_added: None.
downstream_updates: None required from reviewed code.

Copilot AI left a comment

Pull request overview

Adds back operational guardrails around deploying the TRR backend on Modal (pinned workspace/profile deploy wrapper, API /health cold-start canary, and wrong-workspace cleanup), plus a small lazy-import helper and tests/docs to prevent a repeat of prior startup regressions.

Changes:

  • Introduces scripts/modal/deploy_backend.py wrapper that pins the Modal profile/workspace, runs readiness checks, and performs an API cold-start canary, optionally stamping an incident note with deploy history.
  • Adds scripts/modal/cleanup_wrong_workspace_deploy.py to stop mistaken deployments from a non-authoritative Modal workspace (non-interactive --yes).
  • Adds trr_backend/utils/lazy_imports.py and corresponding tests, plus new tests for the deploy/cleanup tooling and an incident note document.
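
As an illustration of the cold-start canary idea, the sketch below retries a /health URL a bounded number of times and returns a structured result. It is not the implementation in scripts/modal/api_canary.py; the function names, default values, and result keys beyond `url`, `status`, and `attempt` are assumptions, and the `fetch` and `sleep` parameters exist only so the logic can be exercised without a network.

```python
import time
import urllib.request
from typing import Any, Callable, Dict, Optional


def fetch_status(url: str, timeout_seconds: float) -> int:
    # urlopen raises an OSError subclass on connection errors, timeouts,
    # and HTTP error statuses, which the caller treats as a failed attempt.
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.status


def run_cold_start_canary(
    url: str,
    *,
    attempts: int = 3,
    timeout_seconds: float = 30.0,
    delay_seconds: float = 2.0,
    fetch: Callable[[str, float], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    last_error: Optional[str] = None
    for attempt in range(1, attempts + 1):
        try:
            status = fetch(url, timeout_seconds)
        except OSError as exc:
            last_error = str(exc)
        else:
            if status == 200:
                return {"ok": True, "url": url, "status": status, "attempt": attempt}
            last_error = f"unexpected status {status}"
        if attempt < attempts:
            sleep(delay_seconds)  # give a cold container time to start
    return {"ok": False, "url": url, "status": None, "attempt": attempts, "error": last_error}
```

Returning a dictionary rather than raising keeps the caller in control of whether a failed canary aborts the deploy or is only recorded in the incident stamp.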

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
|---|---|
| trr_backend/utils/lazy_imports.py | Adds a simple lazy module proxy to defer optional imports until first use. |
| tests/utils/test_lazy_imports.py | Tests lazy-import behavior (import only occurs on first attribute access). |
| scripts/modal/api_canary.py | Adds shared /health canary helper for post-deploy cold-start verification. |
| scripts/modal/deploy_backend.py | Adds Modal deploy wrapper with pinned profile/workspace, readiness gating, canary, and incident stamping. |
| tests/scripts/test_deploy_backend_modal.py | Adds unit coverage for deploy wrapper behavior (workspace gating, command construction, stamping helpers). |
| scripts/modal/cleanup_wrong_workspace_deploy.py | Adds non-authoritative workspace cleanup tool (readiness gate + optional stop). |
| tests/scripts/test_cleanup_wrong_workspace_modal.py | Adds unit coverage for wrong-workspace cleanup behavior and stop command. |
| docs/observability/modal-v439-v440-serve-backend-api-crash-loop-2026-05-28.md | Adds incident note used by the deploy wrapper for stamping deploy history context. |


Comment thread scripts/modal/deploy_backend.py
Comment thread scripts/modal/deploy_backend.py Outdated
Comment thread scripts/modal/deploy_backend.py Outdated
`https://modal.com/apps/admin-56995/main/deployed/trr-backend-jobs`
- Readiness:
- `ok = true`
- `modal_workspace.workspace = admin-56995`
Comment thread tests/scripts/test_deploy_backend_modal.py

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/modal/api_canary.py`:
- Around line 13-17: The health_url function currently only checks for non-empty
base but doesn't validate the URL scheme; update health_url to parse api_web_url
(e.g., using urllib.parse.urlparse) and ensure the scheme is either 'https' or
'http' before returning the health path, raising a RuntimeError if the scheme is
missing/invalid; keep the existing trimming/rstrip behavior and build the
returned string from the validated base.

In `@scripts/modal/deploy_backend.py`:
- Around line 160-189: The function format_deploy_history_stamp directly indexes
canary['url'], canary['status'], and canary['attempt'] which can raise KeyError
if canary is missing those keys or is None; update format_deploy_history_stamp
to defensively read canary via safe access (e.g., check if canary is truthy and
use canary.get('url', '<unknown>') / canary.get('status', '<unknown>') /
canary.get('attempt', '<unknown>') or otherwise substitute a clear default like
'N/A'), and optionally add a short warning log when expected keys are absent so
the rest of the stamp generation always succeeds without exceptions.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5388d109-873f-4707-878e-cee9eaebaa38

📥 Commits

Reviewing files that changed from the base of the PR and between 21f774b and 525a343.

📒 Files selected for processing (8)
  • docs/observability/modal-v439-v440-serve-backend-api-crash-loop-2026-05-28.md
  • scripts/modal/api_canary.py
  • scripts/modal/cleanup_wrong_workspace_deploy.py
  • scripts/modal/deploy_backend.py
  • tests/scripts/test_cleanup_wrong_workspace_modal.py
  • tests/scripts/test_deploy_backend_modal.py
  • tests/utils/test_lazy_imports.py
  • trr_backend/utils/lazy_imports.py

Comment on lines +13 to +17
def health_url(api_web_url: str) -> str:
    base = str(api_web_url or "").strip().rstrip("/")
    if not base:
        raise RuntimeError("Modal readiness did not return api_web_url for cold-start canary.")
    return f"{base}/health"

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add URL scheme validation for defense-in-depth.

The health_url function validates that the base URL is non-empty but does not verify the URL scheme. While api_web_url is expected to come from Modal's readiness API in trusted contexts, adding scheme validation (ensuring https:// or http://) would prevent potential SSRF if the input source changes or is compromised.

🛡️ Proposed fix to validate URL scheme
 def health_url(api_web_url: str) -> str:
     base = str(api_web_url or "").strip().rstrip("/")
     if not base:
         raise RuntimeError("Modal readiness did not return api_web_url for cold-start canary.")
+    if not base.startswith(("https://", "http://")):
+        raise RuntimeError(f"Invalid URL scheme for cold-start canary: {base}")
     return f"{base}/health"
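
The agent prompt for this comment suggests `urllib.parse.urlparse` rather than a prefix check. A minimal sketch of that variant follows; the extra requirement that a host be present and the wording of the second error message are assumptions, not part of the proposed fix.

```python
from urllib.parse import urlparse


def health_url(api_web_url: str) -> str:
    # Same trimming as the original helper.
    base = str(api_web_url or "").strip().rstrip("/")
    if not base:
        raise RuntimeError("Modal readiness did not return api_web_url for cold-start canary.")
    parsed = urlparse(base)
    # urlparse lowercases the scheme, so "HTTPS://..." is accepted too.
    # Requiring a netloc also rejects bare hosts such as "example.modal.run".
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        raise RuntimeError(f"Invalid URL for cold-start canary: {base!r}")
    return f"{base}/health"
```

Compared with `startswith`, parsing rejects inputs like `file:///etc/passwd` and schemeless hosts with the same check, at the cost of one extra import.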

Comment thread scripts/modal/deploy_backend.py

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/modal/deploy_backend.py (1)

292-295: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Unguarded dict access can crash the deploy script if canary keys are missing.

Line 294 directly accesses canary['url'], canary['status'], and canary['attempt'] outside the try block (lines 281-291). If the readiness check returns api_canary with ok: true but missing these keys (e.g., due to an upstream API contract change in verify_modal_readiness.py), this will raise an unhandled KeyError and crash the deploy script.

The same issue was previously fixed in format_deploy_history_stamp at lines 164-166 by using .get() with defaults.

🛡️ Proposed fix to use safe dict access
     print(
         "Modal API cold-start canary passed: "
-        f"url={canary['url']} status={canary['status']} attempt={canary['attempt']}",
+        f"url={canary.get('url', '<unknown>')} status={canary.get('status', '<unknown>')} attempt={canary.get('attempt', '<unknown>')}",
         flush=True,
     )
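
One way to avoid repeating the `.get()` defaults at every call site is a small formatting helper, sketched below. The helper name `format_canary_summary` is hypothetical and does not appear in the PR; it also tolerates a `None` payload, which goes slightly beyond the proposed fix.

```python
from typing import Any, Mapping, Optional


def format_canary_summary(canary: Optional[Mapping[str, Any]]) -> str:
    # Treat a missing payload like an empty one so formatting never raises.
    data = canary or {}
    url = data.get("url", "<unknown>")
    status = data.get("status", "<unknown>")
    attempt = data.get("attempt", "<unknown>")
    return f"url={url} status={status} attempt={attempt}"


full = format_canary_summary(
    {"ok": True, "url": "https://example.invalid/health", "status": 200, "attempt": 1}
)
partial = format_canary_summary({"ok": True})
```

Both the success print and format_deploy_history_stamp could then share one code path, so a future change to the readiness payload cannot crash one and not the other.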
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/modal/deploy_backend.py` around lines 292 - 295, The print statement
that formats canary info uses direct dict access (canary['url'],
canary['status'], canary['attempt']) which can raise KeyError if keys are
missing; change those to safe accesses using canary.get('url', '<unknown>'),
canary.get('status', '<unknown>') and canary.get('attempt', '<unknown>')
(mirroring the safe access used in format_deploy_history_stamp) so the
deploy_backend.py output never crashes even if verify_modal_readiness.py omits
keys; update the same pattern anywhere else in this function that reads from the
canary dict.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e0863550-7fbe-42f0-8219-e3e2302fdcee

📥 Commits

Reviewing files that changed from the base of the PR and between 525a343 and 4c1c05c.

📒 Files selected for processing (4)
  • docs/observability/modal-v439-v440-serve-backend-api-crash-loop-2026-05-28.md
  • scripts/modal/api_canary.py
  • scripts/modal/deploy_backend.py
  • tests/scripts/test_deploy_backend_modal.py
✅ Files skipped from review due to trivial changes (1)
  • docs/observability/modal-v439-v440-serve-backend-api-crash-loop-2026-05-28.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/modal/api_canary.py
  • tests/scripts/test_deploy_backend_modal.py

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/auth.py`:
- Around line 55-67: The _local_internal_admin_proxy_allowed check currently
trusts a client-supplied marker header plus loopback addresses which allows
bypass; make this opt-in by requiring a server-controlled credential before
returning _build_local_internal_admin_identity(): add a server-side
secret/config (e.g. _LOCAL_INTERNAL_ADMIN_SECRET) and validate the incoming
marker header value against that secret (or validate a proxy-signed token) in
_local_internal_admin_proxy_allowed (and similar logic used around lines
~300-302) in addition to the loopback checks; only if the header matches the
configured secret and both client.host and Host are loopback should you allow
minting the local internal admin identity.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5aaad9f1-c17a-46a0-9312-152904ff8018

📥 Commits

Reviewing files that changed from the base of the PR and between 4c1c05c and 9ca1486.

📒 Files selected for processing (3)
  • api/auth.py
  • scripts/modal/deploy_backend.py
  • tests/api/test_auth.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/modal/deploy_backend.py

Comment thread api/auth.py
Comment on lines +55 to +67
def _local_internal_admin_proxy_allowed(request: Request) -> bool:
    if (request.headers.get(_LOCAL_INTERNAL_ADMIN_PROXY_HEADER) or "").strip().lower() not in {
        "1",
        "true",
        "yes",
        "on",
    }:
        return False
    client_host = request.client.host if request.client else ""
    if not _is_loopback_host(client_host):
        return False
    return _is_loopback_host(request.headers.get("host"))

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This opens an unauthenticated internal-admin path behind any loopback proxy.

This branch trusts a caller-set marker header plus request.client.host/Host to mint an internal_admin identity. If the app is ever fronted by a same-host reverse proxy, public requests will also arrive with a loopback client address, so forwarding x-trr-local-admin-proxy lets a remote caller bypass the JWT/shared-secret checks and even choose x-trr-admin-* identity fields. Please make this path opt-in and require a server-controlled credential before returning _build_local_internal_admin_identity().

Minimal hardening direction
 def _local_internal_admin_proxy_allowed(request: Request) -> bool:
+    if not _env_flag_strict("TRR_INTERNAL_ADMIN_ALLOW_LOCAL_LOOPBACK_PROXY", False):
+        return False
     if (request.headers.get(_LOCAL_INTERNAL_ADMIN_PROXY_HEADER) or "").strip().lower() not in {
         "1",
         "true",
         "yes",
         "on",
     }:
         return False
     client_host = request.client.host if request.client else ""
     if not _is_loopback_host(client_host):
         return False
-    return _is_loopback_host(request.headers.get("host"))
+    return _is_loopback_host(request.headers.get("host")) and _internal_admin_secret_matches(request)

Also applies to: 300-302
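
The hardening direction above can be sketched without any web framework, as below. This is an illustration under stated assumptions, not the code in api/auth.py: the header name `x-trr-internal-admin-secret`, the `enabled` and `expected_secret` parameters, and the plain-dict headers with lowercase keys are all hypothetical, and the real `_is_loopback_host`, `_env_flag_strict`, and `_internal_admin_secret_matches` helpers are not shown on this page.

```python
import hmac
import ipaddress
from typing import Mapping, Optional


def is_loopback_host(host: Optional[str]) -> bool:
    value = (host or "").strip().lower()
    if not value:
        return False
    # Strip an optional port: "[::1]:8000", "127.0.0.1:8000", "localhost:8000".
    if value.startswith("["):
        value = value[1:].split("]", 1)[0]
    elif value.count(":") == 1:
        value = value.split(":", 1)[0]
    if value == "localhost":
        return True
    try:
        return ipaddress.ip_address(value).is_loopback
    except ValueError:
        return False


def local_admin_proxy_allowed(
    headers: Mapping[str, str],
    client_host: Optional[str],
    *,
    enabled: bool,
    expected_secret: str,
) -> bool:
    # Opt-in: disabled unless the server is explicitly configured with a secret.
    if not enabled or not expected_secret:
        return False
    marker = headers.get("x-trr-local-admin-proxy", "").strip().lower()
    if marker not in {"1", "true", "yes", "on"}:
        return False
    if not is_loopback_host(client_host) or not is_loopback_host(headers.get("host")):
        return False
    # Hypothetical header carrying the server-controlled credential;
    # compare_digest avoids leaking the match position through timing.
    supplied = headers.get("x-trr-internal-admin-secret", "")
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

With this shape, a same-host reverse proxy that forwards the marker header still cannot mint an admin identity unless it also holds the server-side secret.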



2 participants