Take over backend social scraper and Modal guardrail updates#146

Merged: therealityreport merged 26 commits into main from codex/publish/trr-backend/20260616-takeover-all on Jun 16, 2026
@therealityreport

Owner

## Summary
- Captures backend SocialBlade, Chrome profile preflight, scraper, Modal guardrail, and social ingestion updates from the takeover branch set.
- Includes fixes for unit-test DB isolation, explicit comment-anchor launch auth probing, Modal maintenance owner env isolation, and Bravo core account expectations.

## Local validation
- PASS: previously failing backend subset, 23 tests passed.
- PARTIAL: targeted backend suite reached 480 passed before interrupting an unbounded live Postgres wait; no failures had appeared.
- PASS: Modal follow-through was already completed for the SocialBlade/profile slice before PR orchestration, including API canary and readiness probe.

## Notes
- SQL ownership changed through included migrations; ledger/inventory should be reviewed with the backend diff.
- The broad local backend command was interrupted only because it was blocked inside a live Postgres cursor execution.

- Decodo/auth public-only HARD GUARD (assert_public_comments_isolation + job_runner chokepoint)
- bug#1 foundation: counts.expected_reply_count + public child-fetch decision wired
- bug#4: unknown-count + zero-recovered no longer silently marked complete
- 23 new/relevant unit tests green (isolation/counts/completeness/proxy)

NOTE: bundles pre-existing in-progress public-comments WIP already present in the
working tree (not authored this session); committed on a feature branch as a
recoverable baseline before subagent-driven implementation of the full plan.
…ils' into codex/publish/trr-backend/20260616-takeover-all
…dex/publish/trr-backend/20260616-takeover-all
WS1a (fetcher.py, counts.py):
- T3 zero-reply-probe limit (env SOCIAL_INSTAGRAM_PUBLIC_COMMENTS_ZERO_REPLY_PROBE_LIMIT, default 0 = skip reply-less parents) — biggest public-lane per-post saving
- T4 public fast-fail timeouts (post 20s / child 10s, independent of auth deadlines)
- C1 bug#1: authenticated reply gates + fetch target + terminal clamp use expected_reply_count
- C2 bug#5/#9: 429 sleeps max(backoff,cooldown); cooldown record wrapped in except OSError
- C3 bug#10b: memory guardrail <= -> < (inclusive cap)
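
The C2 item above (sleep for the longer of backoff and cooldown on a 429, and never let a failed cooldown write abort the scrape) could look roughly like the following. This is a minimal sketch, not the code in fetcher.py: the function names, the file-based cooldown marker, and the injectable `sleep`/`now` parameters are all assumptions made for illustration.

```python
import time
from pathlib import Path


def sleep_seconds_for_429(backoff_s: float, cooldown_s: float) -> float:
    """Wait for whichever is longer: the retry backoff or the cooldown."""
    return max(backoff_s, cooldown_s)


def record_cooldown(path: Path, until_epoch: float) -> bool:
    """Best-effort cooldown marker (hypothetical file format).

    Returns False instead of raising when the filesystem write fails.
    """
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{until_epoch:.0f}\n", encoding="utf-8")
        return True
    except OSError:
        return False


def handle_429(backoff_s, cooldown_s, cooldown_path, sleep=time.sleep, now=time.time):
    """Record the cooldown, then sleep max(backoff, cooldown) seconds."""
    wait = sleep_seconds_for_429(backoff_s, cooldown_s)
    recorded = record_cooldown(cooldown_path, now() + cooldown_s)
    sleep(wait)
    return wait, recorded
```

Passing `sleep` and `now` as parameters keeps the sketch testable without real waiting; production code would use the defaults.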

WS3 (persistence.py, social_season_analytics_impl.py):
- bug#3: _build_upsert_update_clause + coalesce_preserve_cols; author/url cols COALESCE-preserved (no NULL clobber on metadata-poor re-scrape)
- bug#10c: no-season reply parent-link seeded from DB + depth-ordered upsert
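
To illustrate the bug#3 fix above, here is one way an upsert SET clause could preserve selected columns with COALESCE so that a metadata-poor re-scrape does not overwrite stored values with NULL. It is a sketch under assumptions: only the names `_build_upsert_update_clause` and `coalesce_preserve_cols` come from the commit message; the signature, the `t` table alias, and the column names are hypothetical.

```python
from typing import Iterable, Sequence


def build_upsert_update_clause(
    columns: Sequence[str],
    conflict_cols: Iterable[str],
    coalesce_preserve_cols: Iterable[str] = (),
    table_alias: str = "t",
) -> str:
    """Build the SET list for INSERT ... AS t ... ON CONFLICT DO UPDATE.

    Columns listed in coalesce_preserve_cols keep the stored value whenever
    the incoming (EXCLUDED) value is NULL.
    """
    conflict = set(conflict_cols)
    preserve = set(coalesce_preserve_cols)
    parts = []
    for col in columns:
        if col in conflict:
            continue  # conflict-key columns are never updated
        if col in preserve:
            parts.append(f"{col} = COALESCE(EXCLUDED.{col}, {table_alias}.{col})")
        else:
            parts.append(f"{col} = EXCLUDED.{col}")
    return ", ".join(parts)


# Hypothetical column names, for illustration only.
clause = build_upsert_update_clause(
    columns=["comment_id", "text", "author_username"],
    conflict_cols=["comment_id"],
    coalesce_preserve_cols=["author_username"],
)
```

The alias assumes the statement is written as `INSERT INTO <table> AS t`, which PostgreSQL allows; column names here are trusted identifiers, not user input.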

Tests: reconciled 3 status-only tests to new T3/C3 contract + added default-skip test;
new env-knob + upsert-clause unit tests. Full comments_scrapling suite 89 passed; the 5
job_runner failures are pre-existing DB-unavailable env failures (verified independent of
these changes by revert-retest).
Copilot AI review requested due to automatic review settings June 16, 2026 13:01
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Important

Review skipped

Too many files!

This PR contains 168 files, which is 18 over the limit of 150.

To get a review, narrow the scope:
• coderabbit review --type committed # exclude uncommitted changes
• coderabbit review --dir # limit to a subdirectory
• coderabbit review --base # compare against a closer base

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6e336c56-80d3-487f-b642-af6063574893

📥 Commits

Reviewing files that changed from the base of the PR and between 21f774b and 65233a1.

📒 Files selected for processing (168)
  • .env.example
  • .tmp-backfill-monitor.py
  • api/auth.py
  • api/main.py
  • api/routers/admin_show_sync.py
  • api/routers/socials/__init__.py
  • docs/observability/modal-v439-v440-serve-backend-api-crash-loop-2026-05-28.md
  • docs/workspace/instagram-comments-scrapling.md
  • requirements.lock.txt
  • requirements.modal.browser.lock.txt
  • requirements.modal.lean.lock.txt
  • requirements.modal.vision.lock.txt
  • scripts/db/social_control_plane_pressure_snapshot.py
  • scripts/dev/smoke_decodo_residential_proxy.py
  • scripts/dev/verify_external_runtime_contracts.py
  • scripts/getty_login_headed.py
  • scripts/getty_scrape_job.py
  • scripts/modal/api_canary.py
  • scripts/modal/cleanup_wrong_workspace_deploy.py
  • scripts/modal/deploy_backend.py
  • scripts/modal/deploy_sync_fixes_clean.py
  • scripts/modal/diagnose_instagram_comments_remote.py
  • scripts/modal/prepare_named_secrets.py
  • scripts/modal/reconcile_modal_runtime.py
  • scripts/modal/refresh_instagram_cookies_from_chrome.py
  • scripts/modal/render_cutover_commands.py
  • scripts/modal/repair_instagram_auth.py
  • scripts/modal/sync_fix_deploy_paths.json
  • scripts/modal/verify_instagram_posts_auth.py
  • scripts/modal/verify_instagram_public_history.py
  • scripts/modal/verify_modal_readiness.py
  • scripts/socials/instagram/enqueue_comments_audit_cursor_retries.py
  • scripts/socials/instagram/interactive_login.py
  • scripts/socials/instagram/probe_instagram_public_mode.py
  • scripts/socials/instagram/rebuild_comment_rollups.py
  • scripts/socials/instagram/smoke_posts_scrapling.py
  • scripts/socials/reddit/__init__.py
  • scripts/socials/reddit/smoke_bravo_refresh.py
  • scripts/socials/smoke_tiktok_youtube_posts.py
  • scripts/socials/verify_shared_account_save_proof.py
  • scripts/sync/episode_id_reconciliation.py
  • scripts/sync/sync_episode_appearances.py
  • scripts/sync/sync_seasons_episodes.py
  • scripts/sync/sync_show_cast.py
  • supabase/migrations/20260610173000_social_comment_source_account_indexes.sql
  • supabase/migrations/20260610190000_instagram_post_comment_rollups.sql
  • supabase/migrations/20260610203500_admin_runtime_settings.sql
  • tests/api/routers/test_admin_person_images.py
  • tests/api/routers/test_admin_show_sync.py
  • tests/api/routers/test_socials_season_analytics.py
  • tests/api/routers/test_socials_twitter_admin_routes.py
  • tests/api/test_admin_socials_landing_summary.py
  • tests/api/test_auth.py
  • tests/api/test_health.py
  • tests/api/test_screenalytics_ingest_endpoints.py
  • tests/api/test_screenalytics_runs_v2.py
  • tests/db/test_session.py
  • tests/fixtures/admin_socials_route_shape.json
  • tests/integrations/imdb/test_fullcredits_cast_parser.py
  • tests/integrations/imdb/test_title_metadata_client.py
  • tests/integrations/test_getty_local_prefetch.py
  • tests/repositories/test_admin_reddit_sources.py
  • tests/repositories/test_admin_show_reads_repository.py
  • tests/repositories/test_build_upsert_update_clause.py
  • tests/repositories/test_episodes.py
  • tests/repositories/test_reddit_flair_categorizer.py
  • tests/repositories/test_reddit_refresh.py
  • tests/repositories/test_social_control_plane_dispatch_runtime.py
  • tests/repositories/test_social_dispatch_stage_claims.py
  • tests/repositories/test_social_season_analytics.py
  • tests/scripts/test_cleanup_wrong_workspace_modal.py
  • tests/scripts/test_deploy_backend_modal.py
  • tests/scripts/test_deploy_sync_fixes_clean.py
  • tests/scripts/test_enqueue_comments_audit_cursor_retries.py
  • tests/scripts/test_episode_id_reconciliation.py
  • tests/scripts/test_instagram_posts_scrapling_smoke.py
  • tests/scripts/test_modal_auth_recovery_profile_pinning.py
  • tests/scripts/test_prepare_named_secrets.py
  • tests/scripts/test_smoke_decodo_residential_proxy.py
  • tests/scripts/test_sync_episode_appearances_season_coverage.py
  • tests/scripts/test_sync_seasons_episodes.py
  • tests/scripts/test_verify_external_runtime_contracts.py
  • tests/scripts/test_verify_modal_readiness.py
  • tests/socials/instagram/comments_scrapling/test_completeness_unknown_count.py
  • tests/socials/instagram/comments_scrapling/test_counts_expected_reply.py
  • tests/socials/instagram/comments_scrapling/test_expected_count_degraded.py
  • tests/socials/instagram/comments_scrapling/test_fetcher_status_only.py
  • tests/socials/instagram/comments_scrapling/test_persistence.py
  • tests/socials/instagram/comments_scrapling/test_proxy.py
  • tests/socials/instagram/comments_scrapling/test_public_blocked_pause.py
  • tests/socials/instagram/comments_scrapling/test_public_decodo_isolation.py
  • tests/socials/instagram/comments_scrapling/test_public_relay_env_knobs.py
  • tests/socials/instagram/posts_scrapling/test_fetcher.py
  • tests/socials/instagram/posts_scrapling/test_job_runner.py
  • tests/socials/instagram/posts_scrapling/test_proxy.py
  • tests/socials/instagram/test_instagram_public_mode.py
  • tests/socials/instagram/test_instagram_public_post_extractor.py
  • tests/socials/instagram/test_instagram_public_probe_script.py
  • tests/socials/instagram/test_network_policy.py
  • tests/socials/test_cookie_refresh_flows.py
  • tests/socials/test_instagram_comments_date_window.py
  • tests/socials/test_instagram_comments_progress_payload.py
  • tests/socials/test_instagram_comments_scrapling.py
  • tests/socials/test_instagram_comments_scrapling_pagination.py
  • tests/socials/test_instagram_comments_scrapling_retry.py
  • tests/socials/test_socialblade_auth.py
  • tests/socials/test_socialblade_scraper.py
  • tests/socials/threads/posts_scrapling/test_job_runner.py
  • tests/socials/threads/posts_scrapling/test_persistence.py
  • tests/socials/tiktok/posts_scrapling/test_fetcher.py
  • tests/socials/tiktok/test_tiktok_direct_scrape.py
  • tests/socials/youtube/test_youtube_direct_scrape.py
  • tests/test_modal_dispatch.py
  • tests/test_modal_jobs.py
  • tests/utils/test_lazy_imports.py
  • trr_backend/db/session.py
  • trr_backend/integrations/bravo_jsonapi.py
  • trr_backend/integrations/getty_local_prefetch.py
  • trr_backend/integrations/imdb/fullcredits_cast_parser.py
  • trr_backend/integrations/imdb/title_metadata_client.py
  • trr_backend/integrations/tmdb/client.py
  • trr_backend/modal_dispatch.py
  • trr_backend/modal_jobs.py
  • trr_backend/repositories/admin_reddit_sources.py
  • trr_backend/repositories/admin_runtime_settings.py
  • trr_backend/repositories/admin_show_reads.py
  • trr_backend/repositories/episodes.py
  • trr_backend/repositories/people.py
  • trr_backend/repositories/reddit_flair_categorizer.py
  • trr_backend/repositories/reddit_refresh.py
  • trr_backend/socials/account_browser_sessions.py
  • trr_backend/socials/browser_cookie_refresh.py
  • trr_backend/socials/control_plane/dispatch_runtime.py
  • trr_backend/socials/control_plane/run_lifecycle.py
  • trr_backend/socials/facebook/cookie_refresh.py
  • trr_backend/socials/instagram/auth_cooldown.py
  • trr_backend/socials/instagram/catalog_ingest.py
  • trr_backend/socials/instagram/comments_scrapling/counts.py
  • trr_backend/socials/instagram/comments_scrapling/fetcher.py
  • trr_backend/socials/instagram/comments_scrapling/job_runner.py
  • trr_backend/socials/instagram/comments_scrapling/persistence.py
  • trr_backend/socials/instagram/comments_scrapling/proxy.py
  • trr_backend/socials/instagram/comments_scrapling/public_mode.py
  • trr_backend/socials/instagram/network_policy.py
  • trr_backend/socials/instagram/permalink_metadata.py
  • trr_backend/socials/instagram/posts_scrapling/fetcher.py
  • trr_backend/socials/instagram/posts_scrapling/job_runner.py
  • trr_backend/socials/instagram/posts_scrapling/proxy.py
  • trr_backend/socials/instagram/public_post_extractor.py
  • trr_backend/socials/instagram/public_probe.py
  • trr_backend/socials/instagram/scraper.py
  • trr_backend/socials/media_url_safety.py
  • trr_backend/socials/pipelines/account_catalog/launch.py
  • trr_backend/socials/pipelines/comments/__init__.py
  • trr_backend/socials/pipelines/comments/instagram.py
  • trr_backend/socials/read_models/account_profile/common.py
  • trr_backend/socials/social_season_analytics_impl.py
  • trr_backend/socials/socialblade/auth.py
  • trr_backend/socials/socialblade/scraper.py
  • trr_backend/socials/threads/cookie_refresh.py
  • trr_backend/socials/threads/posts_scrapling/job_runner.py
  • trr_backend/socials/threads/posts_scrapling/persistence.py
  • trr_backend/socials/threads/scraper.py
  • trr_backend/socials/tiktok/posts_scrapling/fetcher.py
  • trr_backend/socials/tiktok/scraper.py
  • trr_backend/socials/twitter/scraper.py
  • trr_backend/socials/youtube/scraper.py
  • trr_backend/utils/lazy_imports.py

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@github-actions

Contributor

Codex Exhaustive Code Review

Findings

  1. Critical - header-only internal-admin bypass can grant admin access
    api/auth.py accepts x-trr-local-admin-proxy plus a loopback client.host and Host as sufficient for InternalAdminUser, and then builds the identity from caller-controlled x-trr-admin-* headers in the same file. Any reverse proxy, container, or ASGI front end that presents traffic as loopback or normalizes Host to loopback can turn externally spoofed headers into full internal admin access across many routes. The smallest fix is to remove this bypass, or to gate it behind an explicit dev-only env flag that cannot be set in production while still requiring a signed internal token or shared secret.

  2. High - Supabase migration will fail in transactional migration/reset workflows
    supabase/migrations/20260610173000_social_comment_source_account_indexes.sql uses CREATE INDEX CONCURRENTLY throughout the checked-in migration. Postgres rejects concurrent index creation inside a transaction, and this repo already documents that checked-in Supabase migrations intentionally strip CONCURRENTLY for reset/push workflows at supabase/migrations/20260420180200_fk_index_hardening_wave_1.sql. Fix by using non-concurrent CREATE INDEX IF NOT EXISTS in the migration, with any production concurrent apply handled separately outside the Supabase transaction.

  3. High - new rollup trigger can break comment writes because the rollup table has no service-role privileges
    supabase/migrations/20260610190000_instagram_post_comment_rollups.sql creates social.instagram_post_comment_rollups, and its trigger calls a function that inserts into and updates that table. Unlike adjacent social migrations, it never grants service_role privileges. Once the same migration installs the trigger on social.instagram_comments, normal backend comment upserts can fail with permission errors on the triggered rollup write. Add an explicit grant all privileges on table social.instagram_post_comment_rollups to service_role, and either enable RLS with a defined policy or make the function SECURITY DEFINER with a pinned search_path if it is intended to run independently of the caller's table grants.

  4. High - public Instagram comment jobs can finish as completed while retryable gaps remain
    Public comments mode is the default when no explicit mode or env is set (trr_backend/socials/instagram/comments_scrapling/public_mode.py), and the launch path defaults to public_relay (trr_backend/socials/pipelines/comments/instagram.py). When public mode finds retryable incomplete targets, the runner in trr_backend/socials/instagram/comments_scrapling/job_runner.py records public_comments_requires_approval but does not raise or requeue, then calls finish_job(..., status="completed"). Downstream status consumers will treat incomplete comment coverage as successful. Fix by marking these jobs/runs blocked or failed-with-action, or by enqueueing an authenticated recovery path; do not report completed while retryable targets remain.

  5. Medium - rollup trigger adds row-by-row full recount write amplification
    The rollup refresh in supabase/migrations/20260610190000_instagram_post_comment_rollups.sql runs count(*) over all comments for the post, and the trigger in the same migration fires for each row. The persistence path batch-upserts many comments via _pg_upsert_many in trr_backend/socials/instagram/comments_scrapling/persistence.py, so a large post can trigger repeated full recounts during one scrape. Use a statement-level trigger with transition tables, an incremental counter update, or a deferred/batched rebuild after persistence.

  6. Medium - admin_runtime_settings service-role policy is ineffective without table privileges
    supabase/migrations/20260610203500_admin_runtime_settings.sql enables RLS and creates a service_role policy, but never grants SQL privileges on the new table. RLS policies do not grant access by themselves, so service-role/PostgREST access can still fail. Add grant all privileges on table core.admin_runtime_settings to service_role and explicit revokes for public, anon, authenticated if it is intended to stay backend-only.
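
For finding 1, the review offers two fixes: remove the bypass, or gate it behind a dev-only flag plus a shared secret. The sketch below illustrates the second option only. Every name in it (the `TRR_ENV`, `TRR_ALLOW_LOCAL_ADMIN_PROXY`, and `TRR_LOCAL_ADMIN_PROXY_SECRET` variables, and the `x-trr-local-admin-proxy-token` header) is hypothetical and not taken from api/auth.py.

```python
import hmac
import os
from typing import Mapping

LOOPBACK_HOSTS = {"127.0.0.1", "::1"}


def local_admin_proxy_allowed(
    client_host: str,
    headers: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Allow the local admin proxy path only when every gate passes."""
    # Gate 1: explicit development environment; anything else is treated as production.
    if env.get("TRR_ENV", "production") != "development":
        return False
    # Gate 2: explicit opt-in flag.
    if env.get("TRR_ALLOW_LOCAL_ADMIN_PROXY") != "1":
        return False
    # Gate 3: a configured shared secret; an empty secret disables the path.
    secret = env.get("TRR_LOCAL_ADMIN_PROXY_SECRET", "")
    if not secret:
        return False
    # Gate 4: loopback peer. On its own this is spoofable behind a proxy,
    # which is why the secret check below is still required.
    if client_host not in LOOPBACK_HOSTS:
        return False
    presented = headers.get("x-trr-local-admin-proxy-token", "")
    # Constant-time comparison to avoid leaking the secret through timing.
    return hmac.compare_digest(presented.encode("utf-8"), secret.encode("utf-8"))
```

The point of the design is that caller-controlled headers and a loopback address are never sufficient by themselves; a request must also present a secret that only the trusted local proxy holds.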

Validation

I did not modify files. I ran the requested diff boundary commands and targeted inspection. git diff --check fails with two whitespace warnings: blank line at EOF in tests/socials/instagram/test_network_policy.py and trr_backend/socials/instagram/network_policy.py. No test suite was run.

…etadata

Completes WS2's bug#2 wiring: fetch_comments_for_shortcode now accepts the
degraded-expected-counts signal threaded from the job runner and stamps
result.diagnostic_metadata['expected_count_unknown']=True, which the
completeness guard reads to keep the post retryable. Thin wrapper delegates to
_fetch_comments_for_shortcode_impl so all return paths are covered.
Suite: 100 passed, 5 pre-existing DB-unavailable failures (unchanged).
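
The thin-wrapper pattern described in this commit message can be sketched as follows. Only `fetch_comments_for_shortcode`, `_fetch_comments_for_shortcode_impl`, and the `expected_count_unknown` metadata key come from the message; the `FetchResult` shape, the `expected_counts_degraded` parameter name, and the guard logic are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FetchResult:
    """Hypothetical stand-in for the real fetch result type."""
    shortcode: str
    recovered_count: int = 0
    diagnostic_metadata: Dict[str, Any] = field(default_factory=dict)


def _fetch_comments_for_shortcode_impl(shortcode: str) -> FetchResult:
    # Placeholder for the real fetch logic, which may have many return paths.
    return FetchResult(shortcode=shortcode)


def fetch_comments_for_shortcode(
    shortcode: str, *, expected_counts_degraded: bool = False
) -> FetchResult:
    """Delegate to the impl, then stamp the flag once so every return path is covered."""
    result = _fetch_comments_for_shortcode_impl(shortcode)
    if expected_counts_degraded:
        result.diagnostic_metadata["expected_count_unknown"] = True
    return result


def is_retryable_incomplete(result: FetchResult) -> bool:
    """Assumed guard: unknown expected count plus zero recovered stays retryable."""
    unknown = result.diagnostic_metadata.get("expected_count_unknown") is True
    return unknown and result.recovered_count == 0
```

Stamping in the wrapper rather than at each return statement inside the impl means a newly added return path cannot silently skip the flag.
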
- _normalize_comment_date_window (ISO 8601, inclusive start / exclusive end, UTC)
  + _comment_date_window_predicate helpers
- thread date_start/date_end through preview, target/incomplete shortcodes, start,
  resume; store date_start/date_end/target_window in run + shard job config
- inject 'p.posted_at >= %s AND p.posted_at < %s' into owner + collaborator/catalog target SQL
  (unchanged when no window) so enumeration only scans the requested window
- request schema: date_start/date_end/comments_worker_count/comments_target_batch_size
- 11 new date-window unit tests; 43 socials route tests pass

Owner Author

Resolved the Codex exhaustive review blockers in 65233a1: removed the header-only local admin proxy bypass, made the checked-in index migration transactional, added service-role grants/RLS for the new settings and rollup tables, converted the Instagram comment rollup refresh to statement-level transition-table triggers, and changed public-comment approval gaps to fail with action metadata instead of completing. Local validation: git diff --check, tests/api/test_auth.py, focused comments scraper tests, and the broader backend regression subset (266 passed). GitHub CI is green on the updated head.

@therealityreport therealityreport merged commit cb5efbf into main Jun 16, 2026
5 checks passed