The Reality Report Backend Data Pipeline

The backend data pipeline behind The Reality Report's reality TV show and cast information. This system automatically collects, enriches, and curates comprehensive data about reality TV shows and their cast members from multiple sources.

🎯 Overview

The TRR Backend Data Pipeline is a Supabase-first data processing system that transforms raw data from APIs and web sources into a structured, production-ready dataset for The Reality Report platform. It supports both direct sync scripts and a resumable pipeline orchestrator.

Key Features

Supabase-first storage: Normalized core schema in Postgres
Multi-Source Data Collection: TMDb, IMDb, Fandom Wikis, Famous Birthdays
AI-Powered Enrichment: Gemini AI for text analysis and gap filling
Resumable orchestration: Pipeline run tracking + stage-level resume
Scalable Processing: Handles 10,000+ cast members and 1,000+ shows

🚀 Quick Start

Prerequisites

Python 3.11 or higher
Supabase project URL + Service Role key
API keys for TMDb, TVDB, IMDb, and Gemini AI
Optional: object storage credentials (OBJECT_STORAGE_*) for media mirroring

Installation

Clone the repository

git clone https://github.com/therealityreport/trr-backend.git
cd trr-backend

Install dependencies

python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Set up environment

# Copy example environment file
cp .env.example .env

# Edit .env with your API keys and credentials
nano .env

Add credentials

Set the runtime Postgres URL in .env: TRR_DB_URL (Supavisor session mode on pooler.supabase.com:5432)
Optional secondary DSN: set TRR_DB_FALLBACK_URL to another session-mode pooler URL. Both TRR_DB_URL (primary) and TRR_DB_FALLBACK_URL (secondary) must use the session pooler on port 5432 (pooler.supabase.com:5432)
Set auth/runtime secrets in .env: SUPABASE_JWT_SECRET, TRR_INTERNAL_ADMIN_SHARED_SECRET
Optional Supabase JWT overrides when the project ref cannot be derived cleanly: SUPABASE_PROJECT_REF and SUPABASE_JWT_ISSUER
Set API keys in .env: TMDB_BEARER_TOKEN (or TMDB_API_KEY), TVDB_API_KEY, IMDB_API_KEY, GEMINI_API_KEY
Optional object storage for hosted assets and mirroring: OBJECT_STORAGE_PROVIDER, OBJECT_STORAGE_BUCKET, OBJECT_STORAGE_REGION, OBJECT_STORAGE_ENDPOINT_URL, OBJECT_STORAGE_ACCESS_KEY_ID, OBJECT_STORAGE_SECRET_ACCESS_KEY, OBJECT_STORAGE_PUBLIC_BASE_URL
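The session-pooler requirement above can be checked before anything connects. The following is a minimal sketch, not the repo's own validation (make doctor may already cover this); the function names is_session_pooler_dsn and check_db_env are hypothetical.

```python
import os
from urllib.parse import urlparse

SESSION_POOLER_PORT = 5432
POOLER_HOST_SUFFIX = "pooler.supabase.com"


def is_session_pooler_dsn(dsn: str) -> bool:
    """True if the DSN points at a pooler.supabase.com host on port 5432."""
    parsed = urlparse(dsn)
    host = parsed.hostname or ""
    return host.endswith(POOLER_HOST_SUFFIX) and parsed.port == SESSION_POOLER_PORT


def check_db_env(env: dict[str, str]) -> list[str]:
    """Collect human-readable problems with the two runtime DSN variables."""
    problems = []
    primary = env.get("TRR_DB_URL", "")
    if not primary:
        problems.append("TRR_DB_URL is not set")
    elif not is_session_pooler_dsn(primary):
        problems.append("TRR_DB_URL must use pooler.supabase.com:5432")
    # The fallback is optional, so an empty value is accepted.
    fallback = env.get("TRR_DB_FALLBACK_URL", "")
    if fallback and not is_session_pooler_dsn(fallback):
        problems.append("TRR_DB_FALLBACK_URL must use pooler.supabase.com:5432")
    return problems


if __name__ == "__main__":
    for problem in check_db_env(dict(os.environ)):
        print(problem)
```

Run as a script, it prints one line per problem and nothing when both variables look correct.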

Verify environment

# Check that all dependencies are correctly installed
make doctor

🧰 DB Sync Scripts

These scripts read the list of shows from core.shows (Supabase) and update tables directly. They load .env from the repo root.

# Shows (metadata + entities + watch providers)
PYTHONPATH=. python scripts/sync/sync_shows_all.py --all --verbose

# Seasons + episodes
PYTHONPATH=. python scripts/sync/sync_seasons_episodes.py --all --verbose

# People + cast/credits
PYTHONPATH=. python scripts/sync/sync_people.py --all --verbose

# Show images
PYTHONPATH=. python scripts/sync/sync_show_images.py --all --verbose

# Season/episode images
PYTHONPATH=. python scripts/sync/sync_season_episode_images.py --all --verbose

# People photos (multi-source)
PYTHONPATH=. python scripts/sync/sync_cast_photos.py --imdb-person-id nm11883948 --verbose

# BRAVOTV image runs and gallery media
PYTHONPATH=. python scripts/media/bravotv_get_images.py --show "Summer House" --season 11 --output /tmp/bravotv-summer-house-s11
PYTHONPATH=. python scripts/media/bravotv_get_images.py --person "Amanda Batula" --sources all --output /tmp/bravotv-amanda-batula
PYTHONPATH=. python scripts/media/sync_bravotv_galleries.py --show "Summer House" --person-name "Amanda Batula" --apply

# TMDb resolution + backfill (shows)
PYTHONPATH=. python scripts/sync/resolve_tmdb_ids_via_find.py --all --verbose
PYTHONPATH=. python scripts/backfill/backfill_tmdb_show_details.py --all --verbose

# TMDb entities (networks, production companies) + object-storage logo mirroring
PYTHONPATH=. python scripts/sync/sync_tmdb_show_entities.py --all --verbose

# TMDb watch providers + object-storage logo mirroring
PYTHONPATH=. python scripts/sync/sync_tmdb_watch_providers.py --all --verbose

Legacy composite runner:

python -m scripts.sync_all_tables --all
python -m scripts.sync_all_tables --tables shows,episodes,episode_appearances --imdb-id tt1234567

Common filters: --show-id, --tmdb-id, --imdb-id, --limit, --dry-run, --verbose.

Media mirroring uses the backend OBJECT_STORAGE_* contract. Production defaults are Cloudflare R2 with OBJECT_STORAGE_PROVIDER=r2, OBJECT_STORAGE_REGION=auto, and a public base URL such as https://media.thereality.report. Optional: OBJECT_STORAGE_PROFILE. Runtime DB access uses TRR_DB_URL and optional TRR_DB_FALLBACK_URL. Tooling can still accept legacy DB envs where explicitly documented.

TMDb backfill flow: resolve missing tmdb_id via /find using IMDb ids, then backfill /tv/{id} details into core.shows (typed columns + tmdb_meta). Both scripts are idempotent; omit --all for incremental updates. See docs/architecture.md for the full TMDb enrichment pipeline documentation.

Incremental/resume flags: --incremental/--no-incremental, --resume/--no-resume, --force, --since. Incremental mode uses core.sync_state + shows.most_recent_episode to skip unchanged shows and retry failures. After seasons/episodes sync, shows.show_total_seasons is normalized to the count of seasons with season_number > 0. Per-show progress is stored in core.sync_state (one row per show + table).

🧭 Pipeline Orchestrator (Resumable)

The pipeline orchestrator records runs and stages in the pipeline schema and supports resume-by-hash.

python -m trr_backend.cli pipeline run --all --verbose
python -m trr_backend.cli pipeline list
python -m trr_backend.cli pipeline status <run-id>

See docs/architecture/pipeline.md for details.
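As a rough picture of what resume-by-hash can mean, the sketch below hashes each stage's name and parameters and skips stages whose hash was already recorded as complete. The actual hashing inputs and storage are defined by the orchestrator and docs/architecture/pipeline.md; stage_hash and stages_to_run are hypothetical names, and the stage names in the usage note are made up.

```python
import hashlib
import json


def stage_hash(stage_name: str, params: dict) -> str:
    """Stable fingerprint of a stage: same name + params always gives the same hash."""
    canonical = json.dumps(
        {"stage": stage_name, "params": params},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stages_to_run(planned: list[tuple[str, dict]], completed_hashes: set[str]) -> list[str]:
    """Names of planned stages whose fingerprint is not yet marked complete."""
    return [
        name for name, params in planned
        if stage_hash(name, params) not in completed_hashes
    ]
```

For example, if a run planned ("shows", {"all": True}) and ("people", {"all": True}) and only the first hash was recorded before a crash, resuming would run only "people". Changing a stage's parameters changes its hash, so that stage runs again.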

🔐 Security

Supabase JWT verification is local-only. The backend validates tokens with SUPABASE_JWT_SECRET, derives the expected project issuer from SUPABASE_PROJECT_REF or the Supabase/runtime URLs, and still accepts legacy service_role JWTs with iss="supabase" when the signature and ref match.

Never commit API keys, AWS credentials, or private keys. Rotate any exposed credentials immediately.

📦 Repo Layout

api/: FastAPI app (Supabase-backed API + WebSockets)
trr_backend/: Shared library code (reused by API + pipeline)
- trr_backend/integrations/: External metadata clients (IMDb/TMDb/etc.)
- trr_backend/media/bravotv/: BRAVOTV image-run pipeline, media candidate normalization, run review artifacts, and admin review action services
- trr_backend/pipeline/: Pipeline orchestration logic
- trr_backend/cli/: CLI entrypoints (Typer)
scripts/: Data sync scripts and utilities
supabase/: Database schema, migrations, and seeds
docs/: Architecture and operating docs

For detailed repository structure, module dependency graphs, and architecture diagrams, see docs/Repository/README.md.

📁 Architecture

The backend uses a Supabase database with data sync scripts that fetch from external APIs and populate the database.

Current Architecture:

Data stored in Supabase PostgreSQL (core.* schema)
Sync scripts in scripts/ fetch from TMDb, IMDb, Fandom wikis
Resumable pipeline orchestrator in trr_backend/pipeline/
FastAPI app in api/ serves data to the frontend

See docs/architecture.md for detailed architecture documentation.

Note: The legacy numbered pipeline (1-ShowInfo/, 2-CastInfo/, etc.) has been removed. Git history preserves these files if needed for reference. Current data ingestion uses the DB Sync Scripts documented above.

📊 Data Sources

Source	Purpose	Data Type	Rate Limits
TMDb API	Primary show and cast metadata	Shows, Cast, Episodes	40 requests/10s
IMDb API	Episode details and additional cast info	Episodes, Credits	1000 requests/day
Fandom Wikis	Reality show-specific cast details	Cast bios, Show info	Respectful scraping
Famous Birthdays	Biographical data	Birthdays, Zodiac signs	Rate limited
Gemini AI	Text analysis and gap filling	Guest names, Descriptions	1000 requests/minute
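One way to stay under a limit such as 40 requests per 10 seconds is a sliding-window limiter. The sketch below is a generic illustration, not the project's own throttling code; SlidingWindowLimiter is a hypothetical name, and the clock and sleep functions are injectable so the behavior can be tested without waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of period_s seconds."""

    def __init__(self, max_calls: int, period_s: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.period_s = period_s
        self._clock = clock
        self._calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if it is allowed now)."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period_s:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period_s - (now - self._calls[0])

    def acquire(self, sleep=time.sleep) -> None:
        """Block until a call is allowed, then record it."""
        while (delay := self.wait_time()) > 0:
            sleep(delay)
        self._calls.append(self._clock())


# Example: limiter = SlidingWindowLimiter(max_calls=40, period_s=10.0)
# Call limiter.acquire() before each TMDb request.
```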

🔧 Configuration

Environment Variables

Copy .env.example to .env (never commit .env):

# API Keys
TMDB_BEARER_TOKEN=your_tmdb_bearer_token
TMDB_API_KEY=your_tmdb_api_key
TVDB_API_KEY=your_tvdb_api_key
IMDB_API_KEY=your_imdb_api_key
GEMINI_API_KEY=your_gemini_api_key

# Supabase Configuration
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
TRR_DB_URL=postgresql://postgres.<project>:password@aws-1-us-east-1.pooler.supabase.com:5432/postgres
TRR_DB_FALLBACK_URL=
# Optional secondary DSN. Must also be a session-mode pooler URL (pooler.supabase.com:5432).

# Optional Configuration
REALITEASE_TMDB_SHOW_LIMIT=5
GEMINI_MODEL=gemini-2.5-flash
GEMINI_MODEL_FAST=gemini-2.5-flash
GEMINI_MODEL_PRO=gemini-2.5-pro
GOOGLE_GEMINI_MODEL=gemini-2.5-flash
# Deprecated fallback alias (temporary)
GEMINI-MODEL=gemini-2.5-flash

# Legacy Google Sheets (archived)
SPREADSHEET_NAME=Realitease2025Data
GOOGLE_APPLICATION_CREDENTIALS=keys/service-account.json
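Since either TMDB_BEARER_TOKEN or TMDB_API_KEY may be set, a client has to pick one. The sketch below shows one plausible rule, preferring the bearer token. That precedence is an assumption and tmdb_auth is a hypothetical helper. TMDb accepts a bearer token in the Authorization header or an api_key query parameter.

```python
def tmdb_auth(env: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Return (headers, query_params) to attach to a TMDb request."""
    token = env.get("TMDB_BEARER_TOKEN", "").strip()
    if token:
        # Bearer token goes in the Authorization header.
        return {"Authorization": f"Bearer {token}"}, {}
    api_key = env.get("TMDB_API_KEY", "").strip()
    if api_key:
        # API key is sent as a query parameter instead.
        return {}, {"api_key": api_key}
    raise RuntimeError("Set TMDB_BEARER_TOKEN or TMDB_API_KEY")
```

In practice the function would be called as tmdb_auth(dict(os.environ)) after .env has been loaded.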

Supabase Structure

For the authoritative schema, see docs/db/schema.md and docs/architecture.md.

🛠️ Development

Running Individual Steps

Each sync stage can be run independently with various options:

# Import shows from lists
PYTHONPATH=. python scripts/import/import_shows_from_lists.py --imdb-list ... --tmdb-list ...

# Enrich shows (TMDb metadata + entities + providers)
PYTHONPATH=. python scripts/sync/sync_shows_all.py --all --verbose

# Seasons + episodes
PYTHONPATH=. python scripts/sync/sync_seasons_episodes.py --all --verbose

# People + cast
PYTHONPATH=. python scripts/sync/sync_people.py --all --verbose

Monitoring Progress

Logs: Check ../artifacts/trr-backend/logs/ (or the logs symlink) for execution logs and results
Dry Run: Use --dry-run flag to preview changes before writing
Limited Processing: Use --limit to test with smaller datasets
Progress Tracking: Most scripts provide detailed progress output

Caching

The system includes comprehensive caching for efficiency:

Gemini Responses: Cached in ../artifacts/trr-backend/.cache/ (or the .cache symlink)
TMDb API: Per-session caching to respect rate limits
IMDb Data: Local caching for repeated requests
Custom Cache: Use --cache-file to specify custom locations
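To make the response-caching idea concrete, here is a minimal file-backed cache keyed by a hash of the model name and prompt. It is a sketch under assumptions: the real cache layout under .cache/ is not documented here, and JsonFileCache and its methods are hypothetical.

```python
import hashlib
import json
from pathlib import Path


class JsonFileCache:
    """Tiny JSON-file cache: one file, keyed by a hash of (model, prompt)."""

    def __init__(self, path):
        self.path = Path(path)
        self._data = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def key_for(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str, compute):
        """Return the cached value, calling compute(prompt) only on a miss."""
        key = self.key_for(model, prompt)
        if key not in self._data:
            self._data[key] = compute(prompt)
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self._data))  # persist after each miss
        return self._data[key]
```

Because the model name is part of the key, switching GEMINI_MODEL does not return answers cached from a different model.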

📈 Data Flow

External APIs → Ingestion Scripts → Supabase → API/Exports
     ↓                 ↓             ↓          ↓
TMDb/IMDb/Fandom   import/sync     core.*     FastAPI + S3

🔍 Quality Assurance

Data Validation

Type Checking: Validates data types and formats
Required Fields: Ensures all required fields are present
Consistency Checks: Verifies data consistency across sources
Conflict Resolution: Handles data conflicts intelligently

Error Handling

Comprehensive Logging: Detailed logs for all operations
Graceful Failures: Handles API failures without data loss
Retry Logic: Automatic retry for transient failures
Recovery Mechanisms: Data recovery capabilities
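Automatic retry for transient failures is commonly done with exponential backoff. The following is a generic sketch rather than the project's retry code; TransientError and with_retries are hypothetical names, and the delays (1s, 2s, 4s, ...) are illustrative defaults.

```python
import time


class TransientError(Exception):
    """Marker for failures worth retrying (timeouts, HTTP 429/5xx, and similar)."""


def with_retries(fn, attempts: int = 3, base_delay_s: float = 1.0, sleep=time.sleep):
    """Call fn(); on TransientError, wait base_delay_s * 2**n and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Only TransientError is retried. Any other exception propagates immediately, so real bugs are not hidden behind retries.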

Performance Monitoring

Processing Speed: 1,000+ records per hour
Error Rates: < 1% processing errors
API Compliance: 100% rate limit compliance
Resource Usage: Efficient memory and CPU usage

📂 Project Structure

TRR-Backend/
├── api/                       # FastAPI application
├── trr_backend/               # Shared library code
│   ├── db/                    # Database utilities
│   ├── integrations/          # External API clients (IMDb, TMDb, etc.)
│   ├── ingestion/             # Data ingestion modules
│   └── repositories/          # Database access layer
├── scripts/                   # Data sync and utility scripts
│   ├── dev/                   # Development tools (doctor.py)
│   ├── db/                    # Database SQL scripts
│   └── supabase/              # Supabase-specific utilities
├── supabase/                  # Database schema and migrations
│   ├── migrations/            # SQL migrations
│   └── schema_docs/           # Auto-generated schema documentation
├── tests/                     # Test suite
├── docs/                      # Documentation
├── requirements.txt           # Python dependencies
└── Makefile                   # Common development tasks

🚀 Deployment

Local Development

# Install dependencies
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Set up environment
cp .env.example .env
# Edit .env with your credentials

# Verify setup
make doctor

# Run tests
pytest

Cloud Deployment

See docs/cloud/ for detailed cloud deployment guides.

🔐 Security

Rotate Secrets: This repo previously tracked a .env file. Assume any keys in it are compromised and rotate them.
Never Commit .env: Local .env files are gitignored; use .env.example as the template.
Credentials: Keep service account JSONs under keys/ (gitignored) or inject via CI secrets.
Generated Output: Runtime artifacts live outside the repo root in ../artifacts/trr-backend/ (e.g. logs/, .cache/, debug_html/, out/). Use symlinks in the repo root for convenience.
Optional History Purge: If you need to remove leaked secrets from git history, rotate keys first, then use a history-rewrite tool and force-push.

📝 Documentation

PRD: See PRD.md for comprehensive product requirements
Architecture: See docs/architecture.md for a high-level system overview
Pipeline Orchestration: See docs/architecture/pipeline.md for staged runs and resume logic
DB Schema: See docs/db/schema.md for core tables and views
Setup Guides: See docs/cloud/ for deployment documentation
Local Development: See docs/README_local.md for additional setup notes
API Docs UI: See docs/api/run.md for /docs, /redoc, and /openapi.json
Legacy Google Sheets Pipeline: See docs/legacy/google_sheets_pipeline.md

🤝 Contributing

See CONTRIBUTING.md.

Fork the repository
Make your changes (the default branch is main; this repo does not enforce a branch naming convention)
Commit your changes (git commit -m 'Describe your change')
Push your changes
Open a Pull Request

Development Guidelines

Follow Python PEP 8 style guidelines
Add comprehensive error handling
Include detailed logging
Test with dry-run mode first
Update documentation for new features

📊 Performance Metrics

Data Volume: 10,000+ cast members, 1,000+ shows
Processing Speed: 1,000+ records per hour
Accuracy: 98%+ accuracy compared to source APIs
Uptime: 99%+ availability
Error Rate: < 1% processing errors

🐛 Troubleshooting

Common Issues

Supabase Connectivity

# Verify environment and connectivity
make doctor

Schema Cache Issues

# Reload PostgREST schema cache
bash scripts/reload_postgrest_schema.sh

Data Parity Checks

# Validate credits/media parity
PYTHONPATH=. python scripts/verify/verify_credits_parity.py
PYTHONPATH=. python scripts/verify/verify_media_unification.py

Getting Help

Issues: Create an issue on GitHub
Documentation: Check docs/ directory
Logs: Review logs/ directory for error details
Dry Run: Use --dry-run flag to test changes

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

TMDb: For comprehensive movie and TV database
IMDb: For detailed episode and cast information
Google: For Gemini AI
Fandom: For reality TV show wikis and community data

The Reality Report Backend Data Pipeline - Transforming reality TV data into actionable insights.

For questions or support, please open an issue or contact the development team.
