The Reality Report backend data pipeline for reality TV show and cast information. This system automatically collects, enriches, and curates comprehensive data about reality TV shows and their cast members from multiple sources.
The TRR Backend Data Pipeline is a Supabase-first data processing system that transforms raw data from APIs and web sources into a structured, production-ready dataset for The Reality Report platform. It supports both direct sync scripts and a resumable pipeline orchestrator.
- Supabase-first storage: Normalized core schema in Postgres
- Multi-Source Data Collection: TMDb, IMDb, Fandom Wikis, Famous Birthdays
- AI-Powered Enrichment: Gemini AI for text analysis and gap filling
- Resumable orchestration: Pipeline run tracking + stage-level resume
- Scalable Processing: Handles 10,000+ cast members and 1,000+ shows
- Python 3.11 or higher
- Supabase project URL + Service Role key
- API keys for TMDb, IMDb, and Gemini AI
- Optional: AWS credentials for S3 media mirroring
- Clone the repository
  git clone https://github.com/therealityreport/trr-backend.git
  cd trr-backend
- Install dependencies
  python3.11 -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
- Set up environment
  # Copy example environment file
  cp .env.example .env
  # Edit .env with your API keys and credentials
  nano .env
- Add credentials
  - Set the runtime Postgres URL in .env: TRR_DB_URL (Supavisor session mode on pooler.supabase.com:5432)
  - Optional secondary DSN: set TRR_DB_FALLBACK_URL to another session-mode pooler URL (pooler.supabase.com:5432); both TRR_DB_URL (primary) and TRR_DB_FALLBACK_URL (secondary) must use the session pooler on port 5432
  - Set auth/runtime secrets in .env: SUPABASE_JWT_SECRET, TRR_INTERNAL_ADMIN_SHARED_SECRET
  - Optional Supabase JWT overrides when the project ref cannot be derived cleanly: SUPABASE_PROJECT_REF and SUPABASE_JWT_ISSUER
  - Set API keys in .env: TMDB_BEARER_TOKEN (or TMDB_API_KEY), TVDB_API_KEY, IMDB_API_KEY, GEMINI_API_KEY
  - Optional object storage for hosted assets and mirroring: OBJECT_STORAGE_PROVIDER, OBJECT_STORAGE_BUCKET, OBJECT_STORAGE_REGION, OBJECT_STORAGE_ENDPOINT_URL, OBJECT_STORAGE_ACCESS_KEY_ID, OBJECT_STORAGE_SECRET_ACCESS_KEY, OBJECT_STORAGE_PUBLIC_BASE_URL
- Verify environment
  # Check that all dependencies are correctly installed
  make doctor
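The session-pooler requirement for TRR_DB_URL and TRR_DB_FALLBACK_URL can be checked before anything connects. The sketch below is illustrative only: `check_session_pooler_dsn` is a hypothetical helper, not part of this repo or of `make doctor`, and it only inspects the URL shape (scheme, host suffix, port).

```python
from urllib.parse import urlsplit


def check_session_pooler_dsn(dsn: str) -> list[str]:
    """Return a list of problems found in a Postgres DSN (empty if none)."""
    problems = []
    parts = urlsplit(dsn)
    if parts.scheme not in ("postgres", "postgresql"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    host = parts.hostname or ""
    # The README requires a Supavisor pooler host for both DSNs.
    if not host.endswith("pooler.supabase.com"):
        problems.append(f"host is not a Supabase pooler: {host!r}")
    # Session mode is served on 5432; any other port is rejected here.
    if parts.port != 5432:
        problems.append(f"port must be 5432 (session mode), got {parts.port!r}")
    return problems


if __name__ == "__main__":
    dsn = "postgresql://postgres.ref:pw@aws-1-us-east-1.pooler.supabase.com:5432/postgres"
    print(check_session_pooler_dsn(dsn) or "DSN looks like a session-mode pooler URL")
```

Running the same check on both variables catches a fallback DSN that points at a different port or at a direct database host.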
These scripts read the list of shows from core.shows (Supabase) and update tables directly. They load .env from the repo root.
# Shows (metadata + entities + watch providers)
PYTHONPATH=. python scripts/sync/sync_shows_all.py --all --verbose
# Seasons + episodes
PYTHONPATH=. python scripts/sync/sync_seasons_episodes.py --all --verbose
# People + cast/credits
PYTHONPATH=. python scripts/sync/sync_people.py --all --verbose
# Show images
PYTHONPATH=. python scripts/sync/sync_show_images.py --all --verbose
# Season/episode images
PYTHONPATH=. python scripts/sync/sync_season_episode_images.py --all --verbose
# People photos (multi-source)
PYTHONPATH=. python scripts/sync/sync_cast_photos.py --imdb-person-id nm11883948 --verbose
# BRAVOTV image runs and gallery media
PYTHONPATH=. python scripts/media/bravotv_get_images.py --show "Summer House" --season 11 --output /tmp/bravotv-summer-house-s11
PYTHONPATH=. python scripts/media/bravotv_get_images.py --person "Amanda Batula" --sources all --output /tmp/bravotv-amanda-batula
PYTHONPATH=. python scripts/media/sync_bravotv_galleries.py --show "Summer House" --person-name "Amanda Batula" --apply
# TMDb resolution + backfill (shows)
PYTHONPATH=. python scripts/sync/resolve_tmdb_ids_via_find.py --all --verbose
PYTHONPATH=. python scripts/backfill/backfill_tmdb_show_details.py --all --verbose
# TMDb entities (networks, production companies) + object-storage logo mirroring
PYTHONPATH=. python scripts/sync/sync_tmdb_show_entities.py --all --verbose
# TMDb watch providers + object-storage logo mirroring
PYTHONPATH=. python scripts/sync/sync_tmdb_watch_providers.py --all --verbose
Legacy composite runner:
python -m scripts.sync_all_tables --all
python -m scripts.sync_all_tables --tables shows,episodes,episode_appearances --imdb-id tt1234567
Common filters: --show-id, --tmdb-id, --imdb-id, --limit, --dry-run, --verbose.
Media mirroring uses the backend OBJECT_STORAGE_* contract. Production defaults are Cloudflare R2 with OBJECT_STORAGE_PROVIDER=r2, OBJECT_STORAGE_REGION=auto, and a public base URL such as https://media.thereality.report. Optional: OBJECT_STORAGE_PROFILE.
Runtime DB access uses TRR_DB_URL and optional TRR_DB_FALLBACK_URL. Tooling can still accept legacy DB envs where explicitly documented.
TMDb backfill flow: resolve missing tmdb_id via /find using IMDb ids, then backfill /tv/{id} details into core.shows (typed columns + tmdb_meta). Both scripts are idempotent; omit --all for incremental updates. See docs/architecture.md for the full TMDb enrichment pipeline documentation.
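The two-step flow above (resolve via /find, then backfill from /tv/{id}) can be sketched without any network calls. This is a rough illustration, not the code in scripts/sync or scripts/backfill: the helper names are hypothetical, and only the TMDb v3 URL shapes and the `tv_results` key of the /find response are taken from the public TMDb API.

```python
from __future__ import annotations

from urllib.parse import quote, urlencode

TMDB_BASE = "https://api.themoviedb.org/3"


def find_url(imdb_id: str) -> str:
    """URL for TMDb /find, looking a title up by its IMDb id."""
    query = urlencode({"external_source": "imdb_id"})
    return f"{TMDB_BASE}/find/{quote(imdb_id)}?{query}"


def tv_details_url(tmdb_id: int) -> str:
    """URL for TMDb /tv/{id} details."""
    return f"{TMDB_BASE}/tv/{tmdb_id}"


def pick_tv_id(find_payload: dict) -> int | None:
    """Take the first TV match from a /find response, if there is one."""
    results = find_payload.get("tv_results") or []
    return results[0]["id"] if results else None


def plan_backfill(show: dict, find_payload: dict | None = None) -> dict:
    """Decide the next step for one show row (hypothetical row shape)."""
    tmdb_id = show.get("tmdb_id")
    # Only consult /find when the id is missing, so re-runs change nothing.
    if tmdb_id is None and find_payload is not None:
        tmdb_id = pick_tv_id(find_payload)
    if tmdb_id is None:
        return {"action": "skip", "reason": "no tmdb_id resolved"}
    return {"action": "fetch_details", "tmdb_id": tmdb_id, "url": tv_details_url(tmdb_id)}
```

Because an existing tmdb_id is never overwritten, running the plan twice yields the same result, which is the idempotency property the scripts are described as having.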
Incremental/resume flags: --incremental/--no-incremental, --resume/--no-resume, --force, --since.
Incremental mode uses core.sync_state + shows.most_recent_episode to skip unchanged shows and retry failures.
After seasons/episodes sync, shows.show_total_seasons is normalized to the count of seasons with season_number > 0.
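As a small illustration of that normalization rule, the sketch below counts only seasons numbered above zero (on TMDb, season 0 typically holds specials). The row shape is an assumption; the real sync may well do this in SQL.

```python
def count_regular_seasons(seasons: list[dict]) -> int:
    """Count seasons whose season_number is greater than 0."""
    # A missing or null season_number is treated as 0 and therefore excluded.
    return sum(1 for s in seasons if (s.get("season_number") or 0) > 0)


if __name__ == "__main__":
    rows = [{"season_number": 0}, {"season_number": 1}, {"season_number": 2}]
    print(count_regular_seasons(rows))
```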
Per-show progress is stored in core.sync_state (one row per show + table).
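The incremental decision described above might look roughly like the following. Everything here is an assumption made for illustration: the `SyncState` fields, the status values, and the `should_sync` name are hypothetical and do not reflect the actual columns of core.sync_state.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SyncState:
    """Hypothetical view of one core.sync_state row (one show + one table)."""
    status: str  # assumed values: "success" or "failed"
    last_seen_most_recent_episode: str | None


def should_sync(
    most_recent_episode: str | None,
    state: SyncState | None,
    *,
    force: bool = False,
) -> bool:
    """Return True when a show needs to be processed in incremental mode."""
    if force or state is None:
        return True  # forced run, or the show has never been synced
    if state.status != "success":
        return True  # previous attempt failed: retry it
    # Unchanged shows are skipped; a new most-recent episode triggers a sync.
    return state.last_seen_most_recent_episode != most_recent_episode
```

The `force` argument mirrors the --force flag in spirit: it bypasses the stored state without deleting it.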
The pipeline orchestrator records runs and stages in the pipeline schema and supports resume-by-hash.
python -m trr_backend.cli pipeline run --all --verbose
python -m trr_backend.cli pipeline list
python -m trr_backend.cli pipeline status <run-id>
See docs/architecture/pipeline.md for details.
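One common way to implement resume-by-hash is to hash a canonical form of the run configuration and, when an earlier run with the same hash exists, continue from its first unfinished stage. The sketch below assumes that approach; the function names and stage names are hypothetical, and docs/architecture/pipeline.md remains the reference for what the orchestrator actually does.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Stable SHA-256 of a run configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stages_to_run(stages: list[str], completed: set[str]) -> list[str]:
    """Return the stages from the first unfinished one onward."""
    for index, stage in enumerate(stages):
        if stage not in completed:
            return stages[index:]
    return []  # every stage already completed


if __name__ == "__main__":
    stages = ["shows", "seasons_episodes", "people"]  # hypothetical stage names
    print(config_hash({"all": True, "limit": None})[:12])
    print(stages_to_run(stages, completed={"shows"}))
```

Resuming from the first unfinished stage, rather than skipping only completed ones, keeps later stages from running against partial upstream data.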
Supabase JWT verification is local-only. The backend validates tokens with SUPABASE_JWT_SECRET, derives the expected project issuer from SUPABASE_PROJECT_REF or the Supabase/runtime URLs, and still accepts legacy service_role JWTs with iss="supabase" when the signature and ref match.
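To make the verification rules concrete, here is a standard-library sketch of HS256 validation with the issuer logic described above. It is not the backend's implementation: the function names are hypothetical, the issuer format `https://<ref>.supabase.co/auth/v1` is an assumption about what is derived from SUPABASE_PROJECT_REF, and audience checks are left out.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _signature(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_hs256(claims: dict, secret: str) -> str:
    """Build a signed token; included only so the example can be exercised."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    payload = _b64url_encode(json.dumps(claims).encode("utf-8"))
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_b64url_encode(_signature(signing_input, secret))}"


def verify_supabase_jwt(token: str, secret: str, project_ref: str, now: float | None = None) -> dict:
    """Verify signature, expiry, and issuer locally; raise ValueError on failure."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    exp = claims.get("exp")
    if exp is not None and (time.time() if now is None else now) >= exp:
        raise ValueError("token expired")
    issuer = claims.get("iss")
    # Assumed issuer format for the project; legacy service_role tokens use iss="supabase".
    project_issuer = f"https://{project_ref}.supabase.co/auth/v1"
    legacy_ok = (
        issuer == "supabase"
        and claims.get("role") == "service_role"
        and claims.get("ref") == project_ref
    )
    if issuer != project_issuer and not legacy_ok:
        raise ValueError("unexpected issuer")
    return claims
```

The signature is checked before any claim is read, so an attacker cannot steer the issuer branch with an unsigned payload.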
Never commit API keys, AWS credentials, or private keys. Rotate any exposed credentials immediately.
- api/: FastAPI app (Supabase-backed API + WebSockets)
- trr_backend/: Shared library code (reused by API + pipeline)
- trr_backend/integrations/: External metadata clients (IMDb/TMDb/etc.)
- trr_backend/media/bravotv/: BRAVOTV image-run pipeline, media candidate normalization, run review artifacts, and admin review action services
- trr_backend/pipeline/: Pipeline orchestration logic
- trr_backend/cli/: CLI entrypoints (Typer)
- scripts/: Data sync scripts and utilities
- supabase/: Database schema, migrations, and seeds
- docs/: Architecture and operating docs
For detailed repository structure, module dependency graphs, and architecture diagrams, see docs/Repository/README.md.
The backend uses a Supabase database with data sync scripts that fetch from external APIs and populate the database.
Current Architecture:
- Data stored in Supabase PostgreSQL (core.* schema)
- Sync scripts in scripts/ fetch from TMDb, IMDb, Fandom wikis
- Resumable pipeline orchestrator in trr_backend/pipeline/
- FastAPI app in api/ serves data to the frontend
See docs/architecture.md for detailed architecture documentation.
Note: The legacy numbered pipeline (1-ShowInfo/, 2-CastInfo/, etc.) has been removed. Git history preserves these files if needed for reference. Current data ingestion uses the DB Sync Scripts documented above.
| Source | Purpose | Data Type | Rate Limits |
|---|---|---|---|
| TMDb API | Primary show and cast metadata | Shows, Cast, Episodes | 40 requests/10s |
| IMDb API | Episode details and additional cast info | Episodes, Credits | 1000 requests/day |
| Fandom Wikis | Reality show-specific cast details | Cast bios, Show info | Respectful scraping |
| Famous Birthdays | Biographical data | Birthdays, Zodiac signs | Rate limited |
| Gemini AI | Text analysis and gap filling | Guest names, Descriptions | 1000 requests/minute |
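A client-side limiter is one way to stay inside limits like the ones in the table. The sliding-window sketch below is an illustration rather than the repo's integration code: `SlidingWindowLimiter` is a hypothetical class, and the clock and sleep functions are injectable so it can be tested without waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of period seconds."""

    def __init__(self, max_calls: int, period: float, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def acquire(self) -> float:
        """Block until a call is allowed; return the total seconds waited."""
        waited = 0.0
        while True:
            now = self._clock()
            # Drop timestamps that have aged out of the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return waited
            # Window is full: wait until the oldest call expires, then re-check.
            delay = self.period - (now - self._calls[0])
            self._sleep(delay)
            waited += delay


if __name__ == "__main__":
    tmdb_limiter = SlidingWindowLimiter(max_calls=40, period=10.0)  # TMDb row of the table
    tmdb_limiter.acquire()  # call before each request
```

Calling `acquire()` before each request keeps the caller under the configured budget. The limiter is per-process, so parallel workers would each need a share of the limit.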
Copy .env.example to .env (never commit .env):
# API Keys
TMDB_BEARER_TOKEN=your_tmdb_bearer_token
TMDB_API_KEY=your_tmdb_api_key
TVDB_API_KEY=your_tvdb_api_key
IMDB_API_KEY=your_imdb_api_key
GEMINI_API_KEY=your_gemini_api_key
# Supabase Configuration
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
TRR_DB_URL=postgresql://postgres.<project>:password@aws-1-us-east-1.pooler.supabase.com:5432/postgres
TRR_DB_FALLBACK_URL=
# Optional secondary DSN. Must also be a session-mode pooler URL (pooler.supabase.com:5432).
# Optional Configuration
REALITEASE_TMDB_SHOW_LIMIT=5
GEMINI_MODEL=gemini-2.5-flash
GEMINI_MODEL_FAST=gemini-2.5-flash
GEMINI_MODEL_PRO=gemini-2.5-pro
GOOGLE_GEMINI_MODEL=gemini-2.5-flash
# Deprecated fallback alias (temporary)
GEMINI-MODEL=gemini-2.5-flash
# Legacy Google Sheets (archived)
SPREADSHEET_NAME=Realitease2025Data
GOOGLE_APPLICATION_CREDENTIALS=keys/service-account.json
For the authoritative schema, see docs/db/schema.md and docs/architecture.md.
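The environment listing above includes several overlapping Gemini model variables plus a deprecated alias. The sketch below shows one plausible way such a lookup could be resolved. The precedence order, the default value, and the `resolve_gemini_model` name are all assumptions made for illustration; the backend's real resolution order is not documented here.

```python
import os
import warnings


def resolve_gemini_model(env=None, default: str = "gemini-2.5-flash") -> str:
    """Pick a Gemini model name from environment-style settings."""
    env = os.environ if env is None else env
    # Assumed precedence: the primary name first, then the Google-prefixed one.
    for name in ("GEMINI_MODEL", "GOOGLE_GEMINI_MODEL"):
        value = (env.get(name) or "").strip()
        if value:
            return value
    # Deprecated alias from the listing above; honored last, with a warning.
    legacy = (env.get("GEMINI-MODEL") or "").strip()
    if legacy:
        warnings.warn(
            "GEMINI-MODEL is deprecated; set GEMINI_MODEL instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return legacy
    return default


if __name__ == "__main__":
    print(resolve_gemini_model())
```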
Each sync stage can be run independently with various options:
# Import shows from lists
PYTHONPATH=. python scripts/import/import_shows_from_lists.py --imdb-list ... --tmdb-list ...
# Enrich shows (TMDb metadata + entities + providers)
PYTHONPATH=. python scripts/sync/sync_shows_all.py --all --verbose
# Seasons + episodes
PYTHONPATH=. python scripts/sync/sync_seasons_episodes.py --all --verbose
# People + cast
PYTHONPATH=. python scripts/sync/sync_people.py --all --verbose
- Logs: Check ../artifacts/trr-backend/logs/ (or the logs symlink) for execution logs and results
- Dry Run: Use the --dry-run flag to preview changes before writing
- Limited Processing: Use --limit to test with smaller datasets
- Progress Tracking: Most scripts provide detailed progress output
The system includes comprehensive caching for efficiency:
- Gemini Responses: Cached in ../artifacts/trr-backend/.cache/ (or the .cache symlink)
- TMDb API: Per-session caching to respect rate limits
- IMDb Data: Local caching for repeated requests
- Custom Cache: Use --cache-file to specify custom locations
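A response cache of the kind listed above can be as simple as a JSON file keyed by a hash of the request. The following is a minimal sketch under that assumption: `JsonFileCache` is a hypothetical class, and the on-disk format used by the real Gemini cache may differ.

```python
import hashlib
import json
from pathlib import Path


class JsonFileCache:
    """Tiny persistent cache: SHA-256 of the prompt -> stored response."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self._data = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self._data = {}

    @staticmethod
    def key_for(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt: str, compute):
        """Return the cached response, calling compute(prompt) only on a miss."""
        key = self.key_for(prompt)
        if key not in self._data:
            self._data[key] = compute(prompt)
            # Persist after every miss so an interrupted run keeps its progress.
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self._data, indent=2), encoding="utf-8")
        return self._data[key]
```

Hashing the prompt keeps keys short and filesystem-safe. The trade-off is that any change to the prompt text, however small, produces a cache miss.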
External APIs → Ingestion Scripts → Supabase → API/Exports
      ↓                 ↓              ↓            ↓
TMDb/IMDb/Fandom   import/sync      core.*    FastAPI + S3
- Type Checking: Validates data types and formats
- Required Fields: Ensures all required fields are present
- Consistency Checks: Verifies data consistency across sources
- Conflict Resolution: Handles data conflicts intelligently
- Comprehensive Logging: Detailed logs for all operations
- Graceful Failures: Handles API failures without data loss
- Retry Logic: Automatic retry for transient failures
- Recovery Mechanisms: Data recovery capabilities
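The retry behavior listed above is commonly implemented as exponential backoff with jitter. The sketch below shows that general technique, not the repo's own retry code; `TransientError`, `retry`, and the default delays are hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 429/5xx)."""


def retry(fn, *, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep, jitter=random.random):
    """Call fn(); on TransientError, back off exponentially and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Scale the delay to 50-100% of its value so concurrent
            # workers do not all retry at the same instant.
            sleep(delay * (0.5 + jitter() / 2))
```

Only `TransientError` is retried. Any other exception propagates immediately, so programming errors are not hidden behind repeated attempts.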
- Processing Speed: 1,000+ records per hour
- Error Rates: < 1% processing errors
- API Compliance: 100% rate limit compliance
- Resource Usage: Efficient memory and CPU usage
TRR-Backend/
├── api/                    # FastAPI application
├── trr_backend/            # Shared library code
│   ├── db/                 # Database utilities
│   ├── integrations/       # External API clients (IMDb, TMDb, etc.)
│   ├── ingestion/          # Data ingestion modules
│   └── repositories/       # Database access layer
├── scripts/                # Data sync and utility scripts
│   ├── dev/                # Development tools (doctor.py)
│   ├── db/                 # Database SQL scripts
│   └── supabase/           # Supabase-specific utilities
├── supabase/               # Database schema and migrations
│   ├── migrations/         # SQL migrations
│   └── schema_docs/        # Auto-generated schema documentation
├── tests/                  # Test suite
├── docs/                   # Documentation
├── requirements.txt        # Python dependencies
└── Makefile                # Common development tasks
# Install dependencies
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Set up environment
cp .env.example .env
# Edit .env with your credentials
# Verify setup
make doctor
# Run tests
pytest
See docs/cloud/ for detailed cloud deployment guides.
- Rotate Secrets: This repo previously tracked a .env file. Assume any keys in it are compromised and rotate them.
- Never Commit .env: Local .env files are gitignored; use .env.example as the template.
- Credentials: Keep service account JSONs under keys/ (gitignored) or inject via CI secrets.
- Generated Output: Runtime artifacts live outside the repo root in ../artifacts/trr-backend/ (e.g. logs/, .cache/, debug_html/, out/). Use symlinks in the repo root for convenience.
- Optional History Purge: If you need to remove leaked secrets from git history, rotate keys first, then use a history-rewrite tool and force-push.
- PRD: See PRD.md for comprehensive product requirements
- Architecture: See docs/architecture.md for a high-level system overview
- Pipeline Orchestration: See docs/architecture/pipeline.md for staged runs and resume logic
- DB Schema: See docs/db/schema.md for core tables and views
- Setup Guides: See docs/cloud/ for deployment documentation
- Local Development: See docs/README_local.md for additional setup notes
- API Docs UI: See docs/api/run.md for /docs, /redoc, and /openapi.json
- Legacy Google Sheets Pipeline: See docs/legacy/google_sheets_pipeline.md
See CONTRIBUTING.md.
- Fork the repository
- Make your changes (default: main; this repo does not enforce a branch naming convention)
- Commit your changes (git commit -m 'Describe your change')
- Push your changes
- Open a Pull Request
- Follow Python PEP 8 style guidelines
- Add comprehensive error handling
- Include detailed logging
- Test with dry-run mode first
- Update documentation for new features
- Data Volume: 10,000+ cast members, 1,000+ shows
- Processing Speed: 1,000+ records per hour
- Accuracy: 98%+ accuracy compared to source APIs
- Uptime: 99%+ availability
- Error Rate: < 1% processing errors
Supabase Connectivity
# Verify environment and connectivity
make doctor
Schema Cache Issues
# Reload PostgREST schema cache
bash scripts/reload_postgrest_schema.sh
Data Parity Checks
# Validate credits/media parity
PYTHONPATH=. python scripts/verify/verify_credits_parity.py
PYTHONPATH=. python scripts/verify/verify_media_unification.py
- Issues: Create an issue on GitHub
- Documentation: Check the docs/ directory
- Logs: Review the logs/ directory for error details
- Dry Run: Use the --dry-run flag to test changes
This project is licensed under the MIT License - see the LICENSE file for details.
- TMDb: For comprehensive movie and TV database
- IMDb: For detailed episode and cast information
- Google: For Gemini AI
- Fandom: For reality TV show wikis and community data
The Reality Report Backend Data Pipeline - Transforming reality TV data into actionable insights.
For questions or support, please open an issue or contact the development team.