The DebateFlow benchmark spec (plans/SPEC.md) defines a benchmark for evaluating LLM debate judgment. Before any evaluation can happen, we need debates to judge. This plan covers the first infrastructure piece: a synthetic debate generator that produces 4-turn transcripts with controlled asymmetries and control debates.
```
debateflow/
├── pyproject.toml
├── .env.example                 # API key template
├── resolutions.yaml             # Seed resolutions
├── plans/                       # Specs and planning docs
│   ├── PLAN.md
│   ├── SPEC.md
│   ├── VOICE-SPEC.md
│   └── TELEGRAM-JUDGING-SPEC.md
├── src/debateflow/
│   ├── models.py                # Pydantic data models
│   ├── providers.py             # LLM provider factory (Anthropic + OpenAI)
│   ├── prompts.py               # System prompts + constraint injection templates
│   ├── generator.py             # Core 4-turn generation pipeline
│   ├── compile.py               # JSONL compilation + stats
│   ├── publish.py               # HuggingFace Hub publication
│   ├── dataset_card.py          # Dataset card template generation
│   ├── cli.py                   # Typer CLI entry point
│   ├── voice.py                 # ElevenLabs TTS voice synthesis
│   ├── telegram_judging.py      # Telegram judging session management
│   ├── server.py                # Web annotation server
│   └── static/                  # Web UI (annotate + review)
├── output/                      # Generated debates (gitignored)
│   └── debates/                 # Individual JSON files
└── tests/
    ├── test_models.py
    └── test_prompts.py
```
```python
from datetime import datetime
from enum import Enum

from pydantic import BaseModel

class DebateCategory(str, Enum):
    POLICY = "policy"
    VALUES = "values"
    EMPIRICAL = "empirical"

class WeaknessType(str, Enum):
    WEAK_EVIDENCE = "weak_evidence"
    ARGUMENT_DROPPING = "argument_dropping"
    LOGICAL_GAPS = "logical_gaps"
    BURDEN_OF_PROOF = "burden_of_proof"

class Side(str, Enum):
    AFF = "aff"
    NEG = "neg"

class Turn(BaseModel):
    speaker: Side
    role: str  # opening | response | rebuttal | closing
    text: str

class ModelConfig(BaseModel):
    provider: str  # "anthropic" | "openai"
    model_name: str  # e.g. "claude-sonnet-4-20250514"
    temperature: float = 0.7

class ConstraintInfo(BaseModel):
    type: WeaknessType | None = None  # None = control debate
    target_side: Side | None = None

class DebateMetadata(BaseModel):
    debate_id: str  # truncated UUID (8 chars)
    resolution: str
    category: DebateCategory
    aff_model: ModelConfig
    neg_model: ModelConfig
    constraint: ConstraintInfo
    is_control: bool
    generated_at: datetime
    generator_version: str = "0.1.0"

class Debate(BaseModel):
    metadata: DebateMetadata
    turns: list[Turn]  # exactly 4
```

Each debate is self-contained JSON with full reproducibility metadata.
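The `debate_id` field is described as a truncated UUID of 8 characters. A minimal sketch of one way to produce such an ID, assuming the first 8 hex characters of a random UUID4 are used (the helper name `new_debate_id` is hypothetical, not taken from the plan):

```python
import uuid


def new_debate_id(length: int = 8) -> str:
    """Return the first `length` hex characters of a random UUID4.

    Assumption: the plan only says "truncated UUID (8 chars)"; the UUID
    version and the hex encoding are choices made for this sketch.
    """
    return uuid.uuid4().hex[:length]
```

Eight hex characters give about 4.3 billion possible IDs, which is ample at pilot scale, though collisions are not impossible.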
| Turn | Speaker | Role | Constraint applies? |
|---|---|---|---|
| 0 | Aff | opening | If Aff is constrained |
| 1 | Neg | response | If Neg is constrained |
| 2 | Aff | rebuttal | If Aff is constrained |
| 3 | Neg | closing | If Neg is constrained |
Exception: `argument_dropping` applies only to response/closing turns, because the speaker needs opponent arguments to drop.
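The table and the exception can be expressed as a small predicate. This is a sketch under stated assumptions: the function name `constraint_applies` and the plain-string representation of sides and weakness types are illustrative, not part of the plan. Read literally, the exception means an `argument_dropping` constraint on Aff never fires, since Aff speaks only the opening and rebuttal; the sketch follows the text as written.

```python
# (speaker, role) for turns 0..3, matching the table above.
TURNS = [
    ("aff", "opening"),
    ("neg", "response"),
    ("aff", "rebuttal"),
    ("neg", "closing"),
]


def constraint_applies(turn_index, target_side, weakness):
    """Return True if the weakness overlay should be injected on this turn."""
    if target_side is None or weakness is None:
        return False  # control debate: no constraint anywhere
    speaker, role = TURNS[turn_index]
    if speaker != target_side:
        return False  # only the constrained side is affected
    if weakness == "argument_dropping":
        # Exception from the text: response/closing turns only.
        return role in ("response", "closing")
    return True
```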
- Pick resolution, constraint type, target side (or mark as control)
- For each of the 4 turns:
  - Build system prompt for that side (base + optional weakness overlay)
  - Build user prompt with resolution + all previous speeches as context
  - Call the appropriate LLM (Aff model or Neg model)
  - Append speech text to transcript
- Assemble `Debate` object, write as individual JSON to `output/debates/`
Turns are sequential within a debate (each depends on prior turns). Batch generation is also sequential to avoid rate-limit complexity at pilot scale.
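The sequential dependency between turns can be sketched as a simple loop. This is an illustration only: the LLM call is injected as a plain callable (`call_llm` is a hypothetical stand-in for the real provider call), the prompt wording is a placeholder, and turns are plain dicts rather than the pydantic models.

```python
from typing import Callable

TURN_ORDER = [
    ("aff", "opening"),
    ("neg", "response"),
    ("aff", "rebuttal"),
    ("neg", "closing"),
]


def generate_debate(
    resolution: str,
    call_llm: Callable[[str, str, str], str],
) -> list[dict]:
    """Generate 4 turns in order; each turn sees all previous speeches.

    call_llm(side, system_prompt, user_prompt) returns the speech text.
    """
    turns: list[dict] = []
    for side, role in TURN_ORDER:
        stance = "for" if side == "aff" else "against"
        system_prompt = f"Argue {stance} the resolution: {resolution}"
        # The user prompt carries the resolution plus the transcript so far.
        parts = [f"Resolution: {resolution}"]
        parts += [f"{t['speaker']} ({t['role']}): {t['text']}" for t in turns]
        parts.append(f"Deliver your {role} speech.")
        text = call_llm(side, system_prompt, "\n\n".join(parts))
        turns.append({"speaker": side, "role": role, "text": text})
    return turns
```

Injecting the call as a parameter keeps the loop testable without network access; a fake that returns canned text is enough to check turn order and context passing.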
`providers.py` is a factory that creates pydantic-ai `Agent` instances from a `ModelConfig`. It supports both `AnthropicModel` and `OpenAIModel` via pydantic-ai's provider abstraction and reads API keys from environment variables.
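A minimal sketch of the provider-resolution step, kept free of third-party imports so it runs anywhere. It builds the `provider:model` identifier string that pydantic-ai's `Agent` accepts (for example `openai:gpt-4o`) instead of instantiating the model classes named above, and checks for the API key. The helper names are hypothetical; the environment variable names are the ones the Anthropic and OpenAI SDKs read by default.

```python
import os

# Default API-key environment variables for each supported provider.
API_KEY_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def model_identifier(provider: str, model_name: str) -> str:
    """Build a 'provider:model' string, rejecting unsupported providers."""
    if provider not in API_KEY_ENV:
        raise ValueError(f"unsupported provider: {provider!r}")
    return f"{provider}:{model_name}"


def require_api_key(provider: str, env=None) -> str:
    """Return the provider's API key or fail early with a clear message."""
    env = os.environ if env is None else env
    var = API_KEY_ENV[provider]
    key = env.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key
```

Failing before the first LLM call avoids discovering a missing key halfway through a batch.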
Each side gets a brief system prompt: argue for/against the resolution, no meta-commentary, 200–400 words per turn.
Per-role instructions appended to user prompt:
- opening: Present strongest arguments, establish framework
- response: Engage opponent's opening, refute and counter-argue
- rebuttal: Defend arguments, expose opponent weaknesses
- closing: Summarize, weigh key arguments
Appended to system prompt on constrained side's turns:
- weak_evidence: Rely on anecdotes, vague authorities, hedging. Structure coherent but evidence weak.
- argument_dropping: Ignore 1–2 of opponent's key arguments. Don't acknowledge the gap.
- logical_gaps: Include 1–2 fallacies (hasty generalization, false dichotomy, non-sequitur). Surface rhetoric confident.
- burden_of_proof: Assert without support, demand opponent disprove. "Unless they can show otherwise..."
These are the quality-critical prompts. Calibrated for "noticeable by an attentive judge" — not comically bad.
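The prompt layering described above (base system prompt, optional weakness overlay, per-role instruction on the user prompt) can be sketched as follows. The wording of every template here is a placeholder condensed from the bullets, not the calibrated production text, and the function names are assumptions.

```python
BASE_PROMPT = (
    "You are debating the resolution: {resolution}\n"
    "Argue {stance} the resolution. No meta-commentary. "
    "Write 200-400 words per turn."
)

ROLE_INSTRUCTIONS = {
    "opening": "Present your strongest arguments and establish a framework.",
    "response": "Engage the opponent's opening; refute and counter-argue.",
    "rebuttal": "Defend your arguments and expose opponent weaknesses.",
    "closing": "Summarize and weigh the key arguments.",
}

WEAKNESS_OVERLAYS = {
    "weak_evidence": "Rely on anecdotes, vague authorities, and hedging. "
    "Keep the structure coherent but the evidence weak.",
    "argument_dropping": "Ignore 1-2 of the opponent's key arguments. "
    "Do not acknowledge the gap.",
    "logical_gaps": "Include 1-2 fallacies (hasty generalization, false "
    "dichotomy, non-sequitur). Keep the surface rhetoric confident.",
    "burden_of_proof": "Assert claims without support and demand that the "
    "opponent disprove them.",
}


def build_system_prompt(resolution, side, weakness=None):
    """Base prompt for the side, plus the overlay on constrained turns."""
    stance = "for" if side == "aff" else "against"
    prompt = BASE_PROMPT.format(resolution=resolution, stance=stance)
    if weakness is not None:
        prompt += "\n\n" + WEAKNESS_OVERLAYS[weakness]
    return prompt


def build_user_prompt(resolution, role, previous):
    """Resolution, prior speeches as (speaker, role, text), role instruction."""
    parts = [f"Resolution: {resolution}"]
    parts += [f"{speaker} ({r}): {text}" for speaker, r, text in previous]
    parts.append(ROLE_INSTRUCTIONS[role])
    return "\n\n".join(parts)
```

Keeping the overlay in the system prompt and the role instruction in the user prompt means a control debate and a constrained debate differ by exactly one appended paragraph on the constrained side.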
Three commands via Typer:
```bash
# Generate debates
uv run python cli.py generate -n 10 \
  --aff-provider anthropic --aff-model claude-sonnet-4-20250514 \
  --neg-provider openai --neg-model gpt-4o \
  --control-ratio 0.2

# Generate with specific resolution or category
uv run python cli.py generate -n 5 --category values
uv run python cli.py generate -n 1 -r "This house would ban private cars in city centers"

# Compile to JSONL
uv run python cli.py compile

# Show dataset stats (weakness distribution, category balance, side balance)
uv run python cli.py stats
```

Model defaults are loaded from `resolutions.yaml`, so a bare `generate -n 10` works out of the box.
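One way `--control-ratio` could translate into a batch plan is sketched below. This is an assumption about the mechanism: the plan does not say whether controls are a fixed count or drawn per debate, and the function name `plan_batch` is hypothetical. The sketch uses a fixed count and cycles through weakness types to keep them balanced.

```python
import random

WEAKNESSES = [
    "weak_evidence",
    "argument_dropping",
    "logical_gaps",
    "burden_of_proof",
]
SIDES = ["aff", "neg"]


def plan_batch(n: int, control_ratio: float, rng: random.Random) -> list[dict]:
    """Return n debate plans: controls first, then constrained, shuffled."""
    n_control = round(n * control_ratio)
    plans = [{"weakness": None, "target_side": None} for _ in range(n_control)]
    for i in range(n - n_control):
        plans.append(
            {
                # Cycle through weakness types so counts stay balanced.
                "weakness": WEAKNESSES[i % len(WEAKNESSES)],
                "target_side": rng.choice(SIDES),
            }
        )
    rng.shuffle(plans)
    return plans
```

With `n=5` and `control_ratio=0.2` this yields one control and four constrained debates, one per weakness type. Note that Python's `round` rounds halves to the nearest even integer, so `n=5` with a ratio of `0.1` would produce no controls.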
12 seed resolutions (4 per category). Default model configs for both sides. Example:

```yaml
resolutions:
  - text: "This house would ban private car ownership in city centers"
    category: policy
  # ... 11 more
defaults:
  aff:
    provider: anthropic
    model_name: claude-sonnet-4-20250514
    temperature: 0.7
  neg:
    provider: anthropic
    model_name: claude-sonnet-4-20250514
    temperature: 0.7
```

- `compile_to_jsonl()`: Read all `output/debates/*.json`, validate with pydantic, write one line per debate to `output/debateflow.jsonl`
- `show_stats()`: Print counts by weakness type, category, constrained side, control vs. constrained
```
pydantic>=2.0.0
pydantic-ai>=1.39.0
typer>=0.12.0
rich>=13.0.0
pyyaml>=6.0
python-dotenv>=1.0.0
datasets>=3.0.0
huggingface_hub>=0.25.0
```
A HuggingFace dataset repo containing:
- `data/debateflow.jsonl`: the compiled dataset (one debate per line)
- `README.md`: dataset card with YAML metadata header
```yaml
---
language:
- en
license: cc-by-4.0
task_categories:
- text-classification
tags:
- debate
- argumentation
- benchmark
- llm-as-judge
pretty_name: "DebateFlow"
size_categories:
- n<1K  # update when final count known
---
```

The `README.md` follows HuggingFace's standard template:
- Dataset Description — what DebateFlow is, citation info
- Dataset Structure — schema description (fields, types, enums), example instance
- Dataset Creation — synthetic generation methodology, resolution categories, constraint types
- Considerations for Using the Data — synthetic data limitations, stylistic homogeneity, English-only
- Additional Information — license, author, link to SPEC.md and paper (if any)
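A rough sketch of how a card could be rendered from computed stats, so that `size_categories` tracks the real debate count instead of being edited by hand. The function names and the placeholder section bodies are assumptions; only the first two size buckets are handled, and the header is abbreviated.

```python
SECTIONS = [
    "Dataset Description",
    "Dataset Structure",
    "Dataset Creation",
    "Considerations for Using the Data",
    "Additional Information",
]


def size_category(n: int) -> str:
    """Map a debate count to a size bucket (only two buckets sketched)."""
    if n < 1_000:
        return "n<1K"
    if n < 10_000:
        return "1K<n<10K"
    raise ValueError("larger buckets are not covered by this sketch")


def render_card(n_debates: int) -> str:
    """Render an abbreviated YAML header plus one heading per section."""
    header = "\n".join(
        [
            "---",
            "language:",
            "- en",
            "license: cc-by-4.0",
            'pretty_name: "DebateFlow"',
            "size_categories:",
            f"- {size_category(n_debates)}",
            "---",
        ]
    )
    # Section bodies are placeholders; real text comes from the template.
    body = "\n\n".join(f"## {title}\n\n(to be written)" for title in SECTIONS)
    return f"{header}\n\n# DebateFlow\n\n{body}\n"
```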
```bash
# Compile JSONL, generate dataset card, push to HuggingFace Hub
uv run python cli.py publish --repo spodhajsky/debateflow

# Dry run: generate card + JSONL locally without pushing
uv run python cli.py publish --repo spodhajsky/debateflow --dry-run
```

```python
from pathlib import Path

from datasets import Dataset
from debateflow.compile import compile_to_jsonl  # signature assumed

def publish(repo_id: str, input_dir: Path, dry_run: bool = False):
    # 1. Compile debates to JSONL (reuses compile.py)
    jsonl_path = compile_to_jsonl(input_dir)  # assumed to return the JSONL path
    # 2. Load JSONL as HuggingFace Dataset
    dataset = Dataset.from_json(str(jsonl_path))
    # 3. Generate dataset card from template + computed stats
    # 4. Push to Hub (unless dry_run)
    if not dry_run:
        dataset.push_to_hub(repo_id, private=False)
```

Requires the `datasets` and `huggingface_hub` libraries. Auth via `huggingface-cli login` (token cached locally).
```
datasets>=3.0.0
huggingface_hub>=0.25.0
```
When the evaluation harness is built (running LLMs as judges on generated debates), the output should conform to the EvalEval "Every Eval Ever" schema:
- Aggregate JSON: Run-level metadata (which judge model, benchmark version, overall scores per dimension)
- Instance-level JSONL: Per-debate judge output (winner prediction, rubric scores, reasoning trace)
The debate generation format designed above is compatible — individual debates can be referenced as source_data in the EvalEval aggregate schema. No changes needed to the generation pipeline; this is purely an evaluation-harness concern.
- `pyproject.toml`, `uv sync`, `.env.example`, `.gitignore`
- `models.py` with all pydantic schemas
- `tests/test_models.py`: serialization roundtrip

- `prompts.py`: base prompts, turn instructions, 4 weakness templates
- `resolutions.yaml`: 12 seed resolutions
- `tests/test_prompts.py`: verify prompt construction per weakness type

- `providers.py`: Anthropic + OpenAI factory
- `generator.py`: single debate + batch generation
- Manual test: generate 1 debate, inspect JSON

- `cli.py`: generate / compile / stats commands
- `compile.py`: JSONL compilation + stats display

- `dataset_card.py`: template generation from computed stats
- `publish.py`: compile + push_to_hub wrapper
- Add `publish` command to CLI
- Test with `--dry-run` to verify card + JSONL without pushing

- Generate 5 debates with mixed constraints
- Run `compile` and `stats`
- Manually review 2–3 debates for constraint quality (is the weakness detectable but not cartoonish?)
- `publish --dry-run` to verify dataset card renders correctly
After implementation:
- `uv run python cli.py generate -n 1`: produces a single debate JSON in `output/debates/`
- `uv run python cli.py generate -n 5 --control-ratio 0.2`: produces ~4 constrained + ~1 control
- `uv run python cli.py stats`: shows balanced distribution
- `uv run python cli.py compile`: produces `output/debateflow.jsonl`
- Manual read of 2–3 generated debates to assess constraint naturalness
- `uv run python cli.py publish --repo test/debateflow --dry-run`: generates dataset card + JSONL locally
- `uv run pytest tests/`: models and prompts pass