NovaVision generates an image from the emotion of a sentence, then checks whether the image actually conveys that emotion by recovering the emotion from the image with a swappable probe and comparing it to the one you asked for. Two things share one pipeline: a text-to-art web app and an emotion-controllability benchmark. The benchmark is built around the failure modes that make such a measurement easy to fake, and it is backed by a frozen text benchmark (AffectBench) and a full write-up.
- Run `make setup` then `make test` to exercise the deterministic core with no model downloads, then `make setup-ml` and `make app` to launch the web app at http://127.0.0.1:8000 and generate emotion-conditioned images interactively.
- `make smoke` is a quick end-to-end run (2 subjects, 1 seed); `make pilot` reproduces the committed CPU pilot (n=14 per tier).
- Build the text benchmark with `make benchmark` (AffectBench from GoEmotions), then run the text-conditioned track with `make text`.
- For a public deployment, bind explicitly with `NOVA_PUBLIC=1` and set `NOVA_API_TOKEN`, `NOVA_RATE_LIMIT`, and `NOVA_MAX_CONCURRENCY`; the app binds 127.0.0.1 by default.
A CPU pilot: 256-px, 2 content subjects, single seed, n=14 per conditioning tier (n=7 for scene), diffusers backend with stabilityai/sd-turbo and openai/clip-vit-base-patch32.
This pilot does not report a controllability score; it calibrates the instrument. The committed pilot is an honest null: no conditioning tier beats the shuffled-label control, so recovery is statistically indistinguishable from chance label agreement. The protocol, floors, and diagnostics all run end to end; the binding limitation is the probe.
| Condition | Recovery acc [95% CI] | Macro-F1 | Valence rho [95% CI] | Arousal rho [95% CI] | CLIP-T | Shuffled-label p | n | Reading |
|---|---|---|---|---|---|---|---|---|
| raw (neg. control) | 0.143 [0.00, 0.357] | 0.038 | 0.076 [-0.47, 0.55] | 0.546 [-0.00, 0.90] | 0.280 | 0.857 | 14 | sits exactly at chance (1/7) |
| emotion | 0.214 [0.00, 0.429] | 0.112 | 0.474 [-0.03, 0.78] | 0.474 [-0.03, 0.80] | 0.274 | 0.226 | 14 | not above the circularity baseline |
| affect | 0.214 [0.00, 0.429] | 0.133 | 0.241 [-0.30, 0.65] | 0.618 [0.19, 0.86] | 0.270 | 0.137 | 14 | not above the circularity baseline |
| scene (pos. control) | 0.286 [0.00, 0.575] | 0.184 | 0.414 [-0.63, 1.00] | 0.582 [-0.43, 0.99] | – | 0.145 | 7 | highest, but still n.s. |
Chance = 0.143 (1/7); majority-class baseline = 0.143. Probe health: 2/7 labels used, neutral predicted for ~90% of items (majority_rate 0.9048).
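
The probe-health figures above (labels used and majority rate) can be computed from nothing more than the list of predicted labels. The following is a minimal sketch, not the repository's implementation: `majority_rate` is the field name reported above, while the function signature, the other dictionary keys, and the toy predictions are assumptions made for illustration.

```python
from collections import Counter

# Six Ekman emotions plus neutral (seven labels), as used by the benchmark.
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def probe_health(predictions, labels=LABELS):
    """Summarize how degenerate a probe's predictions are.

    Hypothetical helper: only the `majority_rate` name comes from the
    README; the remaining keys are illustrative.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top_label, top_count = counts.most_common(1)[0]
    return {
        "labels_used": len(counts),        # label diversity
        "n_labels": len(labels),
        "majority_label": top_label,
        "majority_rate": top_count / len(predictions),  # collapse rate
    }


# Made-up predictions (not the pilot's outputs): a probe that answers
# "neutral" nine times out of ten.
toy_predictions = ["neutral"] * 9 + ["joy"]
print(probe_health(toy_predictions))
```

A probe with a majority rate near 1.0 will score at the majority-class baseline whatever the images contain, which is why the diagnostic is reported beside every accuracy.
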
Paired contrasts (bootstrap on per-item recovery):
| Contrast | Delta accuracy | 95% CI | p |
|---|---|---|---|
| emotion vs raw | +0.071 | [0.000, 0.214] | 0.255 |
| affect vs emotion | +0.000 | [0.000, 0.000] | 1.000 |
| affect vs raw | +0.071 | [0.000, 0.214] | 0.255 |
- The `raw` negative control sits exactly at chance (1/7), and `scene` at 0.286 is the highest tier but still not significant.
- The CLIP ViT-B/32 probe collapses in-domain onto `neutral`, predicting it for ~90% of generated scenes and using only 2 of 7 labels.
- The one apparent lift is a single image and not significant: emotion over raw is +0.071 (95% CI [0.000, 0.214], p=0.255), and affect adds nothing over emotion (delta 0.000, p=1.000).
- Out of domain on faces (n=200), CLIP recovers the Ekman emotion at only 29.0% accuracy (macro-F1 0.22), usable for neutral (recall 0.81) and anger (0.61) but near-random for surprise, fear, sadness, and disgust.
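
To make the paired contrasts concrete, here is a minimal sketch of a paired bootstrap on per-item recovery, assuming each item contributes a 0/1 correctness flag under both tiers and that items are resampled jointly. The function name, the number of resamples, and the toy vectors are assumptions, not the code in `eval/metrics.py`; how the repository derives its p-values is not shown here.

```python
import numpy as np


def paired_bootstrap_delta(correct_a, correct_b, n_boot=2000, seed=0):
    """Accuracy difference (a minus b) with a percentile 95% CI.

    Items are resampled as pairs, so each bootstrap replicate keeps the
    per-item pairing between the two tiers.
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape or a.ndim != 1 or a.size == 0:
        raise ValueError("expected two equal-length, non-empty 1-D vectors")
    diff = a - b                                    # per-item contrast
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot = diff[idx].mean(axis=1)                   # one delta per resample
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return float(diff.mean()), (float(lo), float(hi))


# Illustrative vectors only (not the pilot's per-item data): 14 items,
# the two tiers disagree on exactly one of them.
raw_correct = [1, 1] + [0] * 12
emotion_correct = [1, 1, 1] + [0] * 11

delta, (lo, hi) = paired_bootstrap_delta(emotion_correct, raw_correct)
print(f"delta={delta:+.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

With n=14, one item flipping moves accuracy by 1/14, so a lift of that size cannot be separated from zero by any interval built on these data.
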
The full write-up, figures, and confidence intervals are in paper/paper.md.
- Detect: a DistilRoBERTa classifier (`affect/analyzer.py`) scores the six Ekman emotions plus neutral (seven labels) in the input text.
- Ground: valence and arousal are estimated from an affect lexicon (`affect/lexicon.py`) and blended with the emotion's circumplex prior by lexical coverage c: v = c·v_lex + (1−c)·v_prior, so affect is measured from text rather than read from a constant.
- Condition: image content stays independent of the emotion; emotion enters only as a modifier over four tiers (raw → naive → emotion → affect). The tiers are the ablation, so recovery is attributable to the conditioning, not a canned scene.
- Generate: Stable Diffusion Turbo renders the image from a fixed, paired-per-item seed through a common `ImageBackend` (null for tests, diffusers for local, hf-api for hosted).
- Recover: a swappable probe (default `CLIPProbe`, `eval/probes.py`) reads the emotion and graded valence/arousal back from the image; recovery only counts when it clears the majority-class baseline and the shuffled-label control with a non-degenerate probe.
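
The blend in the Ground step can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `affect/lexicon.py`: coverage is taken here to be the fraction of tokens found in the lexicon, the lexicon value is the mean over matched tokens, and every number in the toy lexicon and prior is made up.

```python
def lexical_coverage(tokens, lexicon):
    """Fraction of tokens present in the lexicon (assumed definition of c)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)


def blend(lexicon_value, prior_value, coverage):
    """v = c * v_lex + (1 - c) * v_prior, with c clamped to [0, 1] by contract."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    return coverage * lexicon_value + (1.0 - coverage) * prior_value


# Hypothetical valence lexicon and circumplex prior, for illustration only.
toy_valence = {"storm": -0.4, "terrifying": -0.8}
toy_prior_valence = -0.7          # stand-in prior for the detected emotion

tokens = ["the", "storm", "was", "terrifying"]
c = lexical_coverage(tokens, toy_valence)
matched = [toy_valence[t] for t in tokens if t in toy_valence]
v_lex = sum(matched) / len(matched) if matched else 0.0
v = blend(v_lex, toy_prior_valence, c)
print(f"coverage={c:.2f}, v_lex={v_lex:.2f}, blended valence={v:.2f}")
```

At zero coverage the result falls back to the prior, and at full coverage it is the lexicon estimate alone, which is the behaviour the formula is meant to give.
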
- Decoupled content. The depicted subject is never chosen by the emotion (a 20-subject neutral content bank, `data/content_bank.txt`), so any signal must come through the modifier, not the scene.
- Two floors bound the claim. `raw` is the negative control (no emotion, should sit at chance 1/7) and `scene` is the positive control and template ceiling, so scene > raw checks the instrument is not blind.
- Shuffled-label control. A one-sided permutation test (n=2000) of each tier against randomly reassigned targets quantifies the circularity baseline directly; recovery is evidence only when it clears this null.
- Probe-collapse diagnostic. Every run emits `probe_health` (label diversity and majority-collapse rate) and reports the majority-class baseline beside each accuracy, because a collapsed probe scores at chance on raw trivially.
- Probe validation as a known-error instrument. Out-of-domain faces (`make validate-probe`, n=200) record 29.0%; in-domain EmoSet scenes (`make validate-probe-scene`) record the operating ceiling; an independent non-CLIP probe (`--probe hf`, `make robustness`) and a human study (`eval/human_study.py`, Cohen's kappa) slot into the same interface.
- Pure-numpy, unit-tested statistics. Accuracy, macro-F1, confusion, Pearson r and Spearman rho with bootstrap 95% CIs, paired bootstrap contrasts, and Cohen's kappa (`eval/metrics.py`).
- Full provenance. `results.json` logs git SHA, Python and library versions, device, dtype, model and dataset revisions, and the benchmark hash; tables and figures are regenerated by scripts, never hand-written.
- AffectBench hygiene. Single-label GoEmotions mapped to seven Ekman classes from the test split, with within-sample and cross-split dedup so no eval sentence could have been trained on; the build records realized per-class counts and a balanced flag.
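
The shuffled-label control described above can be sketched as a one-sided permutation test on accuracy. This is a minimal illustration, not the repository's code: the function name is hypothetical, and the add-one correction in the p-value is a common convention assumed here rather than something the write-up states.

```python
import numpy as np


def shuffled_label_p(targets, predictions, n_perm=2000, seed=0):
    """One-sided permutation p-value for recovery accuracy.

    Targets are randomly reassigned across items; the p-value is the share
    of shuffles whose accuracy is at least the observed accuracy.
    """
    targets = np.asarray(targets)
    predictions = np.asarray(predictions)
    if targets.shape != predictions.shape or targets.size == 0:
        raise ValueError("expected two equal-length, non-empty vectors")
    observed = float(np.mean(targets == predictions))
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(targets)
        if np.mean(shuffled == predictions) >= observed:
            hits += 1
    # Add-one correction (assumed convention) keeps p strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


labels = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
targets = labels * 2                      # 14 balanced toy targets

# A fully collapsed probe: every shuffle scores the same, so p is 1.
acc, p = shuffled_label_p(targets, ["neutral"] * 14)
print(f"collapsed probe: acc={acc:.3f}, p={p:.3f}")

# A perfect probe: almost no shuffle matches it, so p is small.
acc, p = shuffled_label_p(targets, targets)
print(f"perfect probe:   acc={acc:.3f}, p={p:.4f}")
```

The collapsed case shows why the control and the probe-collapse diagnostic are read together: a probe that always answers one label can never clear the shuffled-label null, however the images were conditioned.
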
| Layer | Tools |
|---|---|
| ML / NLP | PyTorch, Hugging Face Transformers, Diffusers (SD-Turbo), CLIP (ViT-B/32), DistilRoBERTa emotion classifier |
| Application | Python, Flask (server.py), Gradio (app.py), flask-cors |
| Data / research | NumPy, Pillow, pydantic / pydantic-settings, matplotlib, Hugging Face datasets (GoEmotions, EmoSet) |
| Tooling / CI | pytest, ruff (lint + format), mypy, gitleaks, pip-audit, GitHub Actions (Python 3.9-3.12), Docker, uv |
```bash
make setup # core + dev deps (deterministic core, tests, lint)
make test # the test suite, no models needed (runs in seconds)
make lint # ruff check + format

# Run the real models (downloads SD-Turbo, CLIP, the emotion classifier):
make setup-ml
make app # launch the web app at http://127.0.0.1:8000
make smoke # quick end-to-end run (2 subjects, 1 seed)

# Reproduce the paper artifacts:
make pilot # the committed CPU pilot (256-px, 2 subjects, 1 seed -> n=14)
make reproduce # canonical content-track run, 512-px, 3 seeds (needs a GPU box)
make validate-probe # probe error on faces (out-of-domain proxy)
make validate-probe-scene # in-domain probe ceiling on EmoSet scenes
make paper # regenerate the paper tables/figures from results/paper/results.json

# Exact paper environment:
uv pip install -r requirements.lock
```
Requires Python 3.9 to 3.12 (all tested in CI). The pilot results and figures are committed under `results/paper/`, so `make paper` and the 112 tests run without re-downloading models or raw data.
- A recovery probe that actually reads generated scenes, validating a stronger or independent probe (`--probe hf`, ViT-L/14) on EmoSet before trusting any recovery number
- The powered run (`make reproduce`: 512-px, 20 subjects, 3 seeds, n=420) on a GPU box, once a non-degenerate probe clears the in-domain ceiling
- Scale the human study to 3+ raters and report Cohen's kappa against the probe
- Cross-system comparison across generators, styles, and probes to turn this into a ranking benchmark, including EmoGen, EmotiCrafter, and CoEmoGen
- Mixed and compound emotions beyond the Ekman set
- Setup, test, and reproduction steps: CONTRIBUTING.md
- Participation is governed by the Code of Conduct
- Vulnerabilities go through the security policy

