NovaVision

NovaVision generates an image from the emotion of a sentence, then checks whether the image actually conveys that emotion: a swappable probe recovers the emotion from the image, and the result is compared with the emotion you asked for. One pipeline serves two purposes, a text-to-art web app and an emotion-controllability benchmark. The benchmark is built around the failure modes that make such a measurement easy to fake, and it is backed by a frozen text benchmark (AffectBench) and a full write-up.

Using it

Run make setup, then make test, to exercise the deterministic core with no model downloads. Then run make setup-ml and make app to launch the web app at http://127.0.0.1:8000 and generate emotion-conditioned images interactively.
make smoke is a quick end-to-end run (2 subjects, 1 seed); make pilot reproduces the committed CPU pilot (n=14 per tier).
Build the text benchmark with make benchmark (AffectBench from GoEmotions), then run the text-conditioned track with make text.
For a public deployment, bind explicitly with NOVA_PUBLIC=1 and set NOVA_API_TOKEN, NOVA_RATE_LIMIT, and NOVA_MAX_CONCURRENCY; the app binds 127.0.0.1 by default.

Results

A CPU pilot: 256-px, 2 content subjects, single seed, n=14 per conditioning tier (n=7 for scene), diffusers backend with stabilityai/sd-turbo and openai/clip-vit-base-patch32.

This pilot does not report a controllability score; it calibrates the instrument. The committed pilot is an honest null: no conditioning tier beats the shuffled-label control, so recovery is statistically indistinguishable from chance label agreement. The protocol, floors, and diagnostics all run end to end; the binding limitation is the probe.

| Condition | Recovery acc [95% CI] | Macro-F1 | Valence rho [95% CI] | Arousal rho [95% CI] | CLIP-T | Shuffled-label p | n | Reading |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `raw` (neg. control) | 0.143 [0.00, 0.357] | 0.038 | 0.076 [-0.47, 0.55] | 0.546 [-0.00, 0.90] | 0.280 | 0.857 | 14 | sits exactly at chance (1/7) |
| `emotion` | 0.214 [0.00, 0.429] | 0.112 | 0.474 [-0.03, 0.78] | 0.474 [-0.03, 0.80] | 0.274 | 0.226 | 14 | not above the circularity baseline |
| `affect` | 0.214 [0.00, 0.429] | 0.133 | 0.241 [-0.30, 0.65] | 0.618 [0.19, 0.86] | 0.270 | 0.137 | 14 | not above the circularity baseline |
| `scene` (pos. control) | 0.286 [0.00, 0.575] | 0.184 | 0.414 [-0.63, 1.00] | 0.582 [-0.43, 0.99] | – | 0.145 | 7 | highest, but still n.s. |

Chance = 0.143 (1/7); majority-class baseline = 0.143. Probe health: 2/7 labels used, neutral predicted for ~90% of items (majority_rate 0.9048).
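
The probe-health figures above can be computed from nothing more than the list of predicted labels. The following is a minimal sketch, not the project's implementation: the helper names `probe_health` and `majority_class_baseline`, the label spelling `joy`, and the toy data (42 predictions chosen so the majority rate comes out at 0.9048) are all assumptions made for illustration.

```python
from collections import Counter

# Assumed spelling of the seven labels (six Ekman emotions plus neutral).
EKMAN7 = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def probe_health(predictions, labels=EKMAN7):
    """Summarize how collapsed a probe's predictions are (hypothetical helper)."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    counts = Counter(predictions)
    top_label, top_count = counts.most_common(1)[0]
    return {
        "labels_used": len(counts),          # distinct labels the probe emitted
        "n_labels": len(labels),             # size of the label set
        "majority_label": top_label,
        "majority_rate": top_count / len(predictions),
    }


def majority_class_baseline(targets):
    """Accuracy obtained by always predicting the most frequent target."""
    counts = Counter(targets)
    return counts.most_common(1)[0][1] / len(targets)


# Toy data: a probe that says "neutral" 38 times out of 42.
preds = ["neutral"] * 38 + ["joy"] * 4
targets = EKMAN7 * 2  # balanced targets, two items per label

health = probe_health(preds)
print(health["labels_used"], "of", health["n_labels"], "labels used")
print("majority_rate:", round(health["majority_rate"], 4))
print("majority-class baseline:", round(majority_class_baseline(targets), 3))
```

With balanced targets the majority-class baseline equals chance (1/7), which is why a probe that answers neutral almost everywhere lands at chance on `raw` without measuring anything.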

Paired contrasts (bootstrap on per-item recovery):

| Contrast | Delta accuracy | 95% CI | p |
| --- | --- | --- | --- |
| emotion vs raw | +0.071 | [0.000, 0.214] | 0.255 |
| affect vs emotion | +0.000 | [0.000, 0.000] | 1.000 |
| affect vs raw | +0.071 | [0.000, 0.214] | 0.255 |
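
A paired bootstrap on per-item recovery can be sketched as follows. This is an illustration under stated assumptions rather than the code in eval/metrics.py: the function name `paired_bootstrap_delta`, the percentile interval, the 2000 resamples, and the toy correctness vectors (one tier recovers a single extra item out of 14) are all hypothetical.

```python
import numpy as np


def paired_bootstrap_delta(correct_a, correct_b, n_boot=2000, seed=0):
    """Bootstrap the accuracy difference a - b, resampling items in pairs.

    correct_a and correct_b are 0/1 vectors over the same items, so each
    resample keeps the two tiers' outcomes for an item together.
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both tiers must be scored on the same items")
    diff = a - b  # per-item difference in correctness
    rng = np.random.default_rng(seed)
    # Each row of idx is one bootstrap resample of item indices.
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    deltas = diff[idx].mean(axis=1)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(diff.mean()), (float(lo), float(hi))


# Toy data: 14 items, tier A is right on one item that tier B misses.
emotion_correct = np.array([1, 1, 1] + [0] * 11)
raw_correct = np.array([1, 1, 0] + [0] * 11)

delta, (lo, hi) = paired_bootstrap_delta(emotion_correct, raw_correct)
print(f"delta={delta:+.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

Resampling the per-item differences, rather than each tier separately, is what makes the contrast paired: items that both tiers get right or wrong contribute zero and only the disagreements move the interval.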

The raw negative control sits exactly at chance (1/7), and scene at 0.286 is the highest tier but still not significant.
The CLIP ViT-B/32 probe collapses in-domain onto neutral, predicting it for ~90% of generated scenes and using only 2 of 7 labels.
The one apparent lift is a single image and not significant: emotion over raw is +0.071 (95% CI [0.000, 0.214], p=0.255), and affect adds nothing over emotion (delta 0.000, p=1.000).
Out of domain on faces (n=200), CLIP recovers the Ekman emotion at only 29.0% accuracy (macro-F1 0.22), usable for neutral (recall 0.81) and anger (0.61) but near-random for surprise, fear, sadness, and disgust.
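
Accuracy and macro-F1 diverge when a probe collapses onto one label, which is why the results report both. The sketch below is a plain NumPy illustration, assuming a hypothetical `macro_f1` helper, an assumed label spelling (`joy`), and the convention that a class with no true or predicted items scores 0; the project's own metric code may differ.

```python
import numpy as np

LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def macro_f1(targets, predictions, labels=LABELS):
    """Unweighted mean of per-class F1 (classes with no support score 0)."""
    t = np.asarray(targets)
    p = np.asarray(predictions)
    scores = []
    for label in labels:
        tp = np.sum((t == label) & (p == label))
        fp = np.sum((t != label) & (p == label))
        fn = np.sum((t == label) & (p != label))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


# Toy data: balanced targets, a probe that always answers "neutral".
targets = LABELS * 2
collapsed = ["neutral"] * len(targets)

accuracy = float(np.mean(np.asarray(targets) == np.asarray(collapsed)))
print("accuracy:", round(accuracy, 3))
print("macro-F1:", round(macro_f1(targets, collapsed), 3))
```

In this toy case the collapsed probe still reaches chance accuracy, but its macro-F1 is far lower because six of the seven classes are never predicted.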

The full write-up, figures, and confidence intervals are in paper/paper.md.

How it works

Detect: a DistilRoBERTa classifier (affect/analyzer.py) scores the six Ekman emotions plus neutral (seven labels) in the input text.
Ground: valence and arousal are estimated from an affect lexicon (affect/lexicon.py) and blended with the emotion's circumplex prior, weighted by lexical coverage c: v = c·v_lex + (1−c)·v_prior. Affect is therefore measured from the text rather than read from a constant.
Condition: image content stays independent of the emotion; emotion enters only as a modifier over four tiers (raw → naive → emotion → affect). The tiers are the ablation, so recovery is attributable to the conditioning, not a canned scene.
Generate: Stable Diffusion Turbo renders the image from a fixed, paired-per-item seed through a common ImageBackend (null for tests, diffusers for local, hf-api for hosted).
Recover: a swappable probe (default CLIPProbe, eval/probes.py) reads the emotion and graded valence/arousal back from the image; recovery only counts when it clears the majority-class baseline and the shuffled-label control with a non-degenerate probe.
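
The blend in the Ground step can be sketched in a few lines. Everything specific here is an assumption for illustration: the toy lexicon entries, the prior values for `fear`, the definition of coverage as the fraction of tokens found in the lexicon, and the function names. Only the formula v = c·v_lex + (1−c)·v_prior comes from the description above.

```python
from statistics import mean

# Hypothetical toy lexicon: token -> (valence, arousal).
TOY_LEXICON = {"storm": (-0.4, 0.7), "terrifying": (-0.8, 0.9)}

# Hypothetical circumplex prior for one emotion label.
TOY_PRIOR = {"fear": (-0.7, 0.8)}


def lexical_coverage(tokens, lexicon):
    """Fraction of tokens found in the lexicon (an assumed definition of c)."""
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


def blend(lexicon_value, prior_value, coverage):
    """v = c * v_lex + (1 - c) * v_prior."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    return coverage * lexicon_value + (1.0 - coverage) * prior_value


def ground(tokens, emotion, lexicon=TOY_LEXICON, prior=TOY_PRIOR):
    """Return blended (valence, arousal) for a tokenized sentence."""
    c = lexical_coverage(tokens, lexicon)
    hits = [lexicon[t] for t in tokens if t in lexicon]
    prior_v, prior_a = prior[emotion]
    if not hits:
        # No lexical evidence: coverage is 0, so the prior is returned as is.
        return prior_v, prior_a
    lex_v = mean(v for v, _ in hits)
    lex_a = mean(a for _, a in hits)
    return blend(lex_v, prior_v, c), blend(lex_a, prior_a, c)


valence, arousal = ground(["the", "storm", "was", "terrifying"], "fear")
print(round(valence, 3), round(arousal, 3))
```

With half the tokens covered, the result sits midway between the lexicon estimate and the prior; a sentence with no lexicon hits falls back to the prior alone.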

Method

Decoupled content. The depicted subject is never chosen by the emotion (a 20-subject neutral content bank, data/content_bank.txt), so any signal must come through the modifier, not the scene.
Two floors bound the claim. raw is the negative control (no emotion, should sit at chance 1/7) and scene is the positive control and template ceiling, so scene > raw checks the instrument is not blind.
Shuffled-label control. A one-sided permutation test (n=2000) of each tier against randomly reassigned targets quantifies the circularity baseline directly; recovery is evidence only when it clears this null.
Probe-collapse diagnostic. Every run emits probe_health (label diversity and majority-collapse rate) and reports the majority-class baseline beside each accuracy, because a collapsed probe scores at chance on raw trivially.
Probe validation as a known-error instrument. Out-of-domain faces (make validate-probe, n=200) record 29.0% accuracy; in-domain EmoSet scenes (make validate-probe-scene) record the operating ceiling; an independent non-CLIP probe (--probe hf, make robustness) and a human study (eval/human_study.py, Cohen's kappa) slot into the same interface.
Pure-numpy, unit-tested statistics. Accuracy, macro-F1, confusion, Pearson r and Spearman rho with bootstrap 95% CIs, paired bootstrap contrasts, and Cohen's kappa (eval/metrics.py).
Full provenance. results.json logs git SHA, Python and library versions, device, dtype, model and dataset revisions, and the benchmark hash; tables and figures are regenerated by scripts, never hand-written.
AffectBench hygiene. Single-label GoEmotions examples from the test split are mapped to seven Ekman classes, with within-sample and cross-split deduplication so that no evaluation sentence also appears in a training split; the build records realized per-class counts and a balanced flag.
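
The shuffled-label control described above can be sketched as a one-sided permutation test. This is a minimal illustration, not the project's code: the function name `shuffled_label_p`, the add-one correction on the p-value, and the toy label vectors are assumptions; only the one-sided design and the 2000 permutations come from the text.

```python
import numpy as np

LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def shuffled_label_p(targets, predictions, n_perm=2000, seed=0):
    """One-sided p-value for accuracy against randomly reassigned targets."""
    t = np.asarray(targets)
    p = np.asarray(predictions)
    rng = np.random.default_rng(seed)
    observed = float(np.mean(t == p))
    # Null distribution: accuracy after shuffling which target goes with which item.
    null = np.array([np.mean(rng.permutation(t) == p) for _ in range(n_perm)])
    # Add-one correction (an assumption) keeps the p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, float(p_value)


targets = LABELS * 2  # balanced toy targets, 14 items

# A collapsed probe: every shuffle gives the same accuracy, so p is 1.
obs_c, p_c = shuffled_label_p(targets, ["neutral"] * 14)

# A perfect probe: almost no shuffle matches it, so p is small.
obs_p, p_p = shuffled_label_p(targets, targets)

print(f"collapsed: acc={obs_c:.3f} p={p_c:.3f}")
print(f"perfect:   acc={obs_p:.3f} p={p_p:.4f}")
```

The collapsed case shows why the control matters: a probe that always predicts one label reaches chance accuracy but can never clear the shuffled-label null, whatever the generator did.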

Tech stack

| Layer | Tools |
| --- | --- |
| ML / NLP | PyTorch, Hugging Face Transformers, Diffusers (SD-Turbo), CLIP (ViT-B/32), DistilRoBERTa emotion classifier |
| Application | Python, Flask (server.py), Gradio (app.py), flask-cors |
| Data / research | NumPy, Pillow, pydantic / pydantic-settings, matplotlib, Hugging Face datasets (GoEmotions, EmoSet) |
| Tooling / CI | pytest, ruff (lint + format), mypy, gitleaks, pip-audit, GitHub Actions (Python 3.9-3.12), Docker, uv |

Reproduce

make setup          # core + dev deps (deterministic core, tests, lint)
make test           # the test suite, no models needed (runs in seconds)
make lint           # ruff check + format

# Run the real models (downloads SD-Turbo, CLIP, the emotion classifier):
make setup-ml
make app            # launch the web app at http://127.0.0.1:8000
make smoke          # quick end-to-end run (2 subjects, 1 seed)

# Reproduce the paper artifacts:
make pilot          # the committed CPU pilot (256-px, 2 subjects, 1 seed -> n=14)
make reproduce      # canonical content-track run, 512-px, 3 seeds (needs a GPU box)
make validate-probe         # probe error on faces (out-of-domain proxy)
make validate-probe-scene   # in-domain probe ceiling on EmoSet scenes
make paper          # regenerate the paper tables/figures from results/paper/results.json

# Exact paper environment:
uv pip install -r requirements.lock

Requires Python 3.9 to 3.12 (all tested in CI). The pilot results and figures are committed under results/paper/, so make paper and the 112 tests run without re-downloading models or raw data.

Future scope

A recovery probe that actually reads generated scenes: validate a stronger or independent probe (--probe hf, ViT-L/14) on EmoSet before trusting any recovery number
The powered run (make reproduce: 512-px, 20 subjects, 3 seeds, n=420) on a GPU box, once a non-degenerate probe clears the in-domain ceiling
Scale the human study to 3+ raters and report Cohen's kappa against the probe
Cross-system comparison across generators, styles, and probes to turn this into a ranking benchmark, including EmoGen, EmotiCrafter, and CoEmoGen
Mixed and compound emotions beyond the Ekman set

Contributing

Setup, test, and reproduction steps: CONTRIBUTING.md
Participation is governed by the Code of Conduct
Vulnerabilities go through the security policy

License

MIT License
