204 changes: 204 additions & 0 deletions docs/source/datasets/clevr.rst
@@ -0,0 +1,204 @@
CLEVR
=====

.. raw:: html

   <p style="display: flex; gap: 10px;">
       <img src="https://img.shields.io/badge/Task-Visual%20Reasoning-blue" alt="Task: Visual Reasoning">
       <img src="https://img.shields.io/badge/Images-100%2C000-green" alt="Images: 100,000">
       <img src="https://img.shields.io/badge/Questions-~865%2C000-orange" alt="Questions: ~865,000">
       <img src="https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey" alt="License: CC BY 4.0">
   </p>

Overview
--------

**CLEVR** (Compositional Language and Elementary Visual Reasoning) is a diagnostic dataset for testing a broad range of visual reasoning abilities. It contains synthetic images of simple 3D objects — cubes, spheres, and cylinders — rendered in varying colors, materials, and sizes, alongside automatically generated question-answer pairs designed to probe specific reasoning skills with minimal dataset bias.

CLEVR questions span five reasoning types:

- **Attribute identification**: "What color is the large cube?"
- **Counting**: "How many objects are either small cylinders or red things?"
- **Comparison**: "Is the sphere the same size as the metal cube?"
- **Spatial relationships**: "What size is the cylinder that is left of the brown metal thing?"
- **Logical operations**: "Are there an equal number of large things and metal spheres?"

Split sizes:

- **Train**: 70,000 images · 699,989 questions · scene graphs · functional programs
- **Val**: 15,000 images · 149,991 questions · scene graphs · functional programs
- **Test**: 15,000 images · 14,988 questions (no answers or scene annotations)

.. image:: teasers/clevr_teaser.png
   :align: center
   :width: 90%

Data Structure
--------------

Indexing the dataset with ``ds[i]`` returns a dictionary with the following keys:

.. list-table::
   :header-rows: 1
   :widths: 20 20 60

   * - Key
     - Type
     - Description
   * - ``image``
     - ``PIL.Image.Image``
     - RGB rendered scene image (480×320 pixels, width × height)
   * - ``image_filename``
     - ``str``
     - Original filename (e.g., ``"CLEVR_train_000000.png"``)
   * - ``image_index``
     - ``int``
     - Zero-based index of the image within its split
   * - ``scene_json``
     - ``str``
     - JSON string with the ground-truth scene graph (objects, attributes, spatial relations). Set to the empty JSON object ``{}`` for the test split.
   * - ``questions_json``
     - ``str``
     - JSON string containing the list of questions associated with this image. Test-split questions omit the ``answer`` and ``program`` fields.

Scene JSON Structure
--------------------

``scene_json`` decodes to a dict with the following shape:

.. code-block:: python

    {
        "image_index": 0,
        "image_filename": "CLEVR_train_000000.png",
        "objects": [
            {
                "color": "blue",         # gray | blue | brown | yellow | red | green | purple | cyan
                "material": "rubber",    # rubber | metal
                "shape": "sphere",       # cube | sphere | cylinder
                "size": "large",         # small | large
                "3d_coords": [x, y, z],
                "pixel_coords": [x, y, z],  # pixel x, pixel y, and depth
                "rotation": 315.0        # degrees
            },
            ...
        ],
        "relationships": {
            # Adjacency lists: relationships["left"][i] holds the indices of
            # the objects that are left of objects[i].
            "left": [[...], ...],
            "right": [[...], ...],
            "front": [[...], ...],
            "behind": [[...], ...]
        },
        "directions": {
            "left": [x, y, z], "right": [x, y, z],
            "front": [x, y, z], "behind": [x, y, z],
            "below": [x, y, z], "above": [x, y, z]
        }
    }
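
As a rough illustration of how the ``relationships`` adjacency lists can be read, the sketch below looks up which objects are left of a given object. The scene is a hand-written toy example (not real CLEVR data), and ``describe`` and ``related_objects`` are hypothetical helper names that are not part of ``stable_datasets``.

```python
import json

# Toy scene with the same shape as a decoded scene_json (values are made up).
toy_scene_json = json.dumps({
    "objects": [
        {"color": "blue", "material": "rubber", "shape": "sphere", "size": "large"},
        {"color": "red", "material": "metal", "shape": "cube", "size": "small"},
        {"color": "green", "material": "rubber", "shape": "cylinder", "size": "small"},
    ],
    "relationships": {
        "left": [[1], [], [0, 1]],
        "right": [[2], [0, 2], []],
        "front": [[], [], []],
        "behind": [[], [], []],
    },
})


def describe(obj):
    """Format an object dict as 'size color material shape'."""
    return f"{obj['size']} {obj['color']} {obj['material']} {obj['shape']}"


def related_objects(scene, relation, index):
    """Describe every object that stands in `relation` to objects[index]."""
    neighbours = scene["relationships"][relation][index]
    return [describe(scene["objects"][j]) for j in neighbours]


scene = json.loads(toy_scene_json)
left_of_cylinder = related_objects(scene, "left", 2)
print(left_of_cylinder)
```

With a real sample, ``scene`` would instead come from ``json.loads(sample["scene_json"])`` on the train or val split.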

Questions JSON Structure
------------------------

``questions_json`` decodes to a list of question dicts:

.. code-block:: python

    [
        {
            "image_index": 0,
            "image_filename": "CLEVR_train_000000.png",
            "question": "How many blue cubes are there?",
            "answer": "2",                      # omitted in test split
            "question_family_index": 12,
            "program": [                        # omitted in test split
                {"function": "scene", "inputs": [], "value_inputs": []},
                {"function": "filter_color", "inputs": [0], "value_inputs": ["blue"]},
                {"function": "filter_shape", "inputs": [1], "value_inputs": ["cube"]},
                {"function": "count", "inputs": [2], "value_inputs": []}
            ]
        },
        ...
    ]

Programs are stored as topologically sorted lists of functions. Each function may consume outputs from earlier functions (referenced by index in ``inputs``) and/or literal string values (in ``value_inputs``).
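
Because each step only refers to earlier steps, a program can be evaluated in a single left-to-right pass. The following is a minimal sketch under stated assumptions: ``run_program`` is a hypothetical helper, it handles only ``scene``, the ``filter_*`` functions, and ``count``, and it passes object dicts between steps for readability. The toy object list is made up; the program is the one shown in the example above.

```python
def run_program(program, objects):
    """Evaluate a CLEVR-style program over a list of object dicts.

    Only a few functions are supported in this sketch; anything else raises.
    """
    outputs = []  # outputs[i] is the result of program[i]
    for step in program:
        name = step["function"]
        args = [outputs[i] for i in step["inputs"]]  # results of earlier steps
        values = step["value_inputs"]                # literal string arguments
        if name == "scene":
            result = list(objects)
        elif name.startswith("filter_"):
            attribute = name[len("filter_"):]        # e.g. "color" or "shape"
            result = [obj for obj in args[0] if obj[attribute] == values[0]]
        elif name == "count":
            result = len(args[0])
        else:
            raise NotImplementedError(f"unsupported function: {name}")
        outputs.append(result)
    return outputs[-1]


# Made-up objects: two blue cubes, one red cube, one blue sphere.
toy_objects = [
    {"color": "blue", "shape": "cube"},
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]

program = [
    {"function": "scene", "inputs": [], "value_inputs": []},
    {"function": "filter_color", "inputs": [0], "value_inputs": ["blue"]},
    {"function": "filter_shape", "inputs": [1], "value_inputs": ["cube"]},
    {"function": "count", "inputs": [2], "value_inputs": []},
]

answer = run_program(program, toy_objects)
print(answer)
```

Note that answers in ``questions_json`` are strings, so a comparison against the stored answer would use ``str(answer)``.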

Usage Example
-------------

**Basic Usage**

.. code-block:: python

    from stable_datasets.images.clevr import CLEVR

    # The first run downloads the archive and prepares the cache (~18 GB);
    # subsequent runs load from the cache.
    ds = CLEVR(split="train")

    # Pass split=None to receive a DatasetDict with all available splits
    ds_all = CLEVR(split=None)

    sample = ds[0]
    print(sample.keys())  # keys: image, image_filename, image_index, scene_json, questions_json

**Inspecting Scene and Questions**

.. code-block:: python

    import json

    from stable_datasets.images.clevr import CLEVR

    ds = CLEVR(split="val")
    sample = ds[0]

    scene = json.loads(sample["scene_json"])
    print(f"Objects in scene: {len(scene['objects'])}")
    for obj in scene["objects"]:
        print(f"  {obj['size']} {obj['color']} {obj['material']} {obj['shape']}")

    questions = json.loads(sample["questions_json"])
    for q in questions[:3]:
        print(f"Q: {q['question']}")
        print(f"A: {q['answer']}")
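
Since each example bundles all questions for one image, training code often needs one record per question. The sketch below flattens rows into ``(image_filename, question, answer)`` tuples. ``iter_qa_pairs`` is a hypothetical helper, and the rows are hand-written stand-ins with only the fields the helper reads; in practice the rows would come from iterating over the dataset.

```python
import json


def iter_qa_pairs(rows):
    """Yield (image_filename, question, answer) for every question in every row."""
    for row in rows:
        for q in json.loads(row["questions_json"]):
            # .get() yields None when "answer" is absent, as in the test split.
            yield row["image_filename"], q["question"], q.get("answer")


# Made-up rows that mimic the dataset's per-image layout.
toy_rows = [
    {
        "image_filename": "CLEVR_val_000000.png",
        "questions_json": json.dumps([
            {"question": "How many blue cubes are there?", "answer": "2"},
            {"question": "What color is the large cube?", "answer": "blue"},
        ]),
    },
    {
        "image_filename": "CLEVR_test_000000.png",
        "questions_json": json.dumps([
            {"question": "Is the sphere the same size as the metal cube?"},
        ]),
    },
]

pairs = list(iter_qa_pairs(toy_rows))
for filename, question, answer in pairs:
    print(filename, "|", question, "|", answer)
```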

**Test Split (no answers)**

.. code-block:: python

    import json

    from stable_datasets.images.clevr import CLEVR

    ds_test = CLEVR(split="test")
    sample = ds_test[0]

    # scene_json is the string "{}" (an empty JSON object) for the test split
    # questions lack the "answer" and "program" keys
    questions = json.loads(sample["questions_json"])
    print(questions[0].keys())  # keys: image_index, image_filename, question, question_family_index

Related Datasets
----------------

- :doc:`clevrer`: CLEVRER — a video extension of CLEVR that adds temporal and causal reasoning questions over collision events

References
----------

- Official website: https://cs.stanford.edu/people/jcjohns/clevr/
- Paper (arXiv): https://arxiv.org/abs/1612.06890
- Dataset generation code: https://github.com/facebookresearch/clevr-dataset-gen
- License: `Creative Commons CC BY 4.0 <https://creativecommons.org/licenses/by/4.0/>`_

Citation
--------

.. code-block:: bibtex

    @inproceedings{johnson2017clevr,
        title     = {CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning},
        author    = {Johnson, Justin and Hariharan, Bharath and van der Maaten, Laurens and
                     Fei-Fei, Li and Zitnick, C. Lawrence and Girshick, Ross},
        booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
        pages     = {2901--2910},
        year      = {2017}
    }
1 change: 1 addition & 0 deletions docs/source/datasets/index.rst
@@ -51,6 +51,7 @@ Available Datasets
cifar10_c
cifar100_c
cars196
clevr
dtd
fashion_mnist
k_mnist
Binary file added docs/source/datasets/teasers/clevr_teaser.png
2 changes: 2 additions & 0 deletions stable_datasets/images/__init__.py
@@ -10,6 +10,7 @@
from .cifar10_c import CIFAR10C
from .cifar100 import CIFAR100
from .cifar100_c import CIFAR100C
from .clevr import CLEVR
from .clevrer import CLEVRER
from .country211 import Country211
from .cub200 import CUB200
@@ -58,6 +59,7 @@
"CIFAR10C",
"CIFAR100",
"CIFAR100C",
"CLEVR",
"CLEVRER",
"Country211",
"CUB200",
119 changes: 119 additions & 0 deletions stable_datasets/images/clevr.py
@@ -0,0 +1,119 @@
import io
import json
import zipfile

from PIL import Image as PILImage

from stable_datasets.schema import DatasetInfo, Features, Image, Value, Version
from stable_datasets.utils import BaseDatasetBuilder


class CLEVR(BaseDatasetBuilder):
    """CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning.

    CLEVR contains 100,000 rendered images of simple 3D objects (cubes, spheres, cylinders)
    with varying colors, materials, sizes, and positions. Each image is paired with ground-truth
    scene metadata (object attributes and spatial relations) and automatically generated
    question-answer pairs that test a range of visual reasoning skills.

    Splits:
        - train: 70,000 images with scene and question annotations
        - val: 15,000 images with scene and question annotations
        - test: 15,000 images with questions only (no answers, no scene data)
    """

    VERSION = Version("1.0.0")

    SOURCE = {
        "homepage": "https://cs.stanford.edu/people/jcjohns/clevr/",
        "assets": {
            "train": "https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0.zip",
            "val": "https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0.zip",
            "test": "https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0.zip",
        },
        "citation": """@inproceedings{johnson2017clevr,
                        title={CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning},
                        author={Johnson, Justin and Hariharan, Bharath and van der Maaten, Laurens and Fei-Fei, Li and Zitnick, C Lawrence and Girshick, Ross},
                        booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
                        pages={2901--2910},
                        year={2017}
                        }""",
    }

    # Number of images per split, used to iterate image indices.
    _SPLIT_SIZES = {"train": 70000, "val": 15000, "test": 15000}

    def _info(self):
        return DatasetInfo(
            description="""CLEVR is a diagnostic dataset for compositional language and elementary visual
                           reasoning. It contains 100,000 images of 3D-rendered objects (cubes, spheres,
                           cylinders) in varying colors, materials, and sizes, alongside ground-truth scene
                           graphs and question-answer pairs that test counting, comparison, querying, and
                           spatial reasoning.""",
            features=Features(
                {
                    "image": Image(),
                    "image_filename": Value("string"),
                    "image_index": Value("int32"),
                    # JSON-serialised scene dict (keys: objects, relationships, directions, etc.)
                    # Empty JSON object '{}' for the test split, which has no scene annotations.
                    "scene_json": Value("string"),
                    # JSON-serialised list of question dicts for this image.
                    # Test-split questions omit 'answer' and 'program' fields.
                    "questions_json": Value("string"),
                }
            ),
            supervised_keys=None,
            homepage=self.SOURCE["homepage"],
            citation=self.SOURCE["citation"],
        )

    def _generate_examples(self, data_path, split):
        """Generate examples from the CLEVR_v1.0.zip archive."""
        with zipfile.ZipFile(data_path, "r") as zf:
            # ------------------------------------------------------------------
            # 1. Scene annotations (train / val only; test has none)
            # ------------------------------------------------------------------
            scene_lookup = {}
            if split in ("train", "val"):
                scene_path = f"CLEVR_v1.0/scenes/CLEVR_{split}_scenes.json"
                with zf.open(scene_path) as f:
                    scenes_data = json.load(f)
                for scene in scenes_data["scenes"]:
                    scene_lookup[scene["image_index"]] = scene

            # ------------------------------------------------------------------
            # 2. Question annotations (all splits; test lacks answers/programs)
            # ------------------------------------------------------------------
            question_path = f"CLEVR_v1.0/questions/CLEVR_{split}_questions.json"
            with zf.open(question_path) as f:
                questions_data = json.load(f)

            questions_by_image = {}
            for q in questions_data["questions"]:
                questions_by_image.setdefault(q["image_index"], []).append(q)

            # ------------------------------------------------------------------
            # 3. Yield one example per image
            # ------------------------------------------------------------------
            num_images = self._SPLIT_SIZES[split]
            for image_index in range(num_images):
                image_filename = f"CLEVR_{split}_{image_index:06d}.png"
                image_zip_path = f"CLEVR_v1.0/images/{split}/{image_filename}"

                with zf.open(image_zip_path) as img_file:
                    image = PILImage.open(io.BytesIO(img_file.read())).convert("RGB")

                scene = scene_lookup.get(image_index, {})
                questions = questions_by_image.get(image_index, [])

                yield (
                    image_index,
                    {
                        "image": image,
                        "image_filename": image_filename,
                        "image_index": image_index,
                        "scene_json": json.dumps(scene),
                        "questions_json": json.dumps(questions),
                    },
                )