galilai-group · ianhajra · Mar 22, 2026
diff --git a/docs/source/datasets/clevr.rst b/docs/source/datasets/clevr.rst
@@ -0,0 +1,204 @@
+CLEVR
+=====
+
+.. raw:: html
+
+   <p style="display: flex; gap: 10px;">
+   <img src="https://img.shields.io/badge/Task-Visual%20Reasoning-blue" alt="Task: Visual Reasoning">
+   <img src="https://img.shields.io/badge/Images-100%2C000-green" alt="Images: 100,000">
+   <img src="https://img.shields.io/badge/Questions-~865%2C000-orange" alt="Questions: ~865,000">
+   <img src="https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey" alt="License: CC BY 4.0">
+   </p>
+
+Overview
+--------
+
+**CLEVR** (Compositional Language and Elementary Visual Reasoning) is a diagnostic dataset for testing a broad range of visual reasoning abilities. It contains synthetic images of simple 3D objects — cubes, spheres, and cylinders — rendered in varying colors, materials, and sizes, alongside automatically generated question-answer pairs designed to probe specific reasoning skills with minimal dataset bias.
+
+CLEVR questions span five reasoning types:
+
+- **Attribute identification**: "What color is the large cube?"
+- **Counting**: "How many objects are either small cylinders or red things?"
+- **Comparison**: "Is the sphere the same size as the metal cube?"
+- **Spatial relationships**: "What size is the cylinder that is left of the brown metal thing?"
+- **Logical operations**: "Are there an equal number of large things and metal spheres?"
+
+Split sizes:
+
+- **Train**: 70,000 images · 699,989 questions · scene graphs · functional programs
+- **Val**: 15,000 images · 149,991 questions · scene graphs · functional programs
+- **Test**: 15,000 images · 14,988 questions (no answers or scene annotations)
+
+.. image:: teasers/clevr_teaser.png
+   :align: center
+   :width: 90%
+
+Data Structure
+--------------
+
+Indexing the dataset with ``ds[i]`` returns a dictionary with the following keys:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 20 60
+
+   * - Key
+     - Type
+     - Description
+   * - ``image``
+     - ``PIL.Image.Image``
+     - RGB rendered scene image (480×320 pixels, width × height)
+   * - ``image_filename``
+     - ``str``
+     - Original filename (e.g., ``"CLEVR_train_000000.png"``)
+   * - ``image_index``
+     - ``int``
+     - Zero-based index of the image within its split
+   * - ``scene_json``
+     - ``str``
+     - JSON string with the ground-truth scene graph (objects, attributes, spatial relations). Empty ``{}`` for the test split.
+   * - ``questions_json``
+     - ``str``
+     - JSON string containing the list of questions associated with this image. Test-split questions omit ``answer`` and ``program`` fields.
+
+Scene JSON Structure
+--------------------
+
+``scene_json`` decodes to a dict with the following shape:
+
+.. code-block:: python
+
+    {
+        "image_index": 0,
+        "image_filename": "CLEVR_train_000000.png",
+        "objects": [
+            {
+                "color": "blue",         # gray | blue | brown | yellow | red | green | purple | cyan
+                "material": "rubber",    # rubber | metal
+                "shape": "sphere",       # cube | sphere | cylinder
+                "size": "large",         # small | large
+                "3d_coords": [x, y, z],
+                "pixel_coords": [x, y, z],
+                "rotation": 315.0        # degrees
+            },
+            ...
+        ],
+        "relationships": {
+            "left":   [[...], ...],  # adjacency lists: relationships["left"][i] = indices of objects left of objects[i]
+            "right":  [[...], ...],
+            "front":  [[...], ...],
+            "behind": [[...], ...]
+        },
+        "directions": {
+            "left": [x, y, z], "right": [x, y, z],
+            "front": [x, y, z], "behind": [x, y, z],
+            "below": [x, y, z], "above": [x, y, z]
+        }
+    }
+
+Questions JSON Structure
+------------------------
+
+``questions_json`` decodes to a list of question dicts:
+
+.. code-block:: python
+
+    [
+        {
+            "image_index": 0,
+            "image_filename": "CLEVR_train_000000.png",
+            "question": "How many blue cubes are there?",
+            "answer": "2",               # omitted in test split
+            "question_family_index": 12,
+            "program": [                 # omitted in test split
+                {"function": "scene",        "inputs": [],  "value_inputs": []},
+                {"function": "filter_color", "inputs": [0], "value_inputs": ["blue"]},
+                {"function": "filter_shape", "inputs": [1], "value_inputs": ["cube"]},
+                {"function": "count",        "inputs": [2], "value_inputs": []}
+            ]
+        },
+        ...
+    ]
+
+Programs are stored as topologically sorted lists of functions. Each function may consume outputs from earlier functions (referenced by index in ``inputs``) and/or literal string values (in ``value_inputs``).
+
+Usage Example
+-------------
+
+**Basic Usage**
+
+.. code-block:: python
+
+    import json
+    from stable_datasets.images.clevr import CLEVR
+
+    # The first run downloads the archive (~18 GB) and prepares the cache; later runs load from the cache
+    ds = CLEVR(split="train")
+
+    # Pass split=None to receive a DatasetDict with all available splits
+    ds_all = CLEVR(split=None)
+
+    sample = ds[0]
+    print(sample.keys())  # {"image", "image_filename", "image_index", "scene_json", "questions_json"}
+
+**Inspecting Scene and Questions**
+
+.. code-block:: python
+
+    import json
+    from stable_datasets.images.clevr import CLEVR
+
+    ds = CLEVR(split="val")
+    sample = ds[0]
+
+    scene = json.loads(sample["scene_json"])
+    print(f"Objects in scene: {len(scene['objects'])}")
+    for obj in scene["objects"]:
+        print(f"  {obj['size']} {obj['color']} {obj['material']} {obj['shape']}")
+
+    questions = json.loads(sample["questions_json"])
+    for q in questions[:3]:
+        print(f"Q: {q['question']}")
+        print(f"A: {q['answer']}")
+
+**Test Split (no answers)**
+
+.. code-block:: python
+
+    import json
+    from stable_datasets.images.clevr import CLEVR
+
+    ds_test = CLEVR(split="test")
+    sample = ds_test[0]
+
+    # scene_json is the string "{}" (an empty JSON object) for the test split
+    # questions lack "answer" and "program" keys
+    questions = json.loads(sample["questions_json"])
+    print(questions[0].keys())  # {"image_index", "image_filename", "question", "question_family_index"}
+
+Related Datasets
+----------------
+
+- :doc:`clevrer`: CLEVRER — a video extension of CLEVR that adds temporal and causal reasoning questions over collision events
+
+References
+----------
+
+- Official website: https://cs.stanford.edu/people/jcjohns/clevr/
+- Paper (arXiv): https://arxiv.org/abs/1612.06890
+- Dataset generation code: https://github.com/facebookresearch/clevr-dataset-gen
+- License: `Creative Commons CC BY 4.0 <https://creativecommons.org/licenses/by/4.0/>`_
+
+Citation
+--------
+
+.. code-block:: bibtex
+
+    @inproceedings{johnson2017clevr,
+        title     = {CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning},
+        author    = {Johnson, Justin and Hariharan, Bharath and van der Maaten, Laurens and
+                     Fei-Fei, Li and Zitnick, C. Lawrence and Girshick, Ross},
+        booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
+        pages     = {2901--2910},
+        year      = {2017}
+    }
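
Note on the documentation above: the ``relationships`` adjacency lists and the functional ``program`` format may be easier to follow with a small worked example. The sketch below is not part of the PR. It uses a made-up three-object scene in the documented ``scene_json`` layout and the four-step program shown in the docs. The helper names ``related`` and ``run_program`` are hypothetical, and the executor handles only the functions that appear in that example (``scene``, ``filter_*``, ``count``); real CLEVR programs use more functions than this.

```python
import json

# Hypothetical scene in the documented scene_json layout; values are invented.
scene_json = json.dumps({
    "objects": [
        {"color": "blue", "material": "rubber", "shape": "cube", "size": "large"},
        {"color": "red", "material": "metal", "shape": "sphere", "size": "small"},
        {"color": "blue", "material": "metal", "shape": "cube", "size": "small"},
    ],
    "relationships": {
        # relationships["left"][i] lists indices of objects left of objects[i]
        "left": [[], [0], [0, 1]],
        "right": [[1, 2], [2], []],
        "front": [[], [], []],
        "behind": [[], [], []],
    },
})

# The program from the documented questions_json example.
program = [
    {"function": "scene", "inputs": [], "value_inputs": []},
    {"function": "filter_color", "inputs": [0], "value_inputs": ["blue"]},
    {"function": "filter_shape", "inputs": [1], "value_inputs": ["cube"]},
    {"function": "count", "inputs": [2], "value_inputs": []},
]


def describe(obj):
    """Render an object as 'size color material shape'."""
    return f"{obj['size']} {obj['color']} {obj['material']} {obj['shape']}"


def related(scene, relation, i):
    """Describe every object that stands in `relation` to objects[i]."""
    return [describe(scene["objects"][j]) for j in scene["relationships"][relation][i]]


def run_program(scene, program):
    """Evaluate steps in order; each step reads earlier outputs via `inputs`."""
    outputs = []
    for step in program:
        args = [outputs[k] for k in step["inputs"]]
        fn = step["function"]
        if fn == "scene":
            result = list(range(len(scene["objects"])))  # all object indices
        elif fn.startswith("filter_"):
            attr = fn[len("filter_"):]  # e.g. "color" or "shape"
            value = step["value_inputs"][0]
            result = [j for j in args[0] if scene["objects"][j][attr] == value]
        elif fn == "count":
            result = len(args[0])
        else:
            raise NotImplementedError(f"function not covered by this sketch: {fn}")
        outputs.append(result)
    return outputs[-1]


scene = json.loads(scene_json)
print(related(scene, "left", 2))
print(run_program(scene, program))
```

Because the list is topologically sorted, a single forward pass is enough: every index in ``inputs`` refers to a step that has already been evaluated. Answers in ``questions_json`` are stored as strings, so a comparison against this executor's result would need ``str(...)``.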
diff --git a/docs/source/datasets/index.rst b/docs/source/datasets/index.rst
@@ -51,6 +51,7 @@ Available Datasets
    cifar10_c
    cifar100_c
    cars196
+   clevr
    dtd
    fashion_mnist
    k_mnist

diff --git a/docs/source/datasets/teasers/clevr_teaser.png b/docs/source/datasets/teasers/clevr_teaser.png
diff --git a/stable_datasets/images/__init__.py b/stable_datasets/images/__init__.py
@@ -10,6 +10,7 @@
 from .cifar10_c import CIFAR10C
 from .cifar100 import CIFAR100
 from .cifar100_c import CIFAR100C
+from .clevr import CLEVR
 from .clevrer import CLEVRER
 from .country211 import Country211
 from .cub200 import CUB200
@@ -58,6 +59,7 @@
     "CIFAR10C",
     "CIFAR100",
     "CIFAR100C",
+    "CLEVR",
     "CLEVRER",
     "Country211",
     "CUB200",

diff --git a/stable_datasets/images/clevr.py b/stable_datasets/images/clevr.py
@@ -0,0 +1,119 @@
+import io
+import json
+import zipfile
+
+from PIL import Image as PILImage
+
+from stable_datasets.schema import DatasetInfo, Features, Image, Value, Version
+from stable_datasets.utils import BaseDatasetBuilder
+
+
+class CLEVR(BaseDatasetBuilder):
+    """CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning.
+
+    CLEVR contains 100,000 rendered images of simple 3D objects (cubes, spheres, cylinders)
+    with varying colors, materials, sizes, and positions. Each image is paired with ground-truth
+    scene metadata (object attributes and spatial relations) and automatically generated
+    question-answer pairs that test a range of visual reasoning skills.
+
+    Splits:
+        - train: 70,000 images with scene and question annotations
+        - val:   15,000 images with scene and question annotations
+        - test:  15,000 images with questions only (no answers, no scene data)
+    """
+
+    VERSION = Version("1.0.0")
+
+    SOURCE = {
+        "homepage": "https://cs.stanford.edu/people/jcjohns/clevr/",
+        "assets": {
+            "train": "https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0.zip",
+            "val": "https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0.zip",
+            "test": "https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0.zip",
+        },
+        "citation": """@inproceedings{johnson2017clevr,
+            title={CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning},
+            author={Johnson, Justin and Hariharan, Bharath and van der Maaten, Laurens and Fei-Fei, Li and Zitnick, C. Lawrence and Girshick, Ross},
+            booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
+            pages={2901--2910},
+            year={2017}
+        }""",
+    }
+
+    # Number of images per split, used to iterate image indices.
+    _SPLIT_SIZES = {"train": 70000, "val": 15000, "test": 15000}
+
+    def _info(self):
+        return DatasetInfo(
+            description="""CLEVR is a diagnostic dataset for compositional language and elementary visual
+                           reasoning. It contains 100,000 images of 3D-rendered objects (cubes, spheres,
+                           cylinders) in varying colors, materials, and sizes, alongside ground-truth scene
+                           graphs and question-answer pairs that test counting, comparison, querying, and
+                           spatial reasoning.""",
+            features=Features(
+                {
+                    "image": Image(),
+                    "image_filename": Value("string"),
+                    "image_index": Value("int32"),
+                    # JSON-serialised scene dict (keys: objects, relationships, directions, etc.)
+                    # Empty JSON object '{}' for the test split, which has no scene annotations.
+                    "scene_json": Value("string"),
+                    # JSON-serialised list of question dicts for this image.
+                    # Test-split questions omit 'answer' and 'program' fields.
+                    "questions_json": Value("string"),
+                }
+            ),
+            supervised_keys=None,
+            homepage=self.SOURCE["homepage"],
+            citation=self.SOURCE["citation"],
+        )
+
+    def _generate_examples(self, data_path, split):
+        """Generate examples from the CLEVR_v1.0.zip archive."""
+        with zipfile.ZipFile(data_path, "r") as zf:
+            # ------------------------------------------------------------------
+            # 1. Scene annotations (train / val only; test has none)
+            # ------------------------------------------------------------------
+            scene_lookup = {}
+            if split in ("train", "val"):
+                scene_path = f"CLEVR_v1.0/scenes/CLEVR_{split}_scenes.json"
+                with zf.open(scene_path) as f:
+                    scenes_data = json.load(f)
+                for scene in scenes_data["scenes"]:
+                    scene_lookup[scene["image_index"]] = scene
+
+            # ------------------------------------------------------------------
+            # 2. Question annotations (all splits; test lacks answers/programs)
+            # ------------------------------------------------------------------
+            question_path = f"CLEVR_v1.0/questions/CLEVR_{split}_questions.json"
+            with zf.open(question_path) as f:
+                questions_data = json.load(f)
+
+            questions_by_image = {}
+            for q in questions_data["questions"]:
+                questions_by_image.setdefault(q["image_index"], []).append(q)
+
+            # ------------------------------------------------------------------
+            # 3. Yield one example per image
+            # ------------------------------------------------------------------
+            num_images = self._SPLIT_SIZES[split]
+            for image_index in range(num_images):
+                image_filename = f"CLEVR_{split}_{image_index:06d}.png"
+                image_zip_path = f"CLEVR_v1.0/images/{split}/{image_filename}"
+
+                with zf.open(image_zip_path) as img_file:
+                    image = PILImage.open(io.BytesIO(img_file.read())).convert("RGB")
+
+                scene = scene_lookup.get(image_index, {})
+                questions = questions_by_image.get(image_index, [])
+
+                yield (
+                    image_index,
+                    {
+                        "image": image,
+                        "image_filename": image_filename,
+                        "image_index": image_index,
+                        "scene_json": json.dumps(scene),
+                        "questions_json": json.dumps(questions),
+                    },
+                )
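
Note on ``_generate_examples``: the annotation handling (read one JSON file from the archive, then bucket questions by ``image_index`` with ``setdefault``) can be exercised without the 18 GB download. The sketch below is not part of the PR. It builds a tiny in-memory zip that mimics the question-file path used above. The function name ``group_questions`` and the three questions are hypothetical, and image decoding is left out.

```python
import io
import json
import zipfile

# Build a small in-memory archive with the same question-file path as CLEVR_v1.0.zip.
# The questions are invented for illustration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "CLEVR_v1.0/questions/CLEVR_val_questions.json",
        json.dumps({"questions": [
            {"image_index": 0, "question": "How many cubes are there?", "answer": "1"},
            {"image_index": 1, "question": "What color is the sphere?", "answer": "red"},
            {"image_index": 0, "question": "Is there a cylinder?", "answer": "no"},
        ]}),
    )


def group_questions(archive, split):
    """Map image_index -> list of question dicts, preserving file order."""
    path = f"CLEVR_v1.0/questions/CLEVR_{split}_questions.json"
    with zipfile.ZipFile(archive, "r") as zf, zf.open(path) as f:
        data = json.load(f)
    grouped = {}
    for q in data["questions"]:
        grouped.setdefault(q["image_index"], []).append(q)
    return grouped


buf.seek(0)
grouped = group_questions(buf, "val")
for image_index in range(3):
    # Images without questions fall back to an empty list, as in the builder.
    questions = grouped.get(image_index, [])
    print(image_index, json.dumps([q["answer"] for q in questions]))
```

The ``.get(image_index, [])`` fallback mirrors the builder: an image with no questions still yields an example, and its ``questions_json`` field is ``"[]"``.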