feat: Add EvaluateCallback with disk-offloading and Metric Cards by Anurag9Dhiman · Pull Request #62 · galilai-group/stable-datasets

Anurag9Dhiman · 2026-04-15T06:21:14Z

Summary

Adds a new EvaluateCallback for PyTorch Lightning that wraps the Hugging Face
evaluate library and offloads predictions to disk using the Arrow (datasets)
backend. This prevents OOM errors during large-scale evaluations while ensuring
reproducible, standardized benchmark scores.

Motivation

When evaluating large models on big datasets (e.g., ImageNet-1k, full-scale
retrieval), storing all predictions in RAM until epoch end causes memory spikes
and often OOM crashes. This PR addresses that with:

- Disk offloading: predictions are written incrementally to Apache Arrow IPC files
- Distributed syncing across multiple GPUs using per-rank sharding
- Standardized scoring through HF evaluate's versioned, reproducible logic
- Automated Metric Cards for Hugging Face Hub compatibility

New Module: `stable_datasets/callbacks.py`

`PredictionDiskWriter`

- Incrementally writes batches of predictions/references to Arrow IPC files
- Accepts `torch.Tensor`, `numpy.ndarray`, and Python lists
- Memory usage stays constant regardless of dataset size

`EvaluateCallback(lightning.pytorch.callbacks.Callback)`

- `on_validation_batch_end`: captures model outputs via a user-defined `input_format_fn` and flushes them to disk immediately
- `on_validation_epoch_end`: synchronizes all GPU ranks via `torch.distributed.barrier()`, aggregates Arrow shards on rank 0, computes metrics using HF evaluate, and logs results to Lightning
- Each GPU writes its own shard file (`rank_0.arrow`, `rank_1.arrow`, etc.) to avoid file contention in distributed settings

`generate_metric_card()`

- Produces a Markdown report with YAML model_index metadata
- Output is directly compatible with HF Hub Model Card format
- Allows users to publish verified, reproducible benchmark results
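
For readers unfamiliar with the format, a card of this kind might look like the output of the sketch below. The function `build_metric_card` and its parameters are hypothetical and do not reflect the signature of `generate_metric_card()`. The layout follows the Hub's model card metadata, where the key is spelled `model-index` and each result carries a task type, a dataset name and type, and a list of metrics.

```python
def build_metric_card(model_id, task_type, dataset_name, dataset_type, metrics):
    """Return a Markdown card with YAML front matter (hypothetical helper)."""
    lines = [
        "---",
        "model-index:",
        f"- name: {model_id}",
        "  results:",
        "  - task:",
        f"      type: {task_type}",
        "    dataset:",
        f"      name: {dataset_name}",
        f"      type: {dataset_type}",
        "    metrics:",
    ]
    for name, value in metrics.items():
        lines.append(f"    - type: {name}")
        lines.append(f"      value: {value}")
    # Human-readable section after the YAML front matter.
    lines += [
        "---",
        "",
        f"# Evaluation results for {model_id}",
        "",
        "| Metric | Value |",
        "| --- | --- |",
    ]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value} |")
    return "\n".join(lines) + "\n"


card = build_metric_card(
    model_id="my-model",
    task_type="image-classification",
    dataset_name="ImageNet-1k",
    dataset_type="imagenet-1k",
    metrics={"accuracy": 0.6},  # illustrative value, not a real result
)
```

The sketch interpolates values into YAML by hand for brevity; a more careful version would serialize the metadata with a YAML library so that special characters are escaped.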

Usage Example

import lightning.pytorch as pl

from stable_datasets import EvaluateCallback

def format_fn(outputs):
    return {
        "predictions": outputs["logits"].argmax(dim=-1),
        "references": outputs["labels"],
    }

callback = EvaluateCallback(
    metric_name="accuracy",
    input_format_fn=format_fn,
    hub_model_id="my-model",
)

trainer = pl.Trainer(callbacks=[callback])

- New module: stable_datasets/callbacks.py
- PredictionDiskWriter: Arrow-based incremental disk writer for predictions
- EvaluateCallback: Lightning Callback with distributed sync and HF evaluate
- generate_metric_card: Automated Hub-compatible Metric Card generation
- Updated __init__.py to export EvaluateCallback
- Added test_callback.py with unit tests for disk writer and card generation

Anurag Dhiman added 2 commits April 15, 2026 11:34

fix: Optimized Arrow memory mapping in EvaluateCallback

c728868

Anurag9Dhiman wants to merge 2 commits into galilai-group:main from Anurag9Dhiman:feature/evaluate-callback
