feat: Add EvaluateCallback with disk-offloading and Metric Cards #62

Open
Anurag9Dhiman wants to merge 2 commits into galilai-group:main from Anurag9Dhiman:feature/evaluate-callback

Summary

Adds a new EvaluateCallback for PyTorch Lightning that wraps the Hugging Face
evaluate library and offloads predictions to disk using the Arrow (datasets)
backend. Predictions no longer accumulate in RAM, which avoids the usual cause
of OOM errors in large-scale evaluations, and scores are computed by evaluate's
standardized metric implementations so that benchmark results are reproducible.

Motivation

When evaluating large models on big datasets (e.g., ImageNet-1k, full-scale
retrieval), storing all predictions in RAM until epoch end causes memory spikes
and often OOM crashes. This PR solves that by:

  1. Disk-offloading predictions incrementally via Apache Arrow IPC files
  2. Distributed syncing across multiple GPUs using per-rank sharding
  3. Standardized scoring through HF evaluate's versioned, reproducible logic
  4. Automated Metric Cards for instant Hugging Face Hub compatibility

New Module: stable_datasets/callbacks.py

PredictionDiskWriter

  • Incrementally writes batches of predictions/references to Arrow IPC files
  • Accepts torch.Tensor, numpy.ndarray, and Python lists
  • Memory usage while writing is bounded by the batch size, not the dataset size

EvaluateCallback(lightning.pytorch.callbacks.Callback)

  • on_validation_batch_end: Captures model outputs via a user-defined
    input_format_fn and flushes them to disk immediately
  • on_validation_epoch_end: Synchronizes all GPU ranks via
    torch.distributed.barrier(), aggregates Arrow shards on Rank 0,
    computes metrics using HF evaluate, and logs results to Lightning
  • Each GPU writes its own shard file (rank_0.arrow, rank_1.arrow, etc.)
    to avoid file contention in distributed settings

generate_metric_card()

  • Produces a Markdown report with YAML model-index metadata
  • Output is directly compatible with HF Hub Model Card format
  • Allows users to publish verified, reproducible benchmark results
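
A rough sketch of what such a card could look like is given below. The function
name, its parameters, and the table layout are assumptions for illustration;
the signature of generate_metric_card() in this PR is not shown here. The YAML
block follows the Hub's model-index layout (task, dataset, metrics), and the
metric value is a placeholder, not a measured result.

```python
def generate_metric_card_sketch(model_id, task_type, dataset_name, dataset_type, metrics):
    """Build a Markdown report with a YAML front-matter block (hypothetical helper).

    The YAML is written by hand to keep the sketch dependency-free, so values
    containing YAML special characters would need quoting in real use.
    """
    lines = [
        "---",
        "model-index:",
        f"- name: {model_id}",
        "  results:",
        "  - task:",
        f"      type: {task_type}",
        "    dataset:",
        f"      name: {dataset_name}",
        f"      type: {dataset_type}",
        "    metrics:",
    ]
    for name, value in metrics.items():
        lines.append(f"    - type: {name}")
        lines.append(f"      value: {value}")
    lines += [
        "---",
        "",
        f"# Evaluation results for {model_id}",
        "",
        "| Metric | Value |",
        "| --- | --- |",
    ]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value} |")
    return "\n".join(lines) + "\n"


# Usage with placeholder values.
card = generate_metric_card_sketch(
    model_id="my-model",
    task_type="image-classification",
    dataset_name="ImageNet-1k",
    dataset_type="imagenet-1k",
    metrics={"accuracy": 0.6},
)
```

Keeping the scores in the front matter as well as in the table lets the Hub
parse them as metadata while readers still see them in the rendered card.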

Usage Example

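import lightning.pytorch as pl

from stable_datasets import EvaluateCallback

def format_fn(outputs):
    # Map the step outputs to the predictions/references pair the metric expects.
    return {
        "predictions": outputs["logits"].argmax(dim=-1),
        "references": outputs["labels"],
    }

callback = EvaluateCallback(
    metric_name="accuracy",
    input_format_fn=format_fn,
    hub_model_id="my-model",
)

# Attach the callback so it runs during validation.
trainer = pl.Trainer(callbacks=[callback])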

Anurag Dhiman added 2 commits April 15, 2026 11:34
- New module: stable_datasets/callbacks.py
  - PredictionDiskWriter: Arrow-based incremental disk writer for predictions
  - EvaluateCallback: Lightning Callback with distributed sync and HF evaluate
  - generate_metric_card: Automated Hub-compatible Metric Card generation
- Updated __init__.py to export EvaluateCallback
- Added test_callback.py with unit tests for disk writer and card generation