feat: Add EvaluateCallback with disk-offloading and Metric Cards #62
Open
Anurag9Dhiman wants to merge 2 commits into
added 2 commits
April 15, 2026 11:34
- New module: stable_datasets/callbacks.py
- PredictionDiskWriter: Arrow-based incremental disk writer for predictions
- EvaluateCallback: Lightning Callback with distributed sync and HF evaluate
- generate_metric_card: Automated Hub-compatible Metric Card generation
- Updated __init__.py to export EvaluateCallback
- Added test_callback.py with unit tests for disk writer and card generation
Summary
Adds a new EvaluateCallback for PyTorch Lightning that wraps the Hugging Face evaluate library and offloads predictions to disk using the Arrow (datasets) backend. This prevents OOM errors during large-scale evaluations while ensuring reproducible, standardized benchmark scores.
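The disk-offloading idea can be illustrated with a minimal, standard-library-only sketch. This is not the PR's implementation: JSON Lines files stand in for the Arrow backend, a plain accuracy function stands in for a Hugging Face evaluate metric, and the two "ranks" are simulated in one process with no torch.distributed.barrier() call. The names JsonlShardWriter, iter_shards, accuracy, and demo are hypothetical and exist only for this illustration.

```python
import json
import tempfile
from pathlib import Path


class JsonlShardWriter:
    """Hypothetical stand-in for an incremental prediction writer.

    Assumption: JSON Lines replaces Arrow so the sketch needs no
    third-party packages. Each rank writes to its own file.
    """

    def __init__(self, out_dir, rank):
        self.path = Path(out_dir) / f"rank_{rank}.jsonl"
        self._fh = self.path.open("w", encoding="utf-8")

    def write_batch(self, predictions, references):
        # Write one record per example, then flush, so the batch does
        # not have to stay in memory until the end of the epoch.
        for pred, ref in zip(predictions, references):
            record = {"prediction": pred, "reference": ref}
            self._fh.write(json.dumps(record) + "\n")
        self._fh.flush()

    def close(self):
        self._fh.close()


def iter_shards(out_dir):
    """Stream records from every per-rank shard, one line at a time."""
    for path in sorted(Path(out_dir).glob("rank_*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)


def accuracy(records):
    """Plain accuracy over streamed records (stand-in for a real metric)."""
    correct = 0
    total = 0
    for rec in records:
        correct += int(rec["prediction"] == rec["reference"])
        total += 1
    return correct / total if total else 0.0


def demo():
    """Simulate two ranks writing batches, then aggregate as rank 0 would."""
    with tempfile.TemporaryDirectory() as out_dir:
        batches_per_rank = {
            0: [([1, 0], [1, 1])],     # 1 of 2 correct
            1: [([2, 2, 0], [2, 1, 0])],  # 2 of 3 correct
        }
        for rank, batches in batches_per_rank.items():
            writer = JsonlShardWriter(out_dir, rank)
            for predictions, references in batches:
                writer.write_batch(predictions, references)
            writer.close()
        return accuracy(iter_shards(out_dir))


if __name__ == "__main__":
    print(demo())
```

Because each simulated rank owns its own file, no two writers touch the same path, and the aggregation step only reads the shards after every writer has closed. In a real distributed run that ordering would need an explicit synchronization point.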
Motivation
When evaluating large models on big datasets (e.g., ImageNet-1k, full-scale
retrieval), storing all predictions in RAM until epoch end causes memory spikes
and often OOM crashes. This PR solves that by:
- … evaluate's versioned, reproducible logic

New Module: stable_datasets/callbacks.py

PredictionDiskWriter
- … torch.Tensor, numpy.ndarray, and Python lists

EvaluateCallback (lightning.pytorch.callbacks.Callback)
- on_validation_batch_end: Captures model outputs via a user-defined input_format_fn and flushes them to disk immediately
- on_validation_epoch_end: Synchronizes all GPU ranks via torch.distributed.barrier(), aggregates Arrow shards on Rank 0, computes metrics using HF evaluate, and logs results to Lightning
- … (rank_0.arrow, rank_1.arrow, etc.) to avoid file contention in distributed settings

generate_metric_card()
- … model_index metadata

Usage Example