50 repositories

VLMEvalKit
Public
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
computer-vision evaluation pytorch gemini openai vqa vit gpt multi-modal clip
Python • Apache License 2.0 • 723 forks • 4.2k stars • 209 issues • 48 pull requests • Updated Jun 17, 2026
opencompass
Public
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…
benchmark evaluation openai llm chatgpt large-language-model llama2 llama3
Python • Apache License 2.0 • 791 forks • 7.1k stars • 378 issues • 85 pull requests • Updated Jun 17, 2026
PowerBench
Public
0 forks • 0 stars • 0 issues • 0 pull requests • Updated Jun 9, 2026
CNFinBench
Public
CNFinBench — the first comprehensive benchmark for high-stakes financial scenarios. It spans 29 subtasks grounded in authoritative financial corpora and real bu…
Python • 0 forks • 16 stars • 0 issues • 0 pull requests • Updated Jun 9, 2026
SearchAgentService
Public
Python • 1 fork • 0 stars • 0 issues • 0 pull requests • Updated May 22, 2026
Terminal-Bench-server
Public
Shell • 1 fork • 0 stars • 0 issues • 0 pull requests • Updated Apr 28, 2026
GTA
Public
[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2
llm-agent llm-evaluation
Python • Apache License 2.0 • 10 forks • 145 stars • 1 issue • 0 pull requests • Updated Apr 20, 2026
SWE-bench-server
Public
Python • 0 forks • 1 star • 0 issues • 0 pull requests • Updated Apr 15, 2026
GenEditEvalKit
Public
The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks.
Jupyter Notebook • MIT License • 4 forks • 46 stars • 0 issues • 0 pull requests • Updated Apr 12, 2026
pinchbench_server
Public
Python • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Apr 3, 2026
TextEdit
Public
We provide TextEdit, a high-quality, multi-scenario text editing benchmark for generation models.
Python • MIT License • 0 forks • 20 stars • 0 issues • 0 pull requests • Updated Mar 16, 2026
MiroFlow
Public
MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowserComp and xBench.
Python • Apache License 2.0 • 319 forks • 0 stars • 0 issues • 0 pull requests • Updated Dec 30, 2025
RePro
Public
[ICLR 2026] Rectifying LLM Thought From Lens of Optimization
reinforcement-learning large-language-model large-language-model-reasoning
Python • MIT License • 4 forks • 15 stars • 1 issue • 0 pull requests • Updated Dec 5, 2025
SAGA
Public
The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."
0 forks • 11 stars • 1 issue • 0 pull requests • Updated Nov 27, 2025
ATLAS
Public
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
2 forks • 7 stars • 0 issues • 0 pull requests • Updated Nov 20, 2025
OASIS
Public
Python • 0 forks • 3 stars • 0 issues • 0 pull requests • Updated Nov 12, 2025
InteractScience
Public
JavaScript • Apache License 2.0 • 0 forks • 8 stars • 0 issues • 0 pull requests • Updated Oct 31, 2025
CognitiveKernel-Pro
Public
Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414
Python • Other • 56 forks • 0 stars • 0 issues • 0 pull requests • Updated Oct 27, 2025
GAOKAO-Eval
Public
Jupyter Notebook • 7 forks • 122 stars • 5 issues • 0 pull requests • Updated Oct 7, 2025
.github
Public
1 fork • 0 stars • 0 issues • 0 pull requests • Updated Sep 9, 2025
MMBench-GUI
Public
Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent with a hierarchical mann…
benchmark-framework vision-language-model computer-use gui-agent
Python • 5 forks • 112 stars • 6 issues • 0 pull requests • Updated Sep 8, 2025
ReasonZoo
Public
Python • Apache License 2.0 • 0 forks • 3 stars • 0 issues • 0 pull requests • Updated Aug 27, 2025
CompassVerifier
Public
[EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
Jupyter Notebook • 2 forks • 69 stars • 0 issues • 0 pull requests • Updated Aug 10, 2025
GPassK
Public
[ACL 2025] Are Your LLMs Capable of Stable Reasoning?
large-language-model-evaluation reasoning-stability
Python • 2 forks • 33 stars • 2 issues • 0 pull requests • Updated Aug 5, 2025
Creation-MMBench
Public
Assessing Context-Aware Creative Intelligence in MLLMs
JavaScript • 0 forks • 23 stars • 1 issue • 0 pull requests • Updated Jul 22, 2025
CompassJudger
Public
The all-in-one judge models introduced by OpenCompass
Apache License 2.0 • 6 forks • 119 stars • 1 issue • 0 pull requests • Updated Jul 15, 2025
RaML
Public
[Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
Jupyter Notebook • 2 forks • 8 stars • 0 issues • 0 pull requests • Updated May 27, 2025
BotChat
Public
Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
Jupyter Notebook • Apache License 2.0 • 7 forks • 162 stars • 2 issues • 0 pull requests • Updated May 22, 2025
Ada-LEval
Public
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
gpt4 llm long-context
Python • 3 forks • 56 stars • 0 issues • 0 pull requests • Updated May 22, 2025
MathBench
Public
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
Apache License 2.0 • 1 fork • 115 stars • 5 issues • 0 pull requests • Updated May 22, 2025
