    Repositories list

    • Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
      Python
      Apache License 2.0
      7234.2k20948 Updated Jun 17, 2026
    • OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…
      Python
      Apache License 2.0
      7917.1k37885 Updated Jun 17, 2026
    • 0000 Updated Jun 9, 2026
    • CNFinBench — the first comprehensive benchmark for high-stakes financial scenarios. It spans 29 subtasks grounded in authoritative financial corpora and real bu…
      Python
      01600 Updated Jun 9, 2026
    • Python
      1000 Updated May 22, 2026
    • Shell
      1000 Updated Apr 28, 2026
    • GTA

      Public
      [NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2
      Python
      Apache License 2.0
      1014510 Updated Apr 20, 2026
    • Python
      0100 Updated Apr 15, 2026
    • The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks.
      Jupyter Notebook
      MIT License
      44600 Updated Apr 12, 2026
    • Python
      0000 Updated Apr 3, 2026
    • TextEdit

      Public
      We provide TextEdit, a high-quality, multi-scenario text editing benchmark for generation models.
      Python
      MIT License
      02000 Updated Mar 16, 2026
    • MiroFlow

      Public
      MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowserComp and xBench.
      Python
      Apache License 2.0
      319000 Updated Dec 30, 2025
    • RePro

      Public
      [ICLR 2026] Rectifying LLM Thought From Lens of Optimization
      Python
      MIT License
      41510 Updated Dec 5, 2025
    • SAGA

      Public
      The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."
      01110 Updated Nov 27, 2025
    • ATLAS

      Public
      ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
      2700 Updated Nov 20, 2025
    • OASIS

      Public
      Python
      0300 Updated Nov 12, 2025
    • JavaScript
      Apache License 2.0
      0800 Updated Oct 31, 2025
    • Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414
      Python
      Other
      56000 Updated Oct 27, 2025
    • Jupyter Notebook
      712250 Updated Oct 7, 2025
    • .github

      Public
      1000 Updated Sep 9, 2025
    • Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent with a hierarchical mann…
      Python
      511260 Updated Sep 8, 2025
    • ReasonZoo

      Public
      Python
      Apache License 2.0
      0300 Updated Aug 27, 2025
    • [EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
      Jupyter Notebook
      26900 Updated Aug 10, 2025
    • GPassK

      Public
      [ACL 2025] Are Your LLMs Capable of Stable Reasoning?
      Python
      23320 Updated Aug 5, 2025
    • Assessing Context-Aware Creative Intelligence in MLLMs
      JavaScript
      02310 Updated Jul 22, 2025
    • The all-in-one judge models introduced by OpenCompass
      Apache License 2.0
      611910 Updated Jul 15, 2025
    • RaML

      Public
      [Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
      Jupyter Notebook
      2800 Updated May 27, 2025
    • BotChat

      Public
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      Apache License 2.0
      716220 Updated May 22, 2025
    • Ada-LEval

      Public
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      35600 Updated May 22, 2025
    • MathBench

      Public
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      Apache License 2.0
      111550 Updated May 22, 2025