Skip to content

fastino-ai/PROBE_benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

3 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents (PROBE benchmark)

Paper Dataset License

Gil Pasternak, Dheeraj Rajagopal, Julia White, Dhruv Atreja, Matthew Thomas, George Hurn-Maloney, Ash Lewis


Overview

PROBE Task Setup

PROBE (Proactive Resolution of Bottlenecks) evaluates AI agents on proactivity: anticipating user needs from continuous observation and solving problems autonomously across realistic, multi-document workplace scenarios. Rather than awaiting explicit instruction, proactive agents must search for unspecified issues, identify critical bottlenecks, and execute appropriate resolutions.

The model (or agent) uses the world model to: (i) search over the user datastore, (ii) identify the bottleneck, and (iii) select the action to execute. The model is evaluated across all three tasks in this pipeline.

🎯 Three Core Capabilities

  1. πŸ” Search β€” Finding relevant information across a personalized datastore
  2. 🎯 Identification β€” Detecting specific bottlenecks in retrieved content
  3. βœ… Resolution β€” Executing appropriate actions to resolve issues

πŸ“Š Benchmark Stats

  • 1,000 test samples across easy/medium/hard difficulty
  • Authentic personas from LinkedIn professional profiles
  • Multi-document context with emails, calendars, and documents
  • Capability ceiling: Even state-of-the-art models achieve only 40% end-to-end success (GPT-5, Claude Opus 4.1), revealing substantial challenges in developing truly proactive AI systems

πŸ“¦ Installation

git clone https://github.com/fastino/PROBE.git
cd PROBE
pip install -e .

# Set up API key
echo "OPENAI_API_KEY=your_api_key_here" > .env

Requirements: Python 3.12+, OpenAI API key

Optional extras:

pip install -e ".[baselines]"  # For agentic baselines (ReAct, Reflexion, ReWOO)
pip install -e ".[ui]"          # For annotation interface
pip install -e ".[all]"         # Everything

πŸš€ Quick Start

1️⃣ Get Benchmark Data

Option A: Use Pre-generated Dataset (Recommended)

Download the complete 1,000-sample benchmark from Hugging Face:

pip install huggingface-hub
huggingface-cli download gilfastino/PROBE --repo-type dataset --local-dir data/probe_benchmark

# Extract the files
cd data/probe_benchmark
unzip sept_23_1000_inputs_20250923_131956.zip
unzip sept_23_1000_outputs_20250923_131956.zip

Option B: Generate Custom Data

python run.py --mode batch --count 5 --difficulty medium

Output: generated_data/TIMESTAMP_batch/inputs/ and outputs/

2️⃣ Run Inference

Option A: LLM Baseline (single-pass, native batch APIs)

python baselines/llm/run_native_batch_evaluation.py \
  --models gpt-5 claude-opus-4-1-20250805 \
  --data_dir generated_data/TIMESTAMP_batch/inputs

Option B: Agentic Baseline (multi-step reasoning)

# Run with a single agent and model
python baselines/agentic/inference.py \
  --data_dir generated_data/TIMESTAMP_batch/inputs \
  --output_dir results/my_experiment \
  --model gpt-4.1 \
  --agent react_agent

# To test multiple models simultaneously
python baselines/agentic/inference.py \
  --data_dir generated_data/TIMESTAMP_batch/inputs \
  --output_dir results/my_experiment \
  --models gpt-4.1 gpt-5 \
  --agent reflexion_agent \
  --multi_model

3️⃣ Evaluate

# Basic evaluation with exact matching
python evaluation/batch_evaluate_baselines.py \
  --predictions-dir results/my_experiment/inputs \
  --labels-dir generated_data/TIMESTAMP_batch/outputs

# With LLM judge for semantic similarity (recommended)
python evaluation/batch_evaluate_baselines.py \
  --predictions-dir results/my_experiment/inputs \
  --labels-dir generated_data/TIMESTAMP_batch/outputs \
  --use-llm-judge \
  --output-file results/evaluation.json

Arguments:

  • --predictions-dir: Directory containing model predictions (with subdirs per model, e.g., react_agent_gpt-4.1/)
  • --labels-dir: Directory containing ground truth labels (*_output.json files)
  • --inputs-dir: (Optional) Directory containing input files. Defaults to labels-dir/../inputs
  • --use-llm-judge: Enable LLM-based semantic matching for bottleneck identification
  • --output-file: Where to save results (default: auto-generated timestamp)

Supported Prediction Formats:

  • Individual JSON files (one per example): model_name/example_id_results.json
  • Batch API JSONL files: model_name/combined_batches.jsonl (automatically extracts predictions from Anthropic/OpenAI batch formats)

πŸ“‚ Repository Structure

PROBE/
β”œβ”€β”€ run.py                    # Data generation entry point
β”œβ”€β”€ configs/                  # Configuration schemas
β”œβ”€β”€ data_generation/          # Generation pipeline
β”‚   β”œβ”€β”€ generators/           # World model, bottleneck, checklist, evidence
β”‚   β”œβ”€β”€ prompts/              # Jinja2 templates
β”‚   └── data/                 # LinkedIn personas
β”œβ”€β”€ baselines/
β”‚   β”œβ”€β”€ agentic/              # ReAct, Reflexion, ReWOO
β”‚   └── llm/                  # Direct LLM (batch APIs)
β”œβ”€β”€ evaluation/               # Scoring and evaluation
└── generated_data/           # Benchmark outputs

βš™οΈ Configuration

Data Generation

CLI:

python run.py \
  --mode batch \
  --count 100 \
  --difficulty hard \
  --workers 8 \

YAML config:

mode: batch
count: 100
difficulty: medium
generate_distractors: true
distractor_count: 10
max_workers: 8

Run: python run.py --config config.yaml

Environment Variables

# Required
OPENAI_API_KEY=sk-...          # For OpenAI models (data generation & evaluation)

# Optional (for specific models)
ANTHROPIC_API_KEY=sk-ant-...   # For Claude models
GOOGLE_API_KEY=...             # For Gemini models
HUGGINGFACE_TOKEN=hf_...       # For LinkedIn persona dataset access

πŸ“Š Evaluation

Metrics

PROBE evaluates three core capabilities:

Component What it tests Metric
πŸ” Search Retrieved relevant documents F1 score (precision Γ— recall)
🎯 Identification Detected correct bottleneck Exact match or LLM-judge similarity
βœ… Resolution Selected right action + parameters Action match + parameter accuracy
πŸ“ˆ Overall End-to-end performance Average of all three components

Running Evaluation

# Evaluate all models in predictions directory
python evaluation/batch_evaluate_baselines.py \
  --predictions-dir results/my_experiment/inputs \
  --labels-dir generated_data/TIMESTAMP_batch/outputs \
  --use-llm-judge \
  --output-file results/evaluation.json

# Evaluate specific models only
python evaluation/batch_evaluate_baselines.py \
  --predictions-dir results/my_experiment/inputs \
  --labels-dir generated_data/TIMESTAMP_batch/outputs \
  --models react_agent_gpt-4.1 baseline_agent_gpt-5 \
  --use-llm-judge

The script outputs:

  • JSON file with detailed per-example scores
  • CSV summary with aggregate statistics
  • Leaderboard printed to console

πŸ€– Adding Custom Baselines

Agentic Baseline

from baselines.agentic.base_agent import BaseAgent

class MyAgent(BaseAgent):
    def run(self, world_model, corpus):
        # 1. Search
        docs = self.retrieve_documents(corpus)
        
        # 2. Identify
        bottleneck = self.identify_bottleneck(docs, world_model)
        
        # 3. Resolve
        action = self.select_action(bottleneck, world_model)
        
        return {
            "retrieved_documents": [d["id"] for d in docs],
            "bottleneck": bottleneck,
            "action": {"function_name": action["name"], "parameters": action["params"]}
        }

Register in baselines/agentic/inference.py and run with --agent myagent.


πŸ“š Citation

If you use PROBE in your research, please cite our paper:

@article{pasternak2025probe,
  title={Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents},
  author={Pasternak, Gil and Rajagopal, Dheeraj and White, Julia and Atreja, Dhruv and Thomas, Matthew and Hurn-Maloney, George and Lewis, Ash},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}

🀝 Contributing

We welcome contributions!

Areas:

  • πŸ€– New baselines
  • πŸ”§ Additional bottleneck types
  • 🎭 Better distractor generation
  • πŸ“Š New evaluation metrics

Process:

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Make changes with tests
  4. Run linters: black, mypy
  5. Submit PR

πŸ“– Additional Documentation


πŸ”§ Troubleshooting

Rate limits: Adjust --workers (generation) or --max_workers (inference)

Memory issues: Use --count 10 and --no-distractors for testing

Import errors: pip install -e . or pip install -e ".[baselines]"

Evaluation issues: Ensure directory has inputs/ and outputs/ subdirectories


πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright (c) Fastino Inc 2025

Citation & Usage

If you use PROBE in your research, please cite our paper (see Citation section above).

Contact: GitHub Issues for bug reports β€’ GitHub Discussions for questions


Built for advancing proactive AI research πŸš€

About

Code for the paper "Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors