AgentCIBench

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

AgentCIBench is an evaluation harness that measures whether computer-use agents (CUAs) respect contextual integrity (CI) when operating across personal applications. It converts everyday cross-app requests into executable, deterministically scored scenarios that target three failure modes: visual co-location, task-ambiguity overshare, and recipient misalignment.
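
As a rough illustration of what a deterministic leak check can look like, the sketch below flags a leak when a string marked as out-of-context for the recipient appears in the agent's outgoing message. The `Scenario` fields and the `leaked_items` helper are hypothetical names chosen for this example; they are not the benchmark's actual schema or scorer.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical scenario record; not the benchmark's real schema."""
    task: str
    recipient: str
    # Strings that would violate contextual integrity if sent to `recipient`.
    protected_items: list[str] = field(default_factory=list)


def leaked_items(scenario: Scenario, outgoing_message: str) -> list[str]:
    """Return protected strings that appear (case-insensitively) in the message."""
    text = outgoing_message.casefold()
    return [item for item in scenario.protected_items if item.casefold() in text]


scenario = Scenario(
    task="Send Dana the meeting time from my calendar",
    recipient="Dana",
    protected_items=["oncology appointment", "salary review"],
)
message = "Meeting is at 3pm. FYI I also have an oncology appointment at 5."
print(leaked_items(scenario, message))  # ['oncology appointment']
```

Substring matching is used here only because it is deterministic and easy to audit; it would miss paraphrased leaks.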

We evaluate 15 frontier agents and find that 11 leak on more than 50% of scenarios, with an average leakage rate of 67.9%. The same failures persist when agents act end-to-end in the rendered OpenApps UI.

📄 Paper: arXiv:2606.23189
🤗 Dataset: huggingface.co/datasets/UKPLab/AgentCIBench
🌐 Leaderboard / project page: ukplab.github.io/arxiv2026-agentcibench

config/                Hydra app, task, agent, and defense configs
data/                  Local copy of the benchmark (also mirrored on Hugging Face)
envs/                  Scenario-to-OpenApps visual benchmark bridge
eval/                  Reasoning and visual benchmark runners
mcts/                  Scenario generation engine and CI scoring helpers
scripts/               End-to-end experiment scripts and aggregation tools
src/open_apps/         Local multi-app web environment used by visual runs
tests/                 Focused regression tests
docker/                Optional containerised runtime

Quickstart

Option A — local install (Python 3.11 + `uv`)

git clone https://github.com/UKPLab/arxiv2026-agentcibench.git
cd arxiv2026-agentcibench
uv sync
uv run playwright install chromium

Set provider keys (LiteLLM routes most calls):

export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
export OPENROUTER_API_KEY=...  # or use direct provider keys
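
Before launching a paid run, it can help to confirm which keys are actually visible to the process. The sketch below only checks the four variable names from the export lines above; `missing_keys` is a hypothetical helper, and which keys you really need depends on the models you route to.

```python
import os

# Variable names taken from the export lines above.
PROVIDER_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "OPENROUTER_API_KEY",
)


def missing_keys(env=None):
    """Return provider key names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    print("missing:", ", ".join(missing) if missing else "none")
```

A missing key is not necessarily an error: with OpenRouter alone, the direct provider keys may be unnecessary.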

Option B — Docker (recommended for reproducibility)

docker build -t agentcibench -f docker/Dockerfile .
docker run --rm -it \
  -e OPENROUTER_API_KEY="$OPENROUTER_API_KEY" \
  -v "$PWD/data:/app/data" \
  agentcibench bash

See docker/README.md for docker compose, GPU notes, and the headed-browser variant used for visual debugging.

Pull data from Hugging Face (optional)

The repo ships a local copy of the scenarios under data/. To pull the canonical version:

uv run python -c "
from datasets import load_dataset
ds = load_dataset('UKPLab/AgentCIBench')
ds['test'].to_json('data/generated_merged.jsonl')
ds['test_e2e'].to_json('data/eval_set_e2e_50.jsonl')
"
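
After exporting, a quick sanity check on the JSONL files can catch an empty or truncated download. The sketch below makes no assumption about the scenario schema: it counts records and tallies whichever top-level keys are present. `summarize_jsonl` is a hypothetical helper, not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Count records in a JSONL file and tally which top-level keys appear."""
    n_records = 0
    key_counts = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return n_records, dict(key_counts)


if __name__ == "__main__":
    target = Path("data/generated_merged.jsonl")  # path written by the command above
    if target.exists():
        print(summarize_jsonl(target))
```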

Running the benchmark

Reasoning (state-grounded) benchmark

uv run python -m eval.run_benchmark \
  --generated-dir data/generated_merged \
  --results-dir data/results/text_smoke/<model_slug> \
  --proxy-model openrouter/openai/gpt-5.4-mini \
  --judge-model openrouter/google/gemma-4-31b-it

Full sweep (matches the paper):

USE_OPENROUTER=1 \
MODELS="openai/gpt-5.4 anthropic/claude-sonnet-4.6 deepseek/deepseek-v4-pro" \
scripts/02_text_benchmark.sh data/generated_merged data/results/text
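
For driving individual runs from Python, the sketch below expands a space-separated model list into one `eval.run_benchmark` command per model, using only the flags shown in the single-model example. The `model_slug` rule and the `openrouter/` prefixing are assumptions made for illustration; whether scripts/02_text_benchmark.sh builds its commands the same way is not confirmed here.

```python
import shlex


def model_slug(model):
    """Hypothetical slug rule: replace path separators so the name is one directory."""
    return model.replace("/", "_")


def build_command(model, generated_dir, results_root, judge_model):
    """Assemble an eval.run_benchmark invocation from the flags shown above."""
    return [
        "uv", "run", "python", "-m", "eval.run_benchmark",
        "--generated-dir", generated_dir,
        "--results-dir", f"{results_root}/{model_slug(model)}",
        "--proxy-model", f"openrouter/{model}",
        "--judge-model", judge_model,
    ]


models = "openai/gpt-5.4 anthropic/claude-sonnet-4.6 deepseek/deepseek-v4-pro".split()
for m in models:
    cmd = build_command(
        m,
        "data/generated_merged",
        "data/results/text",
        "openrouter/google/gemma-4-31b-it",
    )
    # Print only; pass `cmd` to subprocess.run to actually execute it.
    print(shlex.join(cmd))
```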

Live UI benchmark

USE_OPENROUTER=1 scripts/05_visual_main.sh data/eval_set_e2e_50 data/results/visual_e2e

Defense ablations

USE_OPENROUTER=1 \
MODELS="openai/gpt-5.4-mini deepseek/deepseek-v4-pro" \
scripts/07_text_defenses.sh data/eval_set_defenses data/results/text_defenses

Regenerate scenarios

USE_OPENROUTER=1 OUTPUT_DIR=data/generated_new \
RUN_LOG_DIR=data/results/mcts_runs_new \
ITERATIONS=35 NODE_EXPANSION_LIMIT=28 \
scripts/01_generate_scenarios.sh

Verifying the install (no paid API calls)

uv run python -m compileall src envs eval mcts prompts.py
uv run pytest tests/test_mcts_phase_a.py tests/test_prompts.py \
  tests/test_proxy_agent.py tests/test_reward_judge.py \
  tests/test_visual_benchmark.py

Submitting to the leaderboard

We host a leaderboard at ukplab.github.io/arxiv2026-agentcibench. To submit your model, open a PR adding a row to leaderboard/models.json with a link to the per-scenario JSONL output produced by eval.run_benchmark.
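
Before opening a PR, you may want to check the headline number in your per-scenario output. The sketch below computes the fraction of scenarios flagged as leaks. The `leaked` field name and the sample records are hypothetical; substitute whatever field the JSONL produced by eval.run_benchmark actually uses.

```python
import json


def leakage_rate(jsonl_lines, leak_field="leaked"):
    """Fraction of scenarios whose record has a truthy `leak_field`.

    `leak_field` is a hypothetical name; replace it with the field your results use.
    """
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not records:
        raise ValueError("no records found")
    return sum(bool(r.get(leak_field)) for r in records) / len(records)


# Made-up records for illustration only.
sample = [
    '{"scenario_id": "s1", "leaked": true}',
    '{"scenario_id": "s2", "leaked": false}',
    '{"scenario_id": "s3", "leaked": true}',
    '{"scenario_id": "s4", "leaked": true}',
]
print(f"{leakage_rate(sample):.1%}")  # 75.0%
```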

Citation

@article{goel2026agentcibench,
  title   = {Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?},
  author  = {Goel, Anmol and Gurevych, Iryna},
  journal = {arXiv preprint arXiv:2606.23189},
  year    = {2026}
}

Licensing

Code: Apache License 2.0 (LICENSE)
Data and scenario pool: CC BY 4.0 (see Hugging Face dataset card)
OpenApps environment assets: included synthetic content, released under CC BY 4.0 alongside the data

Responsible use

AgentCIBench targets privacy-failure behaviour by design. The released scenarios are intended for pre-deployment evaluation, regression testing, and mitigation research, not for soliciting harmful outputs. See the Ethical Considerations section of the paper for the full discussion.

Contributing

Issues and PRs are welcome. See CONTRIBUTING.md and the Code of Conduct. For security disclosures, please email anmol.goel@tu-darmstadt.de rather than opening a public issue.

Contact

Anmol Goel — anmol.goel@tu-darmstadt.de
