Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?
AgentCIBench is an evaluation harness that measures whether computer-use agents (CUAs) respect contextual integrity (CI) when operating across personal applications. It converts everyday cross-app requests into executable, deterministically scored scenarios that target three failure modes: visual co-location, task-ambiguity overshare, and recipient misalignment.
We evaluate 15 frontier agents and find that 11 leak on more than 50% of scenarios, with an average leakage of 67.9%. The same failures persist when agents act end-to-end in the rendered OpenApps UI.
- Paper: arXiv:2606.23189
- Dataset: huggingface.co/datasets/UKPLab/AgentCIBench
- Leaderboard / project page: ukplab.github.io/arxiv2026-agentcibench
config/ Hydra app, task, agent, and defense configs
data/ Local copy of the benchmark (also mirrored on Hugging Face)
envs/ Scenario-to-OpenApps visual benchmark bridge
eval/ Reasoning and visual benchmark runners
mcts/ Scenario generation engine and CI scoring helpers
scripts/ End-to-end experiment scripts and aggregation tools
src/open_apps/ Local multi-app web environment used by visual runs
tests/ Focused regression tests
docker/ Optional containerised runtime
git clone https://github.com/UKPLab/arxiv2026-agentcibench.git
cd arxiv2026-agentcibench
uv sync
uv run playwright install chromium

Set provider keys (LiteLLM routes most calls):
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
export OPENROUTER_API_KEY=... # or use direct provider keys

docker build -t agentcibench -f docker/Dockerfile .
docker run --rm -it \
-e OPENROUTER_API_KEY="$OPENROUTER_API_KEY" \
-v "$PWD/data:/app/data" \
agentcibench bash

See docker/README.md for docker compose, GPU notes,
and the headed-browser variant used for visual debugging.
The repo ships a local copy of the scenarios under data/. To pull the
canonical version:
uv run python -c "
from datasets import load_dataset
ds = load_dataset('UKPLab/AgentCIBench')
ds['test'].to_json('data/generated_merged.jsonl')
ds['test_e2e'].to_json('data/eval_set_e2e_50.jsonl')
"

uv run python -m eval.run_benchmark \
--generated-dir data/generated_merged \
--results-dir data/results/text_smoke/<model_slug> \
--proxy-model openrouter/openai/gpt-5.4-mini \
--judge-model openrouter/google/gemma-4-31b-it

Full sweep (matches the paper):
USE_OPENROUTER=1 \
MODELS="openai/gpt-5.4 anthropic/claude-sonnet-4.6 deepseek/deepseek-v4-pro" \
scripts/02_text_benchmark.sh data/generated_merged data/results/text

USE_OPENROUTER=1 scripts/05_visual_main.sh data/eval_set_e2e_50 data/results/visual_e2e

USE_OPENROUTER=1 \
MODELS="openai/gpt-5.4-mini deepseek/deepseek-v4-pro" \
scripts/07_text_defenses.sh data/eval_set_defenses data/results/text_defenses

USE_OPENROUTER=1 OUTPUT_DIR=data/generated_new \
RUN_LOG_DIR=data/results/mcts_runs_new \
ITERATIONS=35 NODE_EXPANSION_LIMIT=28 \
scripts/01_generate_scenarios.sh

uv run python -m compileall src envs eval mcts prompts.py
uv run pytest tests/test_mcts_phase_a.py tests/test_prompts.py \
tests/test_proxy_agent.py tests/test_reward_judge.py \
tests/test_visual_benchmark.py

We host a leaderboard at ukplab.github.io/arxiv2026-agentcibench.
To submit your model, open a PR adding a row to leaderboard/models.json with a
link to the per-scenario JSONL output produced by eval.run_benchmark.
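
Before submitting, it can help to sanity-check the per-scenario JSONL by computing an overall leak rate yourself. The sketch below assumes each line is a JSON object with a boolean `leaked` field. That field name, `scenario_id`, and the sample records are hypothetical, so check the actual output of eval.run_benchmark and adjust `leaked_key` to match.

```python
import json
from typing import Iterable


def leak_rate(jsonl_lines: Iterable[str], leaked_key: str = "leaked") -> float:
    """Fraction of scenario records flagged as leaked; blank lines are skipped."""
    total = 0
    leaked = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        total += 1
        # `leaked_key` is an assumed field name; a missing key counts as no leak.
        leaked += bool(record.get(leaked_key, False))
    if total == 0:
        raise ValueError("no scenario records found")
    return leaked / total


# Made-up records for illustration only; they are not benchmark results.
sample = [
    '{"scenario_id": "s1", "leaked": true}',
    '{"scenario_id": "s2", "leaked": false}',
    '{"scenario_id": "s3", "leaked": true}',
    "",
]
print(f"{leak_rate(sample):.1%}")  # 66.7%
```

Since the function accepts any iterable of lines, an open file handle for your results file can be passed in place of `sample`.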
@article{goel2026agentcibench,
title = {Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?},
author = {Goel, Anmol and Gurevych, Iryna},
journal = {arXiv preprint arXiv:2606.23189},
year = {2026}
}
- Code: Apache License 2.0 (LICENSE)
- Data and scenario pool: CC BY 4.0 (see Hugging Face dataset card)
- OpenApps environment assets: included synthetic content, released under CC BY 4.0 alongside the data
AgentCIBench targets privacy-failure behaviour by design. The released scenarios are intended for pre-deployment evaluation, regression testing, and mitigation research, not for soliciting harmful outputs. See the Ethical Considerations section of the paper for the full discussion.
Issues and PRs are welcome. See CONTRIBUTING.md and the
Code of Conduct. For security disclosures, please
email anmol.goel@tu-darmstadt.de rather than opening a public issue.
- Anmol Goel: anmol.goel@tu-darmstadt.de
