An enterprise-oriented benchmark suite for evaluating web agent safety and trustworthiness, featuring 375 tasks across GitLab, SuiteCRM, and ShoppingAdmin with six policy dimensions to measure task completion under compliance constraints. Accepted by ICLR 2025.
ST-WebAgentBench#
Overview#
ST-WebAgentBench is a web agent benchmark focused on Safety & Trustworthiness, addressing the core trust issue in enterprise deployments: agents must complete tasks while operating within organizational policy boundaries.
Core Features#
Task Coverage#
- 375 Enterprise Tasks: GitLab (197), SuiteCRM (170), ShoppingAdmin (8)
- 80 Modal Challenge Tasks: 40 vision-favored + 40 DOM-favored for detecting perception biases
- 3-Tier Difficulty: Easy/Medium/Hard with identical intent but increasing policy load
Six Safety Dimensions#
- Boundary & Scope (959 policies)
- Strict Execution (795)
- User Consent (274)
- Robustness & Security (274)
- Hierarchy Adherence (132)
- Error Handling (118)
Key Metrics#
- CuP (Completion under Policy): Success only when task completed with zero violations
- Risk Ratio: Measures agent behavior risk level
- CR (Completion Rate): Task completion rate
Architecture#
Dual Package Structure#
browsergym/stwebagentbench/: BrowserGym plugin registering Gymnasium environmentsstwebagentbench/: Core implementation (browser env, evaluators, analysis)
Core Components#
custom_env.py: Browser environment with policy enforcementevaluators.py: 9 specialized evaluators (is_ask_the_user, is_url_match, element_action_match, etc.)analyze.py: CR/CuP/Risk Ratio computation and tiered analysispolicy_context.py: Standardized policy prompt formatting
Quick Start#
# Setup environment
uv venv && source .venv/bin/activate
# Install dependencies
uv pip install -e ./browsergym/stwebagentbench
uv pip install playwright==1.52.0
uv run -m playwright install chromium
# Run examples
uv run st_bench_example.py # Default task 47
TASK_ID=235 uv run st_bench_example.py # Specific task
API Usage#
import gymnasium as gym
from browsergym.core.action.highlevel import HighLevelActionSet
from stwebagentbench.policy_context import format_policy_context
import browsergym.stwebagentbench
env = gym.make("browsergym/STWebAgentBenchEnv.235", headless=True)
obs, info = env.reset()
while not done:
policies = format_policy_context(obs.get("policies", []))
# Agent decision logic
action = "finish('Done.')"
obs, reward, terminated, truncated, info = env.step(action)
if "safety_report" in info:
# Handle violation reports
pass
done = terminated or truncated
cup_success = reward == 1.0 and len(violated_policies) == 0
Key Dependencies#
- BrowserGym / Gymnasium: Environment framework
- Playwright: Browser automation
- WebArena AWS AMI: GitLab & ShoppingAdmin backends
- OPENAI_API_KEY: Example LLM agent