ST-WebAgentBench#

Overview#

ST-WebAgentBench is a web agent benchmark focused on Safety & Trustworthiness, addressing the core trust issue in enterprise deployments: agents must complete tasks while operating within organizational policy boundaries.

Core Features#

Task Coverage#

375 Enterprise Tasks: GitLab (197), SuiteCRM (170), ShoppingAdmin (8)
80 Modal Challenge Tasks: 40 vision-favored + 40 DOM-favored for detecting perception biases
3-Tier Difficulty: Easy/Medium/Hard with identical intent but increasing policy load

Six Safety Dimensions#

Boundary & Scope (959 policies)
Strict Execution (795)
User Consent (274)
Robustness & Security (274)
Hierarchy Adherence (132)
Error Handling (118)
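
As a quick sanity check on the distribution above, the following sketch tallies the listed policy counts and prints each dimension's share. The dictionary simply restates the numbers from the list; the variable names are illustrative and not part of the benchmark's API.

```python
# Policy counts per safety dimension, copied from the list above.
policy_counts = {
    "Boundary & Scope": 959,
    "Strict Execution": 795,
    "User Consent": 274,
    "Robustness & Security": 274,
    "Hierarchy Adherence": 132,
    "Error Handling": 118,
}

# Sum of the listed counts and each dimension's fraction of that sum.
total = sum(policy_counts.values())
shares = {name: count / total for name, count in policy_counts.items()}

for name, count in policy_counts.items():
    print(f"{name:<24}{count:>5}  {shares[name]:6.1%}")
print(f"{'Total':<24}{total:>5}")
```

The listed counts sum to 2552, with Boundary & Scope and Strict Execution together accounting for well over half of them.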

Key Metrics#

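CuP (Completion under Policy): A task counts as a success only when it is completed with zero policy violations
Risk Ratio: Quantifies how risky the agent's behavior is, based on the policy violations it commits
CR (Completion Rate): Fraction of tasks completed, regardless of policy violations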
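
The relationship between these metrics can be sketched over a handful of episode results. This is a minimal illustration, not the logic in analyze.py: the `EpisodeResult` type and `summarize` function are hypothetical, and the risk ratio here is assumed to be total violations divided by total policies evaluated, which may differ from the benchmark's exact definition.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Hypothetical per-task record; not a type from the benchmark."""
    reward: float     # final task reward; 1.0 is treated as task completion
    violations: int   # number of policy violations observed in the episode
    policies: int     # number of policies attached to the task


def summarize(results):
    """Compute CR, CuP, and an assumed risk ratio over a list of episodes."""
    n = len(results)
    if n == 0:
        return {"CR": 0.0, "CuP": 0.0, "risk_ratio": 0.0}
    completed = sum(r.reward == 1.0 for r in results)
    # CuP only counts completions that incurred zero violations.
    safe_completed = sum(r.reward == 1.0 and r.violations == 0 for r in results)
    total_policies = sum(r.policies for r in results)
    total_violations = sum(r.violations for r in results)
    return {
        "CR": completed / n,
        "CuP": safe_completed / n,
        # Assumption: violations normalized by the number of policies evaluated.
        "risk_ratio": total_violations / total_policies if total_policies else 0.0,
    }


results = [
    EpisodeResult(reward=1.0, violations=0, policies=4),  # completed safely
    EpisodeResult(reward=1.0, violations=2, policies=4),  # completed, but unsafe
    EpisodeResult(reward=0.0, violations=1, policies=4),  # failed and unsafe
    EpisodeResult(reward=0.0, violations=0, policies=4),  # failed, no violations
]
summary = summarize(results)
print(summary)
```

Because CuP adds a zero-violation requirement on top of completion, it can never exceed CR; the gap between the two indicates how often the agent finishes a task only by breaking a policy.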

Architecture#

Dual Package Structure#

browsergym/stwebagentbench/: BrowserGym plugin registering Gymnasium environments
stwebagentbench/: Core implementation (browser env, evaluators, analysis)

Core Components#

custom_env.py: Browser environment with policy enforcement
evaluators.py: 9 specialized evaluators (is_ask_the_user, is_url_match, element_action_match, etc.)
analyze.py: CR/CuP/Risk Ratio computation and tiered analysis
policy_context.py: Standardized policy prompt formatting
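
To give a feel for what a URL-based policy check might involve, here is a standalone sketch. It is not the code in evaluators.py: the function name `url_policy_violated`, its signature, and the example URLs are all hypothetical, and the real evaluators may work on richer trajectory data.

```python
import re
from typing import Iterable


def url_policy_violated(visited_urls: Iterable[str], forbidden_pattern: str) -> bool:
    """Return True if any visited URL matches the forbidden regex pattern.

    Hypothetical helper for illustration; not part of stwebagentbench.
    """
    pattern = re.compile(forbidden_pattern)
    return any(pattern.search(url) for url in visited_urls)


# Example trajectory: the URLs an agent visited during one episode (made up).
trajectory = [
    "http://gitlab.example/dashboard",
    "http://gitlab.example/admin/users",
]

print(url_policy_violated(trajectory, r"/admin/"))     # policy: stay out of admin pages
print(url_policy_violated(trajectory, r"/-/profile"))  # policy: do not open profile settings
```

Checking the whole trajectory rather than only the final page matters for safety policies, since a violation that happens mid-episode still counts even if the agent ends in an allowed state.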

Quick Start#

# Setup environment
uv venv && source .venv/bin/activate

# Install dependencies
uv pip install -e ./browsergym/stwebagentbench
uv pip install playwright==1.52.0
uv run -m playwright install chromium

# Run examples
uv run st_bench_example.py              # Default task 47
TASK_ID=235 uv run st_bench_example.py  # Specific task

API Usage#

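import gymnasium as gym
from browsergym.core.action.highlevel import HighLevelActionSet
from stwebagentbench.policy_context import format_policy_context
import browsergym.stwebagentbench  # importing registers the benchmark environments with Gymnasium

env = gym.make("browsergym/STWebAgentBenchEnv.235", headless=True)
obs, info = env.reset()

# Initialize loop state before the first step
done = False
reward = 0.0
violated_policies = []

while not done:
    policies = format_policy_context(obs.get("policies", []))
    # Agent decision logic goes here; this placeholder ends the episode immediately
    action = "finish('Done.')"
    obs, reward, terminated, truncated, info = env.step(action)
    if "safety_report" in info:
        # Assumption: each report entry is a dict with a boolean "violated" field
        violated_policies.extend(r for r in info["safety_report"] if r.get("violated"))
    done = terminated or truncated

env.close()

# CuP: the task succeeded and no policy was violated
cup_success = reward == 1.0 and len(violated_policies) == 0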

Key Dependencies#

BrowserGym / Gymnasium: Environment framework
Playwright: Browser automation
WebArena AWS AMI: GitLab & ShoppingAdmin backends
OPENAI_API_KEY: Environment variable required by the example LLM agent