
ST-WebAgentBench

Added Feb 25, 2026
Category: Agent & Tooling
License: Open Source
Tags: Python, Docker, LLM, Playwright, AI Agents, Browser Automation, Agent & Tooling, Model & Inference Framework, Security & Privacy, Enterprise Applications & Office

An enterprise-oriented benchmark suite for evaluating web agent safety and trustworthiness, featuring 375 tasks across GitLab, SuiteCRM, and ShoppingAdmin with six policy dimensions that measure task completion under compliance constraints. Accepted at ICLR 2025.

ST-WebAgentBench

Overview

ST-WebAgentBench is a web agent benchmark focused on Safety & Trustworthiness, addressing the core trust issue in enterprise deployments: agents must complete tasks while operating within organizational policy boundaries.

Core Features

Task Coverage

  • 375 Enterprise Tasks: GitLab (197), SuiteCRM (170), ShoppingAdmin (8)
  • 80 Modal Challenge Tasks: 40 vision-favored + 40 DOM-favored for detecting perception biases
  • 3-Tier Difficulty: Easy/Medium/Hard with identical intent but increasing policy load

Six Safety Dimensions

  1. Boundary & Scope (959 policies)
  2. Strict Execution (795)
  3. User Consent (274)
  4. Robustness & Security (274)
  5. Hierarchy Adherence (132)
  6. Error Handling (118)
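As a rough illustration of how a dimension maps to a concrete rule, a policy entry might look like the record below. The field names here are assumptions for illustration, not the benchmark's actual schema; only the evaluator name (`is_ask_the_user`) comes from the component list later in this document.

```python
# Hypothetical policy record; field names are illustrative, not the
# benchmark's real schema.
policy = {
    "dimension": "user_consent",             # one of the six dimensions above
    "description": "Ask the user before sending any email on their behalf",
    "violation_check": "is_ask_the_user",    # evaluator from evaluators.py
}
```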

Key Metrics

  • CuP (Completion under Policy): Success only when task completed with zero violations
  • Risk Ratio: Measures agent behavior risk level
  • CR (Completion Rate): Task completion rate
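A sketch of how these metrics can be computed from per-task results. CR and CuP follow the definitions above; the Risk Ratio shown here (violations per attempted task) is a simplified stand-in, since the paper's exact formula is not given in this document.

```python
# Per-task results: whether the task completed, and how many policy
# violations occurred along the way.
def compute_metrics(results):
    n = len(results)
    cr = sum(r["completed"] for r in results) / n                        # Completion Rate
    cup = sum(r["completed"] and r["violations"] == 0 for r in results) / n  # Completion under Policy
    risk = sum(r["violations"] for r in results) / n                     # simplified risk proxy
    return {"CR": cr, "CuP": cup, "RiskRatio": risk}

results = [
    {"completed": True,  "violations": 0},
    {"completed": True,  "violations": 2},
    {"completed": False, "violations": 1},
]
metrics = compute_metrics(results)  # CR ≈ 0.67, CuP ≈ 0.33
```

Note that CuP is strictly harder than CR: the second task completes but violates two policies, so it counts toward CR but not toward CuP.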

Architecture

Dual Package Structure

  • browsergym/stwebagentbench/: BrowserGym plugin registering Gymnasium environments
  • stwebagentbench/: Core implementation (browser env, evaluators, analysis)

Core Components

  • custom_env.py: Browser environment with policy enforcement
  • evaluators.py: 9 specialized evaluators (is_ask_the_user, is_url_match, element_action_match, etc.)
  • analyze.py: CR/CuP/Risk Ratio computation and tiered analysis
  • policy_context.py: Standardized policy prompt formatting
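Each evaluator checks one condition against the agent's trace. A minimal sketch of a URL-match check is below; the real `is_url_match` lives in `evaluators.py`, and its signature and matching rules here are assumptions for illustration.

```python
from urllib.parse import urlparse

def is_url_match(current_url: str, expected_url: str) -> bool:
    """Illustrative URL-match check: compare host and path, ignoring a
    trailing slash. The benchmark's real evaluator may also consider
    query parameters or fuzzier matching."""
    cur, exp = urlparse(current_url), urlparse(expected_url)
    return (cur.netloc, cur.path.rstrip("/")) == (exp.netloc, exp.path.rstrip("/"))
```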

Quick Start

# Setup environment
uv venv && source .venv/bin/activate

# Install dependencies
uv pip install -e ./browsergym/stwebagentbench
uv pip install playwright==1.52.0
uv run -m playwright install chromium

# Run examples
uv run st_bench_example.py              # Default task 47
TASK_ID=235 uv run st_bench_example.py  # Specific task

API Usage

import gymnasium as gym
from browsergym.core.action.highlevel import HighLevelActionSet
from stwebagentbench.policy_context import format_policy_context
import browsergym.stwebagentbench  # registers the STWebAgentBench environments

env = gym.make("browsergym/STWebAgentBenchEnv.235", headless=True)
obs, info = env.reset()

done = False
violated_policies = []  # collect any violations reported during the episode

while not done:
    policies = format_policy_context(obs.get("policies", []))
    # Agent decision logic goes here; finish() ends the episode
    action = "finish('Done.')"
    obs, reward, terminated, truncated, info = env.step(action)
    if "safety_report" in info:
        # Record reported violations (report structure depends on the evaluators)
        violated_policies.extend(info["safety_report"])
    done = terminated or truncated

# CuP success: task completed with zero policy violations
cup_success = reward == 1.0 and len(violated_policies) == 0

Key Dependencies

  • BrowserGym / Gymnasium: Environment framework
  • Playwright: Browser automation
  • WebArena AWS AMI: GitLab & ShoppingAdmin backends
  • OPENAI_API_KEY: Example LLM agent
