ST-WebAgentBench
✨ An enterprise-oriented benchmark suite for evaluating the safety and trustworthiness of web agents. It features 375 tasks across GitLab, SuiteCRM, and ShoppingAdmin, with six policy dimensions for measuring task completion under compliance constraints. Accepted at ICLR 2025.
Python, Docker, Large Language Models