DISCOVER THE FUTURE OF AI AGENTSarrow_forward

Rhesis

calendar_todayAdded Feb 25, 2026
categoryAgent & Tooling
codeOpen Source
PythonDocker大语言模型FastAPIRAGAI AgentsWeb ApplicationSDKCLIAgent & ToolingModel & Inference FrameworkDeveloper Tools & CodingKnowledge Management, Retrieval & RAGSecurity & Privacy

Open-source platform for testing LLM and agentic apps with AI-powered test generation, adversarial red-teaming, and 60+ evaluation metrics for RAG hallucination detection and conversation consistency verification.

Overview#

Rhesis is an open-source LLM and agentic application testing platform developed by Rhesis AI GmbH (Potsdam, Germany). It addresses the challenges of behavioral uncertainty, security vulnerabilities (such as prompt injection, PII leakage), and the lack of systematic testing methods in LLM applications and Agentic systems during development and deployment.

Core Capabilities#

Test Generation#

  • AI-Powered Synthesis: Automatically generate hundreds of test scenarios based on natural language requirements, covering edge cases and adversarial prompts
  • Knowledge-Aware: Optimize test generation by connecting context sources (Notion, GitHub, Jira, Confluence) via file upload or MCP
  • Support for single-turn testing and complex conversation flow simulation

Adversarial Testing (Red-Teaming)#

Polyphemus Agent actively discovers security vulnerabilities:

  • Jailbreak attempts and prompt injection detection
  • PII leakage and data extraction testing
  • Harmful content generation detection
  • Role violation and instruction bypass testing
  • Built-in Garak LLM vulnerability scanner integration

Conversation Simulation#

Penelope Agent simulates real user conversations to test:

  • Context retention capabilities
  • Role consistency
  • Conversation coherence

Evaluation Metrics System (60+ Pre-built Metrics)#

  • RAGAS: Context relevance, faithfulness, answer accuracy
  • DeepEval: Bias, toxicity, PII leakage, role violations, turn relevance, knowledge retention
  • Garak: Jailbreak detection, prompt injection, XSS, malware generation, data leakage
  • Custom: NumericJudge, CategoricalJudge for domain-specific evaluation
  • All metrics include LLM-as-Judge reasoning explanations

Observability & Tracing#

  • OpenTelemetry-based tracing system
  • Monitor LLM calls, latency, token usage
  • Correlate traces with test results for debugging

Architecture Design#

  • Frontend: TypeScript/React application (in apps/ directory)
  • Backend: Python services (in apps/ and sdk/ directories)
  • Data Layer: Supports Docker local database deployment
  • Infrastructure: Terraform configuration provided, supports Kubernetes deployment

Core Components#

  1. Polyphemus Agent: Adversarial testing agent
  2. Penelope Agent: Conversation simulation agent
  3. Synthesizers: Test scenario generators (e.g., PromptSynthesizer)
  4. Metrics Framework: Evaluation metric execution engine
  5. Tracing Module: OpenTelemetry tracing module

Test Lifecycle#

  1. Projects: Configure AI application, upload context sources, set up SDK connectors
  2. Requirements: Define expected behaviors (should and should not do)
  3. Metrics: Select pre-built metrics or create custom evaluations
  4. Tests: Generate single-turn and conversation simulation test scenarios
  5. Execution: Run tests via UI/SDK/API, integrate with CI/CD
  6. Collaboration: Team collaboration through comments, tasks, workflows

Installation & Deployment#

Cloud Platform (Fastest)#

Visit https://app.rhesis.ai to create a free account

Python SDK#

pip install rhesis-sdk

Docker Local Deployment (Zero Configuration)#

git clone https://github.com/rhesis-ai/rhesis.git
cd rhesis
./rh start

SDK Usage Examples#

Endpoint Decorator#

from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    # Your LLM logic here
    return response

Observability Tracing#

from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Your LLM call here
    return response

Test Generator#

from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)

Use Cases#

  • Conversational AI: Dialogue simulation, role consistency, and knowledge retention testing
  • RAG Systems: Context relevance, faithfulness, and hallucination detection
  • NL-to-SQL / NL-to-Code: Query accuracy, syntax validation, and edge case handling
  • Agentic Systems: Tool selection, goal achievement, and multi-agent coordination

Target Users#

  • Development Engineers: Code-first SDK integration
  • Product Managers: Define requirements and expected behaviors
  • Domain Experts: Review test results
  • Legal and Compliance Teams: Ensure compliance requirements

Model & Framework Integration#

Cloud Services: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI

Local/Self-hosted: Ollama, vLLM, LiteLLM

Framework Integration: LangChain, LangGraph, AutoGen, LiteLLM, Google Gemini, Ollama, OpenRouter, Vertex AI, HuggingFace, REST API

Project Activity#

  • 3,805 commits
  • 86 releases
  • 18+ contributors
  • Primary languages: Python (67.7%), TypeScript (27.6%)
  • MIT License (core features)

Related Projects

View All arrow_forward

STAY UPDATED

Get the latest AI tools and trends delivered straight to your inbox. No spam, just intelligence.

rocket_launch