Open-source platform for testing LLM and agentic apps with AI-powered test generation, adversarial red-teaming, and 60+ evaluation metrics for RAG hallucination detection and conversation consistency verification.

Overview#

Rhesis is an open-source LLM and agentic application testing platform developed by Rhesis AI GmbH (Potsdam, Germany). It addresses the challenges of behavioral uncertainty, security vulnerabilities (such as prompt injection, PII leakage), and the lack of systematic testing methods in LLM applications and Agentic systems during development and deployment.

Core Capabilities#

Test Generation#

AI-Powered Synthesis: Automatically generate hundreds of test scenarios based on natural language requirements, covering edge cases and adversarial prompts
Knowledge-Aware: Optimize test generation by connecting context sources (Notion, GitHub, Jira, Confluence) via file upload or MCP
Support for single-turn testing and complex conversation flow simulation

Adversarial Testing (Red-Teaming)#

Polyphemus Agent actively discovers security vulnerabilities:

Jailbreak attempts and prompt injection detection
PII leakage and data extraction testing
Harmful content generation detection
Role violation and instruction bypass testing
Built-in Garak LLM vulnerability scanner integration

Conversation Simulation#

Penelope Agent simulates real user conversations to test:

Context retention capabilities
Role consistency
Conversation coherence

Evaluation Metrics System (60+ Pre-built Metrics)#

RAGAS: Context relevance, faithfulness, answer accuracy
DeepEval: Bias, toxicity, PII leakage, role violations, turn relevance, knowledge retention
Garak: Jailbreak detection, prompt injection, XSS, malware generation, data leakage
Custom: NumericJudge, CategoricalJudge for domain-specific evaluation
All metrics include LLM-as-Judge reasoning explanations

Observability & Tracing#

OpenTelemetry-based tracing system
Monitor LLM calls, latency, token usage
Correlate traces with test results for debugging

Architecture Design#

Frontend: TypeScript/React application (in apps/ directory)
Backend: Python services (in apps/ and sdk/ directories)
Data Layer: Supports Docker local database deployment
Infrastructure: Terraform configuration provided, supports Kubernetes deployment

Core Components#

Polyphemus Agent: Adversarial testing agent
Penelope Agent: Conversation simulation agent
Synthesizers: Test scenario generators (e.g., PromptSynthesizer)
Metrics Framework: Evaluation metric execution engine
Tracing Module: OpenTelemetry tracing module

Test Lifecycle#

Projects: Configure AI application, upload context sources, set up SDK connectors
Requirements: Define expected behaviors (should and should not do)
Metrics: Select pre-built metrics or create custom evaluations
Tests: Generate single-turn and conversation simulation test scenarios
Execution: Run tests via UI/SDK/API, integrate with CI/CD
Collaboration: Team collaboration through comments, tasks, workflows

pip install rhesis-sdk

Docker Local Deployment (Zero Configuration)#

git clone https://github.com/rhesis-ai/rhesis.git
cd rhesis
./rh start

Access: http://localhost:3000 (auto login)
API Docs: http://localhost:8080/docs
Management commands: ./rh logs / ./rh stop / ./rh restart

SDK Usage Examples#

Endpoint Decorator#

from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    # Your LLM logic here
    return response

Observability Tracing#

from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Your LLM call here
    return response

Test Generator#

from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)

Use Cases#

Conversational AI: Dialogue simulation, role consistency, and knowledge retention testing
RAG Systems: Context relevance, faithfulness, and hallucination detection
NL-to-SQL / NL-to-Code: Query accuracy, syntax validation, and edge case handling
Agentic Systems: Tool selection, goal achievement, and multi-agent coordination

Target Users#

Development Engineers: Code-first SDK integration
Product Managers: Define requirements and expected behaviors
Domain Experts: Review test results
Legal and Compliance Teams: Ensure compliance requirements

Model & Framework Integration#

Cloud Services: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI

Local/Self-hosted: Ollama, vLLM, LiteLLM

Framework Integration: LangChain, LangGraph, AutoGen, LiteLLM, Google Gemini, Ollama, OpenRouter, Vertex AI, HuggingFace, REST API

Project Activity#

3,805 commits
86 releases
18+ contributors
Primary languages: Python (67.7%), TypeScript (27.6%)
MIT License (core features)

Rhesis