Open-source platform for testing LLM and agentic apps with AI-powered test generation, adversarial red-teaming, and 60+ evaluation metrics for RAG hallucination detection and conversation consistency verification.
Overview#
Rhesis is an open-source LLM and agentic application testing platform developed by Rhesis AI GmbH (Potsdam, Germany). It addresses the challenges of behavioral uncertainty, security vulnerabilities (such as prompt injection, PII leakage), and the lack of systematic testing methods in LLM applications and Agentic systems during development and deployment.
Core Capabilities#
Test Generation#
- AI-Powered Synthesis: Automatically generate hundreds of test scenarios based on natural language requirements, covering edge cases and adversarial prompts
- Knowledge-Aware: Optimize test generation by connecting context sources (Notion, GitHub, Jira, Confluence) via file upload or MCP
- Support for single-turn testing and complex conversation flow simulation
Adversarial Testing (Red-Teaming)#
Polyphemus Agent actively discovers security vulnerabilities:
- Jailbreak attempts and prompt injection detection
- PII leakage and data extraction testing
- Harmful content generation detection
- Role violation and instruction bypass testing
- Built-in Garak LLM vulnerability scanner integration
Conversation Simulation#
Penelope Agent simulates real user conversations to test:
- Context retention capabilities
- Role consistency
- Conversation coherence
Evaluation Metrics System (60+ Pre-built Metrics)#
- RAGAS: Context relevance, faithfulness, answer accuracy
- DeepEval: Bias, toxicity, PII leakage, role violations, turn relevance, knowledge retention
- Garak: Jailbreak detection, prompt injection, XSS, malware generation, data leakage
- Custom: NumericJudge, CategoricalJudge for domain-specific evaluation
- All metrics include LLM-as-Judge reasoning explanations
Observability & Tracing#
- OpenTelemetry-based tracing system
- Monitor LLM calls, latency, token usage
- Correlate traces with test results for debugging
Architecture Design#
- Frontend: TypeScript/React application (in
apps/directory) - Backend: Python services (in
apps/andsdk/directories) - Data Layer: Supports Docker local database deployment
- Infrastructure: Terraform configuration provided, supports Kubernetes deployment
Core Components#
- Polyphemus Agent: Adversarial testing agent
- Penelope Agent: Conversation simulation agent
- Synthesizers: Test scenario generators (e.g., PromptSynthesizer)
- Metrics Framework: Evaluation metric execution engine
- Tracing Module: OpenTelemetry tracing module
Test Lifecycle#
- Projects: Configure AI application, upload context sources, set up SDK connectors
- Requirements: Define expected behaviors (should and should not do)
- Metrics: Select pre-built metrics or create custom evaluations
- Tests: Generate single-turn and conversation simulation test scenarios
- Execution: Run tests via UI/SDK/API, integrate with CI/CD
- Collaboration: Team collaboration through comments, tasks, workflows
Installation & Deployment#
Cloud Platform (Fastest)#
Visit https://app.rhesis.ai to create a free account
Python SDK#
pip install rhesis-sdk
Docker Local Deployment (Zero Configuration)#
git clone https://github.com/rhesis-ai/rhesis.git
cd rhesis
./rh start
- Access: http://localhost:3000 (auto login)
- API Docs: http://localhost:8080/docs
- Management commands:
./rh logs/./rh stop/./rh restart
SDK Usage Examples#
Endpoint Decorator#
from rhesis.sdk.decorators import endpoint
@endpoint(name="my-chatbot")
def chat(message: str) -> str:
# Your LLM logic here
return response
Observability Tracing#
from rhesis.sdk.decorators import observe
@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
# Your LLM call here
return response
Test Generator#
from rhesis.sdk.synthesizers import PromptSynthesizer
synthesizer = PromptSynthesizer(
prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)
Use Cases#
- Conversational AI: Dialogue simulation, role consistency, and knowledge retention testing
- RAG Systems: Context relevance, faithfulness, and hallucination detection
- NL-to-SQL / NL-to-Code: Query accuracy, syntax validation, and edge case handling
- Agentic Systems: Tool selection, goal achievement, and multi-agent coordination
Target Users#
- Development Engineers: Code-first SDK integration
- Product Managers: Define requirements and expected behaviors
- Domain Experts: Review test results
- Legal and Compliance Teams: Ensure compliance requirements
Model & Framework Integration#
Cloud Services: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI
Local/Self-hosted: Ollama, vLLM, LiteLLM
Framework Integration: LangChain, LangGraph, AutoGen, LiteLLM, Google Gemini, Ollama, OpenRouter, Vertex AI, HuggingFace, REST API
Project Activity#
- 3,805 commits
- 86 releases
- 18+ contributors
- Primary languages: Python (67.7%), TypeScript (27.6%)
- MIT License (core features)