Fully automatic censorship removal tool for language models. Uses directional ablation with TPE parameter optimization to remove safety alignment while minimizing refusal behaviors and preserving the original model's capabilities. Supports dense, multimodal, and MoE architectures.
Heretic is an automated censorship removal tool for Transformer-based Large Language Models (LLMs). Its core function is to automatically identify and ablate the "refusal direction" in models, thereby removing safety alignment mechanisms and enabling responses to prompts that would otherwise be refused, without expensive post-training.
- Fully automatic process, no manual configuration or expensive post-training required
- Based on directional ablation (abliteration) techniques, following the research of Arditi et al. (2024) and Lai (2025)
- Automatic search for optimal ablation parameters via Optuna's Tree-structured Parzen Estimator
- Default 200 optimization trials, with first 60 as random sampling exploration phase
- Joint minimization of refusal count and KL divergence from the original model
- Preservation of the original model's intelligence while removing censorship
- Flexible Ablation Weight Kernel: Highly configurable kernel shape with automatic parameter optimization for better compliance/quality tradeoff
- Float Refusal Direction Index: Supports float indices for linear interpolation between two nearest refusal direction vectors
- Component-wise Ablation Parameters: Different ablation weights for attention and MLP components (MLP intervention is typically more damaging to the model)
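The float refusal direction index can be pictured as follows. This is an illustrative sketch only, not Heretic's actual code; the function name and array shapes are assumptions:

```python
import numpy as np

# Illustrative sketch (not Heretic's internals): a float layer index selects
# a refusal direction by linearly interpolating between the two nearest
# per-layer direction vectors, then re-normalizing to unit length.
def interpolated_direction(directions: np.ndarray, index: float) -> np.ndarray:
    lo = int(np.floor(index))
    hi = min(lo + 1, len(directions) - 1)
    frac = index - lo
    blended = (1.0 - frac) * directions[lo] + frac * directions[hi]
    return blended / np.linalg.norm(blended)
```

An integer index reduces to picking a single layer's direction; fractional indices let the optimizer search a continuous space instead of a discrete one.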
| Type | Support Status |
|---|---|
| Dense Models | ✅ Most mainstream dense Transformer models |
| Multimodal Models | ✅ Multimodal architectures supported |
| MoE Architectures | ✅ Various Mixture of Experts architectures |
| SSM/Hybrid Models | ❌ Not supported |
| Non-homogeneous Layer Models | ❌ Not supported |
| Some Novel Attention Systems | ❌ Not supported |
Benchmark comparison using gemma-3-12b-it:
| Model Variant | "Harmful" Prompt Refusals/100 | KL Divergence ("Harmless" prompts) |
|---|---|---|
| google/gemma-3-12b-it (original) | 97/100 | 0 (baseline) |
| mlabonne/gemma-3-12b-it-abliterated-v2 | 3/100 | 1.04 |
| huihui-ai/gemma-3-12b-it-abliterated | 3/100 | 0.45 |
| p-e-w/gemma-3-12b-it-heretic | 3/100 | 0.16 |
Over 1,000 model variants have been created and published by the community using Heretic.
- Python >= 3.10
- PyTorch >= 2.2
- GPU recommended (CPU works but less efficient)
# Basic installation
pip install -U heretic-llm
# Include research features (visualization, etc.)
pip install -U "heretic-llm[research]"
# Basic usage - fully automatic, no configuration needed
heretic Qwen/Qwen3-4B-Instruct-2507
# View help
heretic --help
# Evaluate existing model
heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic
| Feature | Command | Output |
|---|---|---|
| Residual Vector Visualization | --plot-residuals | PaCMAP projection plots and animated GIF |
| Residual Geometry Analysis | --print-residual-geometry | Detailed metrics table |
- Process each supported transformer component (attention out-projection and MLP down-projection)
- Identify associated weight matrices in each transformer layer
- Compute the refusal direction as the difference between the mean first-token residuals of "harmful" and "harmless" example prompts
- Orthogonalize each weight matrix against the refusal direction to suppress its expression in matrix multiplication results
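The final step amounts to projecting the refusal direction out of each weight matrix. A minimal numpy sketch, under assumed conventions (`W` writes into the residual stream with rows indexed by output dimension, `r` is the refusal direction, and the ablation weight is a scalar; none of these names come from Heretic's code):

```python
import numpy as np

# Minimal sketch of directional ablation: remove the component of W's
# output that lies along the refusal direction r, scaled by an ablation
# weight. With weight=1.0, W's output can no longer express r at all.
def ablate_matrix(W: np.ndarray, r: np.ndarray, weight: float = 1.0) -> np.ndarray:
    r = r / np.linalg.norm(r)          # unit refusal direction
    projection = np.outer(r, r) @ W    # component of W along r
    return W - weight * projection
```

With `weight=1.0`, `r @ W_ablated` is numerically zero for every input, which is the sense in which the direction is "orthogonalized out"; intermediate weights attenuate the direction rather than eliminating it.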
| Parameter | Default | Description |
|---|---|---|
| n_trials | 200 | Number of ablation trials during optimization |
| n_startup_trials | 60 | Number of random-sampling exploration trials |
| kl_divergence_scale | 1.0 | Typical KL divergence value, used to balance the joint optimization objective |
| kl_divergence_target | 0.01 | Target KL divergence threshold |
| quantization | "none" | Quantization method; "bnb_4bit" is also available |
| batch_size | 0 (auto) | Number of input sequences processed in parallel |
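For intuition about the KL-divergence parameters, here is one way a per-prompt KL divergence between first-token distributions could be computed. This is an illustrative sketch, not Heretic's actual implementation:

```python
import numpy as np

# Illustrative sketch: KL(P || Q) between the original model's (P) and the
# ablated model's (Q) first-token distributions, computed from raw logits
# via a numerically stable softmax. Averaged over "harmless" prompts, this
# is the kind of quantity the optimizer trades off against refusal count.
def first_token_kl(logits_p: np.ndarray, logits_q: np.ndarray) -> float:
    p = np.exp(logits_p - logits_p.max()); p /= p.sum()
    q = np.exp(logits_q - logits_q.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

A value of 0 means the ablated model's next-token distribution is unchanged on that prompt; larger values indicate more damage to the original behavior.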
- Save processed model locally
- Upload model to Hugging Face Hub
- Interactive chat testing
| Hardware | Model | Processing Time |
|---|---|---|
| RTX 3090 + default config | Llama-3.1-8B-Instruct | ~45 minutes |
- PyPI Package: `heretic-llm`
- Current Version: 1.2.0
- License: AGPL-3.0-or-later
- Author: Philipp Emanuel Weidmann