Fully automatic censorship removal tool for language models. Uses directional ablation with TPE parameter optimization to remove safety alignment while minimizing refusal behaviors and preserving the original model's capabilities. Supports dense, multimodal, and MoE architectures.
Heretic is an automated censorship removal tool for Transformer-based Large Language Models (LLMs). Its core function is to automatically identify and ablate the "refusal direction" in models, thereby removing safety alignment mechanisms and enabling responses to prompts that would otherwise be refused, without expensive post-training.
- Fully automatic process, no manual configuration or expensive post-training required
- Based on directional ablation (abliteration) techniques, following the research of Arditi et al. (2024) and Lai (2025)
- Automatic search for optimal ablation parameters via Optuna's Tree-structured Parzen Estimator
- Default 200 optimization trials, with first 60 as random sampling exploration phase
- Joint minimization of refusal count and KL divergence from the original model
- Preservation of the original model's intelligence while removing censorship
- Flexible Ablation Weight Kernel: Highly configurable kernel shape with automatic parameter optimization for better compliance/quality tradeoff
- Float Refusal Direction Index: Supports float indices for linear interpolation between two nearest refusal direction vectors
- Component-wise Ablation Parameters: Different ablation weights for attention and MLP components (MLP intervention is typically more damaging to the model)
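The float refusal direction index can be pictured as follows. This is an illustrative sketch only, not Heretic's actual code; the function name and array shapes are assumptions:

```python
import numpy as np

# Illustrative sketch (not Heretic's internals): a float layer index selects
# a refusal direction by linearly interpolating between the two nearest
# per-layer direction vectors, then re-normalizing to unit length.
def interpolated_direction(directions: np.ndarray, index: float) -> np.ndarray:
    lo = int(np.floor(index))
    hi = min(lo + 1, len(directions) - 1)
    frac = index - lo
    blended = (1.0 - frac) * directions[lo] + frac * directions[hi]
    return blended / np.linalg.norm(blended)
```

An integer index reduces to picking a single layer's direction; fractional indices let the optimizer search a continuous space instead of a discrete one.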
| Type | Support Status |
|---|---|
| Dense Models | ✅ Most mainstream dense Transformer models |
| Multimodal Models | ✅ Multimodal architectures supported |
| MoE Architectures | ✅ Various Mixture of Experts architectures |
| SSM/Hybrid Models | ❌ Not supported |
| Non-homogeneous Layer Models | ❌ Not supported |
| Some Novel Attention Systems | ❌ Not supported |
Benchmark comparison using gemma-3-12b-it:
| Model Variant | "Harmful" Prompt Refusals/100 | KL Divergence ("Harmless" prompts) |
|---|---|---|
| google/gemma-3-12b-it (original) | 97/100 | 0 (baseline) |
| mlabonne/gemma-3-12b-it-abliterated-v2 | 3/100 | 1.04 |
| huihui-ai/gemma-3-12b-it-abliterated | 3/100 | 0.45 |
| p-e-w/gemma-3-12b-it-heretic | 3/100 | 0.16 |
Over 1,000 model variants have been created and published by the community using Heretic.
- Python >= 3.10
- PyTorch >= 2.2
- GPU recommended (CPU works but less efficient)
# Basic installation
pip install -U heretic-llm
# Include research features (visualization, etc.)
pip install -U "heretic-llm[research]"
# Basic usage - fully automatic, no configuration needed
heretic Qwen/Qwen3-4B-Instruct-2507
# View help
heretic --help
# Evaluate existing model
heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic
| Feature | Command | Output |
|---|---|---|
| Residual Vector Visualization | --plot-residuals | PaCMAP projection plots and animated GIF |
| Residual Geometry Analysis | --print-residual-geometry | Detailed metrics table |
- Process each supported transformer component (attention out-projection and MLP down-projection)
- Identify associated weight matrices in each transformer layer
- Compute the refusal direction as the difference between the mean first-token residuals of "harmful" and "harmless" example prompts
- Orthogonalize each weight matrix against the refusal direction to suppress its expression in matrix multiplication results
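The final step amounts to projecting the refusal direction out of each weight matrix. A minimal numpy sketch, under assumed conventions (`W` writes into the residual stream with rows indexed by output dimension, `r` is the refusal direction, and the ablation weight is a scalar; none of these names come from Heretic's code):

```python
import numpy as np

# Minimal sketch of directional ablation: remove the component of W's
# output that lies along the refusal direction r, scaled by an ablation
# weight. With weight=1.0, W's output can no longer express r at all.
def ablate_matrix(W: np.ndarray, r: np.ndarray, weight: float = 1.0) -> np.ndarray:
    r = r / np.linalg.norm(r)          # unit refusal direction
    projection = np.outer(r, r) @ W    # component of W along r
    return W - weight * projection
```

With `weight=1.0`, `r @ W_ablated` is numerically zero for every input, which is the sense in which the direction is "orthogonalized out"; intermediate weights attenuate the direction rather than eliminating it.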
| Parameter | Default | Description |
|---|---|---|
| n_trials | 200 | Number of ablation trials during optimization |
| n_startup_trials | 60 | Number of random-sampling exploration trials |
| kl_divergence_scale | 1.0 | Typical KL divergence value, used to balance the joint optimization objective |
| kl_divergence_target | 0.01 | Target KL divergence threshold |
| quantization | "none" | Quantization method; "bnb_4bit" is also available |
| batch_size | 0 (auto) | Number of input sequences processed in parallel |
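For intuition about the KL-divergence parameters, here is one way a per-prompt KL divergence between first-token distributions could be computed. This is an illustrative sketch, not Heretic's actual implementation:

```python
import numpy as np

# Illustrative sketch: KL(P || Q) between the original model's (P) and the
# ablated model's (Q) first-token distributions, computed from raw logits
# via a numerically stable softmax. Averaged over "harmless" prompts, this
# is the kind of quantity the optimizer trades off against refusal count.
def first_token_kl(logits_p: np.ndarray, logits_q: np.ndarray) -> float:
    p = np.exp(logits_p - logits_p.max()); p /= p.sum()
    q = np.exp(logits_q - logits_q.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

A value of 0 means the ablated model's next-token distribution is unchanged on that prompt; larger values indicate more damage to the original behavior.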
- Save processed model locally
- Upload model to Hugging Face Hub
- Interactive chat testing
| Hardware | Model | Processing Time |
|---|---|---|
| RTX 3090 + default config | Llama-3.1-8B-Instruct | ~45 minutes |
- PyPI Package: `heretic-llm`
- Current Version: 1.2.0
- License: AGPL-3.0-or-later
- Author: Philipp Emanuel Weidmann