
Heretic

Added Feb 23, 2026
Category: Model & Inference Framework
Open Source

Tags: Python · PyTorch · LLM · Multimodal · Transformers · CLI · Model & Inference Framework · Model Training & Inference · Security & Privacy

Fully automatic censorship-removal tool for language models. Uses directional ablation with TPE parameter optimization to remove safety alignment while minimizing residual refusals and preserving the original model's capabilities. Supports dense, multimodal, and MoE architectures.

Heretic - Fully Automatic Censorship Removal for Language Models

Overview

Heretic is an automated censorship removal tool for Transformer-based Large Language Models (LLMs). Its core function is to automatically identify and ablate the "refusal direction" in models, thereby removing safety alignment mechanisms and enabling responses to prompts that would otherwise be refused, without expensive post-training.

Core Features

Automatic Censorship Removal

  • Fully automatic process: no manual configuration or expensive post-training required
  • Based on directional ablation ("abliteration"), following Arditi et al. (2024) and Lai (2025)

TPE Parameter Optimization

  • Automatic search for optimal ablation parameters via Optuna's Tree-structured Parzen Estimator
  • Defaults to 200 optimization trials, with the first 60 used as a random-sampling startup phase

Joint Optimization Objective

  • Simultaneously minimize refusal count and KL divergence from original model
  • Preserve original model intelligence while removing censorship
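To make the joint objective concrete, here is a minimal pure-Python sketch. Plain random search is used as a simple stand-in for Optuna's TPE sampler, and the `evaluate` function, the scalarization in `score`, and all parameter names are hypothetical illustrations, not Heretic's actual internals:

```python
import random

def evaluate(max_weight: float, direction_index: float) -> tuple[int, float]:
    """Hypothetical stand-in for one ablation trial.

    Returns (refusals out of 100, KL divergence from the original model).
    Toy response surface: stronger ablation -> fewer refusals, higher KL.
    """
    refusals = max(0, round(97 * (1.0 - max_weight)))
    kl = 0.5 * max_weight + 0.1 * abs(direction_index - 0.6)
    return refusals, kl

KL_SCALE = 1.0  # mirrors the idea of kl_divergence_scale: trades KL against refusals

def score(refusals: int, kl: float) -> float:
    # One plausible scalarization of the joint objective: minimize refusals
    # while penalizing divergence from the original model's output distribution.
    return refusals + (kl / KL_SCALE) * 100

random.seed(0)
best = None
for trial in range(200):  # 200 trials, matching Heretic's default budget
    params = (random.uniform(0.0, 1.0), random.uniform(0.0, 1.0))
    s = score(*evaluate(*params))
    if best is None or s < best[0]:
        best = (s, params)

print(f"best score {best[0]:.2f} with params {best[1]}")
```

TPE improves on this random loop by fitting density models to past good and bad trials and sampling where the good/bad likelihood ratio is highest; the objective being minimized is the same shape.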

Technical Innovations (vs. existing abliteration systems)

  1. Flexible Ablation Weight Kernel: Highly configurable kernel shape with automatic parameter optimization for better compliance/quality tradeoff
  2. Float Refusal Direction Index: Supports float indices for linear interpolation between two nearest refusal direction vectors
  3. Component-wise Ablation Parameters: Different ablation weights for attention and MLP components (MLP intervention typically more damaging to model)
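The float refusal-direction index (point 2) can be read as linear interpolation between the two nearest per-layer direction vectors. A minimal NumPy sketch, where the direction matrix, shapes, and normalization are illustrative assumptions rather than Heretic's exact code:

```python
import numpy as np

def refusal_direction(directions: np.ndarray, index: float) -> np.ndarray:
    """Interpolate between the two nearest refusal-direction vectors.

    directions: (n_layers, d_model) array of per-layer candidate directions.
    index: float in [0, n_layers - 1]; e.g. 10.3 blends layers 10 and 11.
    """
    lo = int(np.floor(index))
    hi = min(lo + 1, directions.shape[0] - 1)
    frac = index - lo
    blended = (1.0 - frac) * directions[lo] + frac * directions[hi]
    return blended / np.linalg.norm(blended)  # keep it a unit vector

rng = np.random.default_rng(0)
dirs = rng.normal(size=(24, 8))   # toy stand-in for extracted directions
d = refusal_direction(dirs, 10.3)
print(np.linalg.norm(d))          # unit length
```

Treating the index as continuous lets the optimizer search smoothly over candidate directions instead of being restricted to a discrete per-layer choice.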

Supported Model Architectures

| Type | Support Status |
|---|---|
| Dense models | ✅ Most mainstream dense Transformer models |
| Multimodal models | ✅ Multimodal architectures supported |
| MoE architectures | ✅ Various Mixture-of-Experts architectures |
| SSM/hybrid models | ❌ Not supported |
| Models with non-homogeneous layers | ❌ Not supported |
| Some novel attention systems | ❌ Not supported |

Performance

Benchmark comparison using gemma-3-12b-it:

| Model Variant | "Harmful" Prompt Refusals / 100 | KL Divergence ("Harmless" Prompts) |
|---|---|---|
| google/gemma-3-12b-it (original) | 97/100 | 0 (baseline) |
| mlabonne/gemma-3-12b-it-abliterated-v2 | 3/100 | 1.04 |
| huihui-ai/gemma-3-12b-it-abliterated | 3/100 | 0.45 |
| p-e-w/gemma-3-12b-it-heretic | 3/100 | 0.16 |

Over 1,000 model variants have been created and published by the community using Heretic.

Installation & Quick Start

Requirements

  • Python >= 3.10
  • PyTorch >= 2.2
  • GPU recommended (CPU works but less efficient)

Installation

```shell
# Basic installation
pip install -U heretic-llm

# Include research features (visualization, etc.); quoted so the
# brackets survive shells like zsh
pip install -U "heretic-llm[research]"
```

Quick Usage

```shell
# Basic usage - fully automatic, no configuration needed
heretic Qwen/Qwen3-4B-Instruct-2507

# View help
heretic --help

# Evaluate existing model
heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic
```

Research Features (requires the [research] extra)

| Feature | Command | Output |
|---|---|---|
| Residual vector visualization | `--plot-residuals` | PaCMAP projection plots and animated GIF |
| Residual geometry analysis | `--print-residual-geometry` | Detailed metrics table |

Ablation Mechanism

  1. Process each supported transformer component (attention out-projection and MLP down-projection)
  2. Identify associated weight matrices in each transformer layer
  3. Compute refusal direction as mean difference of first-token residuals from example prompts
  4. Orthogonalize each weight matrix against the refusal direction, suppressing that direction's expression in the matrix's outputs
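The orthogonalization step above amounts to projecting the refusal direction out of a weight matrix's output space, W' = (I - d dᵀ) W, so the matrix can no longer write along d into the residual stream. A minimal NumPy illustration, where the shapes and names are assumptions for this sketch:

```python
import numpy as np

def ablate(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component of W's outputs that lies along unit direction d.

    W: (d_model, d_in) matrix whose products W @ x land in the residual
    stream; d: unit vector of shape (d_model,). Equivalent to
    (I - d d^T) @ W without materializing the full projector.
    """
    return W - np.outer(d, d @ W)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
d = rng.normal(size=8)
d /= np.linalg.norm(d)          # projection formula assumes a unit vector

W_abl = ablate(W, d)
x = rng.normal(size=16)
print(abs(d @ (W_abl @ x)))     # ~0: output has no component along d
```

Heretic applies this kind of projection with separately optimized weights per layer and per component (attention out-projection vs. MLP down-projection), rather than the uniform full-strength projection shown here.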

Key Configuration Parameters

| Parameter | Default | Description |
|---|---|---|
| `n_trials` | 200 | Number of ablation trials during optimization |
| `n_startup_trials` | 60 | Number of random-sampling exploration trials |
| `kl_divergence_scale` | 1.0 | Typical KL divergence value used to balance the joint objective |
| `kl_divergence_target` | 0.01 | Target KL divergence threshold |
| `quantization` | `"none"` | Quantization method; `"bnb_4bit"` also available |
| `batch_size` | 0 (auto) | Number of input sequences processed in parallel |

Output Options

  • Save processed model locally
  • Upload model to Hugging Face Hub
  • Interactive chat testing

Typical Processing Time

| Hardware | Model | Processing Time |
|---|---|---|
| RTX 3090, default config | Llama-3.1-8B-Instruct | ~45 minutes |

Version Information

  • PyPI Package: heretic-llm
  • Current Version: 1.2.0
  • License: AGPL-3.0-or-later
  • Author: Philipp Emanuel Weidmann
