A persistent memory system for LLMs inspired by Complementary Learning Systems theory, using MEMIT weight editing for wake-phase instant memory and LoRA training during sleep cycles for knowledge consolidation.
Sleeping LLM implements a "wake-sleep" dual-phase memory architecture. During the wake phase, MEMIT injects conversational facts directly into the model's MLP-layer weights, with no retrieval step or external database, serving as short-term memory. During the sleep phase, an 8-step maintenance and consolidation pipeline runs: it audits degraded edits, applies null-space-constrained repairs, and performs LoRA training with fusion, transferring knowledge from MEMIT edits into LoRA long-term memory. Each fact is tracked independently via per-fact gating across consolidation stages 0–3, with its MEMIT scaling stepping down through [1.0, 0.5, 0.1, 0.0] until the edit is fully dissolved and its capacity released.
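The per-fact gating above can be sketched as a small state machine. This is a hypothetical illustration: the `FactRecord` fields and `advance` hook are assumptions, and only the stage-to-scale schedule [1.0, 0.5, 0.1, 0.0] comes from the description.

```python
from dataclasses import dataclass

# MEMIT scaling applied at each consolidation stage 0-3 (from the README).
MEMIT_SCALE = [1.0, 0.5, 0.1, 0.0]

@dataclass
class FactRecord:
    """Hypothetical per-fact tracking record; field names are illustrative."""
    subject: str
    relation: str
    target: str
    stage: int = 0  # 0 = freshly injected via MEMIT, 3 = fully in LoRA

    @property
    def memit_scale(self) -> float:
        # How strongly the original MEMIT edit still contributes.
        return MEMIT_SCALE[self.stage]

    def advance(self) -> None:
        # Promote one consolidation stage after a successful sleep cycle.
        if self.stage < len(MEMIT_SCALE) - 1:
            self.stage += 1

fact = FactRecord("Alice", "lives_in", "Paris")
for _ in range(3):
    fact.advance()
assert fact.memit_scale == 0.0  # MEMIT edit dissolved; capacity released
```

A fully consolidated fact (stage 3) costs no MEMIT capacity, which is what lets the wake phase keep accepting new edits across sessions.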
The system supports two backends, Apple Silicon (MLX) and NVIDIA GPU (PyTorch + PEFT), validated on Llama-3.2-3B, Llama-3.1-8B, and Llama-3.1-70B. A 3B model with 15 facts runs on a MacBook Air M3 with 8 GB of RAM, with each sleep cycle taking ~5 minutes.
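A backend split like this is typically resolved at startup. The sketch below is an assumption about how such a switch might look; `pick_backend` and its return strings are illustrative names, not the project's actual API.

```python
import platform

def pick_backend() -> str:
    """Choose between the two backends described above (illustrative only)."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"        # Apple Silicon path
    return "torch-peft"     # NVIDIA GPU path (PyTorch + PEFT)

print(pick_backend())
```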
Experimental Results
- 100% of facts reach full LoRA consolidation for 5–20 facts; chat recall reaches 1.00 within 2–3 sleep cycles
- Recall recovers from 40% to 100% for 30 facts within 4 sleep cycles
- The 8B model exhibits a phase-transition threshold at ~13 wake edits (13 edits → 0.92 recall, 14 edits → 0.57 recall)
- RLHF alignment suppresses LoRA knowledge injection (3B: 47%, 8B: 37%, 70B: 0%)
Interactive Commands:
- /sleep triggers a full sleep cycle
- /nap runs a quick audit of recent facts
- /status displays system state
- /compact compresses the context window
Known Limitations:
- Validation is limited to synthetic person-city triples; no real dialogue scenarios
- No comparison against a RAG baseline
- The 70B model may OOM at ~30 facts per session on 2×H100
- All 5 papers are Zenodo preprints without formal conference or journal acceptance