Fast and accurate automatic speech recognition (ASR) optimized for edge devices. Features streaming support, voice intent recognition, and speaker identification with significantly lower latency than Whisper (107ms on Mac with Medium model). Provides unified API across iOS, Android, Linux, Windows, and macOS, ideal for robotics, smart home, and IoT applications.
Moonshine Voice is an automatic speech recognition (ASR) solution from Useful Sensors, released in October 2024 and optimized for edge devices and real-time streaming applications.
## Key Advantages
- Flexible Input Windows: no fixed 30-second window as in Whisper, so no zero-padding overhead
- Streaming Cache: Supports incremental audio input with encoder/decoder state caching
- Ultra-low Latency: Medium Streaming model achieves 107ms on MacBook Pro, ~802ms on Raspberry Pi 5
- Cross-platform Consistency: Unified API across iOS, Android, Linux, Windows, macOS
## Performance Comparison (vs Whisper)
| Model | WER | Parameters | MacBook Pro Latency | Raspberry Pi 5 Latency |
|---|---|---|---|---|
| Moonshine Medium Streaming | 6.65% | 245M | 107ms | 802ms |
| Whisper Large v3 | 7.44% | 1.5B | 11,286ms | N/A |
| Moonshine Small Streaming | 7.84% | 123M | 73ms | 527ms |
| Whisper Small | 8.59% | 244M | 1,940ms | 10,397ms |
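The latency gap in the table is easier to appreciate as a speedup factor; a quick back-of-the-envelope check, using the numbers taken directly from the rows above:

```python
# Latency figures (ms) copied from the comparison table above.
latency_ms = {
    "Moonshine Medium Streaming": 107,
    "Whisper Large v3": 11_286,
    "Moonshine Small Streaming": 73,
    "Whisper Small": 1_940,
}

# Speedup on MacBook Pro at comparable or better WER.
medium_speedup = latency_ms["Whisper Large v3"] / latency_ms["Moonshine Medium Streaming"]
small_speedup = latency_ms["Whisper Small"] / latency_ms["Moonshine Small Streaming"]
print(f"Medium Streaming vs Whisper Large v3: {medium_speedup:.0f}x faster")
print(f"Small Streaming vs Whisper Small: {small_speedup:.0f}x faster")
```

Note that Moonshine Medium Streaming (245M parameters) both outperforms Whisper Large v3 (1.5B parameters) on WER and runs roughly two orders of magnitude faster on the same hardware.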
## Language Support
English, Spanish, Chinese, Japanese, Korean, Vietnamese, Ukrainian, Arabic
## Model Sizes
| Architecture | Parameters | English WER |
|---|---|---|
| Tiny | 26M | 12.66% |
| Tiny Streaming | 34M | 12.00% |
| Base | 58M | 10.07% |
| Small Streaming | 123M | 7.84% |
| Medium Streaming | 245M | 6.65% |
## Core Capabilities
- Real-time Speech-to-Text (ASR)
- Voice Intent Recognition (semantic matching for predefined commands)
- Speaker Identification (distinguish different speakers)
- VAD Segmentation (based on Silero VAD)
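Voice intent recognition here means matching an utterance against a set of predefined commands by semantic similarity. The library's internal matcher isn't shown in this document, so the sketch below uses a deliberately simple stand-in embedding (word counts plus cosine similarity) just to illustrate the idea; a real system would use learned sentence embeddings:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a learned sentence embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical predefined commands, mapped to intent names.
INTENTS = {
    "lights_on": "turn on the lights",
    "lights_off": "turn off the lights",
    "play_music": "play some music",
}

def match_intent(utterance, threshold=0.5):
    scored = {name: cosine(embed(utterance), embed(phrase))
              for name, phrase in INTENTS.items()}
    best = max(scored, key=scored.get)
    return best if scored[best] >= threshold else None

print(match_intent("please turn the lights on"))  # → lights_on
print(match_intent("unrelated chatter"))          # → None
```

The threshold rejects utterances that resemble none of the commands, which is what lets an intent system ignore free-form speech.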
## Technical Architecture

```
Microphone Input → VAD (Silero) → Streaming Encoder → Decoder → Text Output
                        ↓
                   Speaker ID
                        ↓
              Intent Recognition
```
- Model Architecture: Encoder-Decoder Transformer
- Positional Encoding: Rotary Position Embedding (RoPE)
- Inference Engine: ONNX Runtime (.ort format, memory-mapped optimization)
- Quantization: 8-bit weights + 8-bit MatMul
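The architecture notes mention Rotary Position Embedding (RoPE). Its core operation rotates adjacent pairs of feature dimensions by a position-dependent angle, so relative position falls out of the query/key dot product. A minimal sketch of that rotation (standard RoPE frequencies; not Moonshine's actual implementation):

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to one vector at position `pos`.

    Each adjacent pair (vec[2i], vec[2i+1]) is rotated by the angle
    pos / base**(2i / d), embedding position directly in the features.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

print(rope([1.0, 0.0, 1.0, 0.0], pos=0))  # position 0 leaves the vector unchanged
```

Because the rotation depends only on absolute position, the dot product between two rotated vectors depends only on their relative offset, a property that suits streaming decoders well.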
## Quick Start

```shell
pip install moonshine-voice
python -m moonshine_voice.download --language en
python -m moonshine_voice.mic_transcriber --language en
```
## Python API Example

```python
from moonshine_voice import Transcriber, TranscriptEventListener

# model_path and model_arch point at the model downloaded in Quick Start.
transcriber = Transcriber(model_path=model_path, model_arch=model_arch)

# Listeners receive transcript events as lines are finalized.
class TestListener(TranscriptEventListener):
    def on_line_completed(self, event):
        print(f"Line completed: {event.line.text}")

transcriber.add_listener(TestListener())

transcriber.start()
transcriber.add_audio(audio_chunk, sample_rate)  # feed audio incrementally
transcriber.stop()
```
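In a streaming loop, audio is fed in small chunks rather than all at once. The chunking arithmetic is simple; the sketch below uses pure Python with no library calls, a 16 kHz sample rate (a common ASR input rate, assumed here), and a 0.5 s chunk matching the default `update_interval`:

```python
SAMPLE_RATE = 16_000    # Hz; assumed typical ASR input rate
CHUNK_SECONDS = 0.5     # matches the default update_interval
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_SECONDS)

def chunks(samples, size=CHUNK_SAMPLES):
    """Yield successive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]

# Two seconds of silence stands in for real microphone audio.
audio = [0.0] * (SAMPLE_RATE * 2)
sizes = [len(c) for c in chunks(audio)]
print(sizes)  # → [8000, 8000, 8000, 8000]
```

Each chunk would then be passed to `transcriber.add_audio(chunk, SAMPLE_RATE)`, letting the streaming encoder's cache carry state between calls.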
## Key Configuration Options

| Option | Description | Default |
|---|---|---|
| `update_interval` | Transcription update interval (seconds) | 0.5 |
| `max_tokens_per_second` | Hallucination detection threshold (tokens/s) | 6.5 |
| `vad_threshold` | VAD sensitivity (0–1) | 0.5 |
| `identify_speakers` | Speaker identification toggle | true |
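The `max_tokens_per_second` option implies a simple rate check: decoder output that arrives faster than plausible speech is likely a hallucination. A hedged sketch of that logic (an illustration of the idea, not the library's actual implementation):

```python
def looks_like_hallucination(token_count, audio_seconds,
                             max_tokens_per_second=6.5):
    """Flag output whose token rate exceeds a plausible speaking rate."""
    if audio_seconds <= 0:
        # Tokens with no audio to back them are certainly suspect.
        return token_count > 0
    return token_count / audio_seconds > max_tokens_per_second

print(looks_like_hallucination(12, 2.0))  # 6.0 tokens/s → False
print(looks_like_hallucination(40, 2.0))  # 20.0 tokens/s → True
```

A default of 6.5 tokens/s is comfortably above normal speaking rates, so legitimate transcripts rarely trip the check.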
## Use Cases
- Real-time transcription apps (subtitles, meeting notes)
- Voice command systems (robotics, smart home, automotive)
- Edge device voice interaction (Raspberry Pi, IoT, wearables)
- Privacy-sensitive offline speech processing