Fast and accurate automatic speech recognition (ASR) optimized for edge devices. Features streaming support, voice intent recognition, and speaker identification with significantly lower latency than Whisper (107ms on Mac with Medium model). Provides unified API across iOS, Android, Linux, Windows, and macOS, ideal for robotics, smart home, and IoT applications.
Moonshine Voice is an automatic speech recognition (ASR) solution from Useful Sensors, released in October 2024 and optimized for edge devices and real-time streaming applications.
## Key Advantages
- Flexible Input Windows: no fixed 30-second window as in Whisper, so no zero-padding overhead
- Streaming Cache: Supports incremental audio input with encoder/decoder state caching
- Ultra-low Latency: Medium Streaming model achieves 107ms on MacBook Pro, ~802ms on Raspberry Pi 5
- Cross-platform Consistency: Unified API across iOS, Android, Linux, Windows, macOS
## Performance Comparison (vs Whisper)
| Model | WER | Parameters | MacBook Pro Latency | Raspberry Pi 5 Latency |
|---|---|---|---|---|
| Moonshine Medium Streaming | 6.65% | 245M | 107ms | 802ms |
| Whisper Large v3 | 7.44% | 1.5B | 11,286ms | N/A |
| Moonshine Small Streaming | 7.84% | 123M | 73ms | 527ms |
| Whisper Small | 8.59% | 244M | 1,940ms | 10,397ms |
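The latency gap in the table is easier to appreciate as a speedup factor; a quick back-of-the-envelope check, using the numbers taken directly from the rows above:

```python
# Latency figures (ms) copied from the comparison table above.
latency_ms = {
    "Moonshine Medium Streaming": 107,
    "Whisper Large v3": 11_286,
    "Moonshine Small Streaming": 73,
    "Whisper Small": 1_940,
}

# Speedup on MacBook Pro at comparable or better WER.
medium_speedup = latency_ms["Whisper Large v3"] / latency_ms["Moonshine Medium Streaming"]
small_speedup = latency_ms["Whisper Small"] / latency_ms["Moonshine Small Streaming"]
print(f"Medium Streaming vs Whisper Large v3: {medium_speedup:.0f}x faster")
print(f"Small Streaming vs Whisper Small: {small_speedup:.0f}x faster")
```

Note that Moonshine Medium Streaming (245M parameters) both outperforms Whisper Large v3 (1.5B parameters) on WER and runs roughly two orders of magnitude faster on the same hardware.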
## Language Support
English, Spanish, Chinese, Japanese, Korean, Vietnamese, Ukrainian, Arabic
## Model Sizes
| Architecture | Parameters | English WER |
|---|---|---|
| Tiny | 26M | 12.66% |
| Tiny Streaming | 34M | 12.00% |
| Base | 58M | 10.07% |
| Small Streaming | 123M | 7.84% |
| Medium Streaming | 245M | 6.65% |
## Core Capabilities
- Real-time Speech-to-Text (ASR)
- Voice Intent Recognition (semantic matching for predefined commands)
- Speaker Identification (distinguish different speakers)
- VAD Segmentation (based on Silero VAD)
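Voice intent recognition here means matching an utterance against a set of predefined commands by semantic similarity. The library's internal matcher isn't shown in this document, so the sketch below uses a deliberately simple stand-in embedding (word counts plus cosine similarity) just to illustrate the idea; a real system would use learned sentence embeddings:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a learned sentence embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical predefined commands, mapped to intent names.
INTENTS = {
    "lights_on": "turn on the lights",
    "lights_off": "turn off the lights",
    "play_music": "play some music",
}

def match_intent(utterance, threshold=0.5):
    scored = {name: cosine(embed(utterance), embed(phrase))
              for name, phrase in INTENTS.items()}
    best = max(scored, key=scored.get)
    return best if scored[best] >= threshold else None

print(match_intent("please turn the lights on"))  # → lights_on
print(match_intent("unrelated chatter"))          # → None
```

The threshold rejects utterances that resemble none of the commands, which is what lets an intent system ignore free-form speech.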
## Technical Architecture

```
Microphone Input → VAD (Silero) → Streaming Encoder → Decoder → Text Output
                        ↓
                   Speaker ID
                        ↓
              Intent Recognition
```
- Model Architecture: Encoder-Decoder Transformer
- Positional Encoding: Rotary Position Embedding (RoPE)
- Inference Engine: ONNX Runtime (.ort format, memory-mapped optimization)
- Quantization: 8-bit weights + 8-bit MatMul
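The architecture notes mention Rotary Position Embedding (RoPE). Its core operation rotates adjacent pairs of feature dimensions by a position-dependent angle, so relative position falls out of the query/key dot product. A minimal sketch of that rotation (standard RoPE frequencies; not Moonshine's actual implementation):

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to one vector at position `pos`.

    Each adjacent pair (vec[2i], vec[2i+1]) is rotated by the angle
    pos / base**(2i / d), embedding position directly in the features.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

print(rope([1.0, 0.0, 1.0, 0.0], pos=0))  # position 0 leaves the vector unchanged
```

Because the rotation depends only on absolute position, the dot product between two rotated vectors depends only on their relative offset, a property that suits streaming decoders well.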
## Quick Start

```shell
pip install moonshine-voice
python -m moonshine_voice.download --language en
python -m moonshine_voice.mic_transcriber --language en
```
## Python API Example

```python
from moonshine_voice import Transcriber, TranscriptEventListener

# model_path and model_arch point at the model downloaded in Quick Start.
transcriber = Transcriber(model_path=model_path, model_arch=model_arch)

# Listeners receive transcript events as lines are finalized.
class TestListener(TranscriptEventListener):
    def on_line_completed(self, event):
        print(f"Line completed: {event.line.text}")

transcriber.add_listener(TestListener())

transcriber.start()
transcriber.add_audio(audio_chunk, sample_rate)  # feed audio incrementally
transcriber.stop()
```
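In a streaming loop, audio is fed in small chunks rather than all at once. The chunking arithmetic is simple; the sketch below uses pure Python with no library calls, a 16 kHz sample rate (a common ASR input rate, assumed here), and a 0.5 s chunk matching the default `update_interval`:

```python
SAMPLE_RATE = 16_000    # Hz; assumed typical ASR input rate
CHUNK_SECONDS = 0.5     # matches the default update_interval
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_SECONDS)

def chunks(samples, size=CHUNK_SAMPLES):
    """Yield successive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]

# Two seconds of silence stands in for real microphone audio.
audio = [0.0] * (SAMPLE_RATE * 2)
sizes = [len(c) for c in chunks(audio)]
print(sizes)  # → [8000, 8000, 8000, 8000]
```

Each chunk would then be passed to `transcriber.add_audio(chunk, SAMPLE_RATE)`, letting the streaming encoder's cache carry state between calls.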
## Key Configuration Options

| Option | Description | Default |
|---|---|---|
| `update_interval` | Transcription update interval (seconds) | 0.5 |
| `max_tokens_per_second` | Hallucination detection threshold (tokens/s) | 6.5 |
| `vad_threshold` | VAD sensitivity (0–1) | 0.5 |
| `identify_speakers` | Speaker identification toggle | true |
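The `max_tokens_per_second` option implies a simple rate check: decoder output that arrives faster than plausible speech is likely a hallucination. A hedged sketch of that logic (an illustration of the idea, not the library's actual implementation):

```python
def looks_like_hallucination(token_count, audio_seconds,
                             max_tokens_per_second=6.5):
    """Flag output whose token rate exceeds a plausible speaking rate."""
    if audio_seconds <= 0:
        # Tokens with no audio to back them are certainly suspect.
        return token_count > 0
    return token_count / audio_seconds > max_tokens_per_second

print(looks_like_hallucination(12, 2.0))  # 6.0 tokens/s → False
print(looks_like_hallucination(40, 2.0))  # 20.0 tokens/s → True
```

A default of 6.5 tokens/s is comfortably above normal speaking rates, so legitimate transcripts rarely trip the check.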
## Use Cases
- Real-time transcription apps (subtitles, meeting notes)
- Voice command systems (robotics, smart home, automotive)
- Edge device voice interaction (Raspberry Pi, IoT, wearables)
- Privacy-sensitive offline speech processing