
Moonshine Voice

Added: Feb 22, 2026
Category: Model & Inference Framework
License: Open Source
Tags: Python, PyTorch, Multimodal, Transformers, SDK, CLI, Model & Inference Framework, Model Training & Inference, Protocol, API & Integration

Fast and accurate automatic speech recognition (ASR) optimized for edge devices. Features streaming support, voice intent recognition, and speaker identification, with significantly lower latency than Whisper (107 ms on a MacBook Pro with the Medium model). Provides a unified API across iOS, Android, Linux, Windows, and macOS, making it well suited to robotics, smart home, and IoT applications.

Moonshine Voice is an automatic speech recognition (ASR) solution from Useful Sensors, first released in October 2024 and optimized for edge devices and real-time streaming applications.

Key Advantages#

  • Flexible Input Windows: No fixed 30-second window like Whisper, no zero-padding overhead
  • Streaming Cache: Supports incremental audio input with encoder/decoder state caching
  • Ultra-low Latency: Medium Streaming model achieves 107ms on MacBook Pro, ~802ms on Raspberry Pi 5
  • Cross-platform Consistency: Unified API across iOS, Android, Linux, Windows, macOS
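The first advantage above is easy to quantify. The sketch below is illustrative only (it is not the Moonshine API): it compares how many samples each approach must encode for a short utterance, given that Whisper zero-pads every clip to a fixed 30-second window while Moonshine encodes only the audio actually present.

```python
# Illustrative sketch: cost of fixed vs. flexible input windows.
SAMPLE_RATE = 16_000          # samples per second
WHISPER_WINDOW_S = 30         # Whisper's fixed window length

def whisper_style_samples(num_samples: int) -> int:
    # Whisper zero-pads every clip to a full 30-second window.
    return WHISPER_WINDOW_S * SAMPLE_RATE

def moonshine_style_samples(num_samples: int) -> int:
    # Moonshine-style flexible windows encode only the real audio.
    return num_samples

clip = 2 * SAMPLE_RATE  # a 2-second utterance
print(whisper_style_samples(clip) // moonshine_style_samples(clip))  # → 15
```

For a 2-second utterance, the fixed window forces 15x more samples through the encoder, which is one reason short-utterance latency diverges so sharply in the benchmark table below.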

Performance Comparison (vs Whisper)#

| Model | WER | Parameters | MacBook Pro Latency | Raspberry Pi 5 Latency |
|---|---|---|---|---|
| Moonshine Medium Streaming | 6.65% | 245M | 107 ms | 802 ms |
| Whisper Large v3 | 7.44% | 1.5B | 11,286 ms | N/A |
| Moonshine Small Streaming | 7.84% | 123M | 73 ms | 527 ms |
| Whisper Small | 8.59% | 244M | 1,940 ms | 10,397 ms |

Language Support#

English, Spanish, Chinese, Japanese, Korean, Vietnamese, Ukrainian, Arabic

Model Sizes#

| Architecture | Parameters | English WER |
|---|---|---|
| Tiny | 26M | 12.66% |
| Tiny Streaming | 34M | 12.00% |
| Base | 58M | 10.07% |
| Small Streaming | 123M | 7.84% |
| Medium Streaming | 245M | 6.65% |

Core Capabilities#

  • Real-time Speech-to-Text (ASR)
  • Voice Intent Recognition (semantic matching for predefined commands)
  • Speaker Identification (distinguish different speakers)
  • VAD Segmentation (based on Silero VAD)
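To make the VAD step concrete, here is a minimal energy-based segmenter. Moonshine Voice uses Silero VAD, a neural model; this pure-Python stand-in only illustrates the segmentation idea (mark frames whose RMS energy exceeds a threshold as speech, then merge consecutive speech frames into segments).

```python
# Energy-based VAD sketch -- a stand-in for Silero VAD, illustration only.
def rms(frame):
    """Root-mean-square energy of one audio frame."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def vad_segments(samples, frame_len=160, threshold=0.02):
    """Return (start, end) sample indices of contiguous speech runs."""
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        speech = rms(samples[i:i + frame_len]) > threshold
        if speech and start is None:
            start = i                      # speech run begins
        elif not speech and start is not None:
            segments.append((start, i))    # speech run ends
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# Silence, then a loud burst, then silence again -> one segment.
audio = [0.0] * 1600 + [0.5] * 1600 + [0.0] * 1600
print(vad_segments(audio))  # → [(1600, 3200)]
```

In the real pipeline, only the audio inside such segments is forwarded to the streaming encoder, which saves compute during silence.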

Technical Architecture#

Microphone Input → VAD (Silero) → Streaming Encoder → Decoder → Text Output
                                    ↓
                               Speaker ID
                                    ↓
                               Intent Recognition
  • Model Architecture: Encoder-Decoder Transformer
  • Positional Encoding: Rotary Position Embedding (RoPE)
  • Inference Engine: ONNX Runtime (.ort format, memory-mapped optimization)
  • Quantization: 8-bit weights + 8-bit MatMul
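The "8-bit weights" bullet refers to the family of schemes sketched below: symmetric per-tensor int8 quantization. The exact scheme Moonshine's ONNX export uses is an assumption here; this shows only the general mechanism and its error bound.

```python
# Sketch of symmetric 8-bit weight quantization (illustration, not
# necessarily Moonshine's exact scheme).
def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.0, 1.0]
q, s = quantize_int8(w)
approx = dequantize(q, s)
# Rounding error per weight is bounded by half the scale step.
assert all(abs(a - b) <= s / 2 for a, b in zip(w, approx))
```

Storing int8 weights cuts model size roughly 4x versus float32, and int8 MatMul kernels are what make the Raspberry Pi latencies in the table above attainable.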

Quick Start#

pip install moonshine-voice
python -m moonshine_voice.download --language en
python -m moonshine_voice.mic_transcriber --language en

Python API Example#

from moonshine_voice import Transcriber, TranscriptEventListener

# model_path / model_arch point at a model fetched with the download step above
transcriber = Transcriber(model_path=model_path, model_arch=model_arch)

class TestListener(TranscriptEventListener):
    def on_line_completed(self, event):
        # Fired each time a line of transcript is finalized
        print(f"Line completed: {event.line.text}")

transcriber.add_listener(TestListener())
transcriber.start()
transcriber.add_audio(audio_chunk, sample_rate)  # call repeatedly as audio arrives
transcriber.stop()
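In a streaming session, add_audio is called repeatedly with small chunks rather than once with a whole file. Below is a standard-library sketch of chunking a 16 kHz mono WAV for that loop; the file name is hypothetical, and the exact chunk format add_audio expects (raw bytes vs. sample arrays) is an assumption to check against the package docs.

```python
import wave

def wav_chunks(source, chunk_ms=100):
    """Yield (frames, sample_rate) chunks from a mono WAV file or file-like."""
    with wave.open(source, "rb") as wf:
        rate = wf.getframerate()
        frames_per_chunk = rate * chunk_ms // 1000
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data, rate

# Hypothetical usage with the transcriber from the example above:
# for chunk, rate in wav_chunks("speech.wav"):
#     transcriber.add_audio(chunk, rate)
```

Feeding 100 ms chunks keeps the latency low while amortizing per-call overhead; the update_interval option below controls how often the transcript itself refreshes.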

Key Configuration Options#

| Option | Description | Default |
|---|---|---|
| update_interval | Transcription update interval (seconds) | 0.5 |
| max_tokens_per_second | Hallucination detection threshold | 6.5 |
| vad_threshold | VAD sensitivity | 0.5 |
| identify_speakers | Speaker identification toggle | true |
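A hypothetical configuration fragment tying the table to the API example above. Whether Transcriber accepts these options as constructor keyword arguments is an assumption; treat this as a sketch of the configuration surface, not a verified call signature.

```python
# Hypothetical sketch: passing the table's options at construction time.
transcriber = Transcriber(
    model_path=model_path,
    model_arch=model_arch,
    update_interval=0.5,        # seconds between transcript updates
    max_tokens_per_second=6.5,  # above this rate, output is flagged as hallucination
    vad_threshold=0.5,          # Silero VAD sensitivity
    identify_speakers=True,     # tag transcript lines with speaker IDs
)
```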

Use Cases#

  • Real-time transcription apps (subtitles, meeting notes)
  • Voice command systems (robotics, smart home, automotive)
  • Edge device voice interaction (Raspberry Pi, IoT, wearables)
  • Privacy-sensitive offline speech processing
