Speech-AI-Forge

A project built around TTS generation models. It provides an API server and a Gradio-based WebUI, with support for multi-voice synthesis, voice cloning, and audio enhancement.

One-Minute Overview

Speech-AI-Forge is a voice AI toolkit designed for developers and content creators. It integrates multiple Text-to-Speech models, including ChatTTS, CosyVoice, FishSpeech, and others, and exposes them through both a web interface and API services. Whether you need to quickly generate voice content, create multi-character audio, or perform voice cloning, this project provides the tools for it.

Core Value: A one-stop voice AI solution covering everything from basic TTS to voice cloning

Quick Start

Installation Difficulty: Medium - Requires manual model downloads and environment setup

# First, download the required models
python -m scripts.download_models --source modelscope

# Start the WebUI
python webui.py

# Start the API service
python launch.py

Is this suitable for my needs?

✅ Content Creators: Need to convert text to high-quality audio with multiple voices and styles

✅ Developers: Need to integrate voice capabilities into applications

✅ Voice Cloning Enthusiasts: Want to replicate specific voices for synthesis

❌ Beginners: The project requires a technical background, especially for model download and configuration

Core Capabilities

1. Multi-Model TTS Support - Diverse Voice Generation Options

Supports multiple TTS models including ChatTTS, CosyVoice, FishSpeech, FireRedTTS, GPT-SoVITS
Select the most suitable model based on your use case
Actual Value: Provides diverse voice generation options, allowing users to choose the best model based on quality, style, or specific requirements

2. SSML Advanced Control - Precise Voice Output Control

XML-based syntax for speech synthesis control
Supports multi-character, multi-emotion long text generation
Actual Value: Creates expressive conversational content, such as audiobooks and podcasts with multiple characters
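To illustrate what an XML-based, multi-character script can look like, here is a minimal sketch that builds one with Python's standard library. The element and attribute names (`speak`, `voice`, `spk`, `version`) and the speaker names are assumptions chosen for illustration, not a confirmed description of the project's SSML dialect; check the project README for the exact syntax.

```python
import xml.etree.ElementTree as ET


def build_ssml(lines, version="0.1"):
    """Build a multi-speaker SSML-style document.

    `lines` is a list of (speaker, text) pairs. The tag and attribute
    names used here are hypothetical and may differ from what the
    project actually accepts.
    """
    speak = ET.Element("speak", {"version": version})
    for speaker, text in lines:
        # One <voice> element per utterance, tagged with its speaker.
        voice = ET.SubElement(speak, "voice", {"spk": speaker})
        voice.text = text
    # ElementTree escapes special characters such as "&" automatically.
    return ET.tostring(speak, encoding="unicode")


# Hypothetical speakers and dialogue.
script = [
    ("narrator", "The door creaked open."),
    ("alice", "Who's there?"),
    ("bob", "It's only me & the cat."),
]
ssml = build_ssml(script)
print(ssml)
```

Building the document through an XML library rather than string concatenation keeps the markup well-formed even when the dialogue contains characters like `&` or `<`.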

3. Voice Management System - Personalized Voice Customization

Multiple built-in voices (27 ChatTTS, 7 CosyVoice)
Supports uploading custom voice files
Create voices from reference audio
Actual Value: Enables users to create unique and consistent voices, enhancing brand recognition or character personality

4. Audio Enhancement - Improved Output Quality

Integrated ResembleEnhance model
Supports voice enhancement and post-processing
Actual Value: Improves the naturalness and clarity of synthesized speech, bringing it closer to real human voice quality

5. API Service Integration - Seamless System Integration

Provides a RESTful API
Supports integration with platforms like SillyTavern
Actual Value: Allows developers to easily integrate voice capabilities into existing applications and platforms
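As a rough sketch of what calling such an API from an application could look like, the snippet below builds a TTS request with Python's standard library. The base URL, the endpoint path, the query parameter names (`text`, `spk`, `format`), and the speaker name are all assumptions for illustration; consult the API documentation served by your running instance for the real routes and parameters.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed values: adjust these to match your deployment.
BASE_URL = "http://localhost:7870"  # hypothetical host and port
TTS_PATH = "/v1/tts"                # hypothetical endpoint path


def build_tts_request(text, speaker, fmt="mp3"):
    """Build (but do not send) a GET request for synthesized speech.

    The parameter names are hypothetical placeholders.
    """
    query = urlencode({"text": text, "spk": speaker, "format": fmt})
    return Request(f"{BASE_URL}{TTS_PATH}?{query}", method="GET")


def synthesize(text, speaker, out_path, timeout=60):
    """Send the request and write the returned audio bytes to a file."""
    req = build_tts_request(text, speaker)
    with urlopen(req, timeout=timeout) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path


# Building the request needs no running server; sending it does.
req = build_tts_request("Hello, world!", "female2")
print(req.full_url)
```

With the API service running, `synthesize("Hello, world!", "female2", "hello.mp3")` would save the response body to `hello.mp3`, assuming the endpoint returns raw audio.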

Technology Stack & Integration

Development Language: Python
Main Dependencies: Gradio (WebUI), various TTS and ASR models
Integration Method: API Server / Web Interface / Docker Container

Ecosystem & Extensions

Model Support: Plans to support more TTS, ASR, and voice cloning models
Plugin System: Can integrate with platforms like SillyTavern via API
Container Deployment: Provides Docker Compose configuration for simplified deployment

Maintenance Status

Development Activity: Active development with multiple commits per week
Recent Updates: Continuous addition of new model features and optimizations
Community Response: Active handling of user issues and suggestions

Documentation & Learning Resources

Documentation Quality: Comprehensive, including detailed installation guides, feature explanations, and FAQ
Official Documentation: Complete documentation available in the project README
Example Code: Provides examples for style control and long text generation