Introduction

Sonna is the open-source, local-first AI voice studio — a free alternative to ElevenLabs and WisprFlow, running entirely on your machine.

What is Sonna?

Sonna is the open-source, local-first AI voice studio. It closes the voice I/O loop in both directions on one machine, with no cloud and no accounts:

  • Humans talk — hold a chord anywhere on your machine and your dictation lands as clean text in whatever text field you had focused
  • Agents talk back — any MCP-aware agent can call Sonna to speak in one of your cloned voices
  • Voices speak for themselves — voice profiles can carry a personality that composes fresh lines or rewrites text before it's spoken

It's the free, local alternative to both ElevenLabs (voice cloning and TTS) and WisprFlow (voice dictation for agents and power users) — covering both sides of the same loop in one app, with a single model directory and LLM shared between input and output.

What's in the app

  • Dictation — global hotkey, push-to-talk and toggle modes, auto-paste into the focused field on macOS and Windows (see Dictation)
  • Captures tab — paired audio + transcript archive, retranscribe, refine, play-as-voice, promote-to-sample (see Captures)
  • Voice cloning — 5 cloning engines covering 23 languages. Zero-shot cloning from a reference sample (see Voice Cloning)
  • Preset voices — 50+ curated voices via Kokoro and Qwen CustomVoice for when you don't want to clone (see Preset Voices)
  • Voice personalities — optional free-form personality on any profile plus a compose button and persona-rewrite toggle powered by a local LLM (see Voice Personalities)
  • Post-processing effects — pitch shift, reverb, delay, chorus, compression, filters (Spotify's Pedalboard)
  • Expressive speech — paralinguistic tags like [laugh] and [sigh] via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice
  • Unlimited length — auto-chunking with crossfade for long scripts
  • Stories editor — multi-track timeline for conversations and podcasts
  • API-first — REST + WebSocket API; MCP server for agent integrations
  • Runs everywhere — macOS (MLX/Metal), Windows (CUDA / DirectML), Linux (ROCm / CPU), Intel Arc, Docker
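
The "unlimited length" item above describes auto-chunking with crossfade. As a rough illustration of that idea only, and not Sonna's actual implementation, the sketch below splits a script at sentence boundaries and joins two audio buffers with a linear crossfade. The function names, the `max_chars` limit, and the use of plain float lists for audio samples are all assumptions made for this example.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    (Hypothetical helper, not part of Sonna's API.)
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample buffers, linearly blending the end of a into the start of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the overlap
        blended.append(tail[i] * (1 - w) + b[i] * w)
    return head + blended + b[overlap:]


if __name__ == "__main__":
    print(split_into_chunks("One. Two. Three.", max_chars=9))
    print(len(crossfade([1.0] * 4, [0.0] * 4, overlap=2)))
```

Each chunk would be synthesized separately and the resulting buffers joined pairwise; the overlap hides the seam between chunks at the cost of shortening the output by `overlap` samples per join.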

TTS Engines

Seven engines with different strengths, switchable per generation:

| Engine | Profile Type | Languages | Strengths |
| --- | --- | --- | --- |
| Qwen3-TTS (0.6B / 1.7B) | Cloned | 10 | High-quality multilingual cloning |
| Qwen CustomVoice (0.6B / 1.7B) | Preset (9 voices) | 10 | Natural-language delivery control (tone, emotion, pace) |
| LuxTTS | Cloned | English | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU |
| Chatterbox Multilingual | Cloned | 23 | Broadest language coverage |
| Chatterbox Turbo | Cloned | English | Fast 350M model with paralinguistic emotion/sound tags |
| TADA (1B / 3B) | Cloned | 10 | HumeAI speech-language model, 700s+ coherent audio |
| Kokoro | Preset (50 voices) | 9 | 82M parameters, CPU realtime, lowest VRAM of any engine |
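
Because engines are switchable per generation, it can help to treat the table above as data. The sketch below is one hypothetical way to encode it and filter by profile type and language coverage. The `Engine` class and `candidates` function are illustrative names, not part of Sonna. English-only engines are recorded with a language count of 1, and the Kokoro row is read as 9 languages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    profile_type: str    # "cloned" or "preset"
    language_count: int  # 1 means English only


# Values transcribed from the engine table; the structure itself is an assumption.
ENGINES = [
    Engine("Qwen3-TTS", "cloned", 10),
    Engine("Qwen CustomVoice", "preset", 10),
    Engine("LuxTTS", "cloned", 1),
    Engine("Chatterbox Multilingual", "cloned", 23),
    Engine("Chatterbox Turbo", "cloned", 1),
    Engine("TADA", "cloned", 10),
    Engine("Kokoro", "preset", 9),
]


def candidates(profile_type: str, min_languages: int = 1) -> list[str]:
    """Return engine names matching a profile type with at least min_languages."""
    return [
        e.name
        for e in ENGINES
        if e.profile_type == profile_type and e.language_count >= min_languages
    ]


if __name__ == "__main__":
    print(candidates("cloned"))
    print(candidates("cloned", min_languages=11))
```

A language count is a coarse filter: two engines that both list 10 languages do not necessarily cover the same 10, so a real selector would need the actual language lists.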

STT and local LLM

Sonna also runs a full speech recognition and local LLM stack, shared between dictation, the Captures tab, and per-profile personality modes:

| Layer | Models |
| --- | --- |
| STT | Whisper Base / Small / Medium / Large / Turbo (PyTorch or MLX) |
| LLM | Qwen3 0.6B / 1.7B / 4B (refinement + per-profile compose / persona-rewrite) |

No cloud fallback, no bring-your-own-API-key. Local is the product.

GPU Support

| Platform | Backend | Notes |
| --- | --- | --- |
| macOS (Apple Silicon) | MLX (Metal) | 4-5x faster via Neural Engine |
| Windows / Linux (NVIDIA) | PyTorch (CUDA) | Auto-downloads CUDA binary from within the app |
| Linux (AMD) | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION |
| Windows (any GPU) | DirectML | Universal Windows GPU support |
| Intel Arc | IPEX/XPU | Intel discrete GPU acceleration |
| Any | CPU | Works everywhere, just slower |
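
The table above maps platforms to backends but does not say how the app chooses when more than one applies. The sketch below shows one plausible priority order as a pure function. The order, the `pick_backend` name, and the vendor strings are assumptions for illustration, not Sonna's documented behavior.

```python
def pick_backend(
    os_name: str,
    gpus: frozenset[str] = frozenset(),
    apple_silicon: bool = False,
) -> str:
    """Map a platform description to a backend label from the table.

    os_name is "macos", "windows", or "linux"; gpus holds vendor tags such as
    "nvidia", "amd", or "intel_arc". The priority order here is an assumption.
    """
    if os_name == "macos" and apple_silicon:
        return "MLX (Metal)"
    if "nvidia" in gpus and os_name in ("windows", "linux"):
        return "PyTorch (CUDA)"
    if "amd" in gpus and os_name == "linux":
        return "PyTorch (ROCm)"
    if "intel_arc" in gpus:
        return "IPEX/XPU"
    if os_name == "windows" and gpus:
        return "DirectML"  # catch-all for any other Windows GPU
    return "CPU"


if __name__ == "__main__":
    print(pick_backend("macos", apple_silicon=True))
    print(pick_backend("windows", frozenset({"amd"})))
    print(pick_backend("linux"))
```

Keeping the decision in a pure function, separate from hardware detection, makes the fallback chain easy to test without any GPU present.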

Use Cases

  • Dictation for humans and agents — speak instead of type, in any app
  • Agent voice output — any MCP-aware agent can speak in a cloned voice
  • Game development — generate dynamic dialogue for characters
  • Content creation — podcasts, video voiceovers, audiobooks
  • Accessibility — speech-to-text for any field, TTS with a voice you own
  • Voice assistants — custom voice interfaces without a cloud bill
  • Production pipelines — automate voice workflows via the REST API

Tech Stack

| Layer | Technology |
| --- | --- |
| Desktop App | Tauri (Rust) |
| Frontend | React, TypeScript, Tailwind CSS |
| State | Zustand, React Query |
| Backend | FastAPI (Python) |
| TTS Engines | Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro |
| STT | Whisper / Whisper Turbo (PyTorch or MLX) |
| Local LLM | Qwen3 0.6B / 1.7B / 4B (MLX or PyTorch) |
| Effects | Pedalboard (Spotify) |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |
| Audio | WaveSurfer.js, librosa |
