Recording & Transcription

A map of the three places you can record and transcribe audio in Sonna — dictation, captures, and voice-profile samples.

Overview

Sonna records and transcribes audio in three different contexts, each feeding a different surface in the app. This page is a map; follow the links for the detail.

| Goal | Where | Docs |
| --- | --- | --- |
| Speak and have your words land in another app | Global hotkey → Captures tab + auto-paste | Dictation |
| Record a thought, a meeting, or a voice memo inside Sonna | Captures tab | Captures |
| Record a clip to clone a voice from | Voices tab → profile samples | Creating Voice Profiles |

All three paths share the same STT backend — it's the surrounding workflow that differs.

Dictation

The 0.5.0 headline feature. Hold a hotkey chord anywhere on your machine, speak, and release. The transcript lands in whatever text field you had focused, cleaned up by a local LLM if auto-refine is on. Each dictation is also saved to the Captures tab for later replay or re-transcription.

Covered end-to-end in Dictation.

Captures tab

When you don't need to paste into another app and just want a clean transcript of some audio, use the Captures tab. Record in-app, drop in a file (.wav, .mp3, .m4a, .webm, .opus, .flac), or browse dictations that already landed there. Every capture keeps its original audio, can be re-transcribed with a different model, and can be played back through any voice profile you have.
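
As a rough illustration of the file-type list above, here is a minimal sketch of an extension check. The function name `is_supported_capture_file` is hypothetical and not part of Sonna. Only the extension list comes from this page, and the app may well validate files differently (for example, by inspecting their contents).

```python
from pathlib import Path

# Extensions listed on this page for files dropped into the Captures tab.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".webm", ".opus", ".flac"}


def is_supported_capture_file(path: str) -> bool:
    """Hypothetical helper: True if the file extension is in the documented list."""
    # Compare case-insensitively so "memo.M4A" is treated like "memo.m4a".
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this is only a cheap pre-filter. An extension says nothing about whether the audio inside can actually be decoded.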

Covered in Captures.

Voice profile samples

A separate flow, in the Voices tab. When you're creating a profile from an audio clip, the sample is what the cloning engine actually learns from. The reference_text on a sample must match the audio verbatim, which is why samples use a different data model from captures.

You can promote a capture to a sample from the Captures tab's Send-to menu ("Use as voice sample…"). This opens a reference-text confirm dialog so you can fix any remaining transcription errors (roughly the last ~10%) before saving.

Covered in Creating Voice Profiles.

Transcription models

All three paths share the same Whisper models. Pick a default in Settings → Captures → Transcription; override per capture if you need to.

| Model | Size | When to pick it |
| --- | --- | --- |
| Whisper Base | ~300 MB | Fast. Default. Good for clean speech. |
| Whisper Small | ~500 MB | Better quality, still fast. |
| Whisper Medium | ~1.5 GB | High quality. |
| Whisper Large | ~3 GB | Best quality, slow on CPU. |
| Whisper Turbo | ~1.5 GB | Large-tier quality, ~5× faster than Large. |

On Apple Silicon the model runs through MLX-Whisper (~8× faster than PyTorch). Everywhere else it runs through PyTorch transformers. The backend picks the right one — you don't configure it.
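
To make the automatic backend choice concrete, here is a minimal sketch of how such a selection could be written. It is not Sonna's actual code. The function name `pick_stt_backend` and the returned labels are assumptions for illustration, and the only real API used is Python's standard `platform` module.

```python
import platform
from typing import Optional


def pick_stt_backend(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Hypothetical selector: MLX-Whisper on Apple Silicon, PyTorch elsewhere."""
    # Fall back to the values reported for the current host.
    system = system or platform.system()
    machine = machine or platform.machine()
    # Apple Silicon Macs report "Darwin" and "arm64" in a native Python process.
    if system == "Darwin" and machine == "arm64":
        return "mlx-whisper"
    return "pytorch-transformers"
```

One caveat with this kind of check: a Python process running under Rosetta reports `x86_64`, so a sketch like this would fall back to the PyTorch path there.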

For noisy clips, prefer Turbo or Large. Base can hallucinate on hard inputs — most famously the "thanks for watching" loop. Sonna strips those loops deterministically before LLM refinement runs, so a capture can be cleanly re-refined even if the raw transcript has them.
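
Sonna's exact stripping rules are not described on this page. The sketch below shows one simple deterministic approach, assuming a loop is a run of consecutive sentences that are identical after normalizing case and punctuation. The function name `collapse_repeated_sentences` and the `max_repeats` parameter are hypothetical.

```python
import re


def collapse_repeated_sentences(text: str, max_repeats: int = 1) -> str:
    """Keep at most `max_repeats` copies of any consecutively repeated sentence."""
    # Split into sentence-like chunks, each ending at ., !, or ? (if present).
    sentences = re.findall(r"[^.!?]+[.!?]*", text)
    kept = []
    last_key = None
    run_length = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        # Normalize so "Thanks for watching!" and "thanks for watching." match.
        key = re.sub(r"\W+", " ", sentence).strip().lower()
        if key == last_key:
            run_length += 1
        else:
            last_key = key
            run_length = 1
        if run_length <= max_repeats:
            kept.append(sentence)
    return " ".join(kept)
```

Because the rule involves no model and no randomness, the same raw transcript always produces the same cleaned text, which is what makes re-refining a capture repeatable.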

Language

You can pass a language hint for short clips (under ~5 seconds) where Whisper's auto-detect is unreliable. Set a default language lock in Settings → Captures → Transcription → Language, or override per capture.
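
The precedence described above (per-capture override first, then the default lock, otherwise auto-detect) can be sketched as follows. The function names, the `None`-means-auto-detect convention, and the exact 5.0-second cutoff are assumptions for illustration, not Sonna's actual settings model.

```python
from typing import Optional

# The "~5 seconds" guideline from the text, treated here as a hard cutoff.
SHORT_CLIP_SECONDS = 5.0


def resolve_language(per_capture: Optional[str] = None,
                     default_lock: Optional[str] = None) -> Optional[str]:
    """Per-capture override wins, then the default lock; None means auto-detect."""
    return per_capture or default_lock or None


def should_suggest_hint(duration_s: float, language: Optional[str]) -> bool:
    """True when a clip is short enough that auto-detect may be unreliable."""
    return language is None and duration_s < SHORT_CLIP_SECONDS
```

For example, a 3-second clip with no override and no default lock resolves to auto-detect, and `should_suggest_hint` flags it as a case where setting a language is worth considering.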

Transcription API

Developer-level detail on the STT backend, model loading, preprocessing, and the /transcribe endpoint lives in the Transcription developer guide. The Captures pipeline also exposes /captures as a higher-level endpoint that wraps STT + archival + optional refinement in one call — see Captures.
