Recording & Transcription

A map of the three places you can record and transcribe audio in Sonna — dictation, captures, and voice-profile samples.

Overview

Sonna records and transcribes audio in three different contexts, each feeding a different surface in the app. This page is a map; follow the links for the detail.

Goal	Where	Docs
Speak and have your words land in another app	Global hotkey → Captures tab + auto-paste	Dictation
Record a thought, a meeting, or a voice memo inside Sonna	Captures tab	Captures
Record a clip to clone a voice from	Voices tab → profile samples	Creating Voice Profiles

All three paths share the same STT backend — it's the surrounding workflow that differs.

Dictation

The 0.5.0 headline feature. Hold a key chord anywhere on your machine, speak, release. The transcript lands in whatever text field had focus, cleaned up by a local LLM if auto-refine is on. Captures accumulate in the Captures tab for later replay or re-transcription.

Covered end-to-end in Dictation.

Captures tab

When you don't need to paste into another app and just want a clean transcript of some audio, the Captures tab is the place to work. Record in-app, drop in a file (.wav, .mp3, .m4a, .webm, .opus, .flac), or browse dictations that already landed there. Every capture keeps its original audio, can be re-transcribed with a different model, and can be played back through any voice profile you have.
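As a rough illustration of the file types listed above, the check below tests whether a path has one of the accepted extensions. It is a minimal sketch, not Sonna's own validation code: the names SUPPORTED_EXTENSIONS and is_supported_capture_file are hypothetical, and the app may well inspect file contents rather than extensions.

```python
from pathlib import Path

# Extensions taken from the Captures tab description above.
# The constant and function names are illustrative, not part of Sonna.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".webm", ".opus", ".flac"}


def is_supported_capture_file(path: str) -> bool:
    """Return True if the file's extension is one of the listed audio types."""
    # Compare case-insensitively so "MEMO.WAV" is treated like "memo.wav".
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A script that batches files for import could use this kind of check to skip anything the Captures tab would not accept.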

Covered in Captures.

Voice profile samples

A separate flow, in the Voices tab. When you create a profile from an audio clip, the sample is what the cloning engine actually learns from. The reference_text on a sample must match the audio verbatim, which is why samples use a different data model from captures.

You can promote a capture to a sample from the Captures tab's Send-to menu ("Use as voice sample…"), which opens a reference-text confirmation dialog so you can correct the remaining transcript errors (roughly the last 10%) before saving.

Covered in Creating Voice Profiles.

Transcription models

All three paths share the same Whisper models. Pick a default in Settings → Captures → Transcription; override per capture if you need to.

Model	Size	When to pick it
Whisper Base	~300 MB	Fast. Default. Good for clean speech.
Whisper Small	~500 MB	Better quality, still fast.
Whisper Medium	~1.5 GB	High quality.
Whisper Large	~3 GB	Best quality, slow on CPU.
Whisper Turbo	~1.5 GB	Large-tier quality, ~5× faster than Large.

On Apple Silicon the model runs through MLX-Whisper (~8× faster than PyTorch). Everywhere else it runs through PyTorch transformers. The backend picks the right one — you don't configure it.
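The automatic choice described above could look roughly like the following. This is a sketch under assumptions, not Sonna's actual selection logic: the function name pick_stt_backend and the returned labels are hypothetical, and the only facts relied on are that Python's platform module reports "Darwin" and "arm64" on a native Apple Silicon interpreter.

```python
import platform
from typing import Optional


def pick_stt_backend(system: Optional[str] = None,
                     machine: Optional[str] = None) -> str:
    """Pick an STT backend label from the host OS and CPU architecture.

    Arguments default to the current host; they are parameters only so the
    decision can be exercised without running on each platform.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    # Apple Silicon Macs report Darwin + arm64 under a native interpreter.
    if system == "Darwin" and machine == "arm64":
        return "mlx-whisper"
    # Everything else falls back to the PyTorch path.
    return "pytorch-transformers"
```

Note that a Python interpreter running under Rosetta reports x86_64, so a check like this would select the PyTorch path in that case.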

For noisy clips, prefer Turbo or Large. Base can hallucinate on hard inputs — most famously the "thanks for watching" loop. Sonna strips those loops deterministically before LLM refinement runs, so a capture can be cleanly re-refined even if the raw transcript has them.
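To make "deterministic" concrete, here is one simple way such a loop filter might work: collapse any run of identical consecutive sentences down to a single copy. The rule, the min_run threshold, and the function name are all assumptions made for illustration; the exact rule Sonna applies is not documented on this page.

```python
import re
from itertools import groupby


def collapse_repeated_sentences(text: str, min_run: int = 3) -> str:
    """Collapse runs of min_run or more identical consecutive sentences."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    kept = []
    # Group neighbours that match once case and end punctuation are ignored.
    for _, group in groupby(sentences, key=lambda s: s.lower().strip(" .!?")):
        run = list(group)
        # A long run is treated as a hallucination loop: keep one copy only.
        kept.extend(run[:1] if len(run) >= min_run else run)
    return " ".join(kept)
```

Because the rule involves no model and no randomness, the same raw transcript always yields the same cleaned text, which is what allows refinement to be re-run on a stable input.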

Language

Whisper's language auto-detect is unreliable on short clips (under ~5 seconds), so you can pass a language hint for them. Set a default language lock in Settings → Captures → Transcription → Language, or override per capture.
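The precedence between the two settings can be sketched as below, assuming a per-capture override wins over the default lock and that no value at all means auto-detect. The function names and the 5-second constant are illustrative, taken from the approximate figure above rather than from Sonna's code.

```python
from typing import Optional

# Approximate threshold from the text above; not an exact Sonna constant.
SHORT_CLIP_SECONDS = 5.0


def resolve_language(override: Optional[str],
                     default_lock: Optional[str]) -> Optional[str]:
    """Return the language code to pass to STT, or None for auto-detect."""
    # A per-capture override takes priority over the default language lock.
    return override or default_lock


def should_suggest_hint(duration_s: float, language: Optional[str]) -> bool:
    """True when a short clip would otherwise rely on auto-detect."""
    return language is None and duration_s < SHORT_CLIP_SECONDS
```

For example, a 3-second clip with no override and no lock resolves to None, so should_suggest_hint returns True and a caller could prompt for a language before transcribing.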

Transcription API

Developer-level detail on the STT backend, model loading, preprocessing, and the /transcribe endpoint lives in the Transcription developer guide. The Captures pipeline also exposes /captures as a higher-level endpoint that wraps STT + archival + optional refinement in one call — see Captures.

Next steps

Dictation

Captures

Creating Voice Profiles
