Dictation
Hold a key anywhere on your machine, speak, release — the transcript lands in whatever text field you had focused.
Overview
Dictation lets you turn speech into clean text anywhere on your computer. Hold a chord, talk, release — Sonna transcribes what you said with Whisper, optionally cleans it up with a local LLM, and pastes the result into the text field you had focused when you started.
Everything happens on your hardware. No cloud, no accounts, no audio leaving the machine.
Dictation was introduced in 0.5.0 alongside the Captures tab and the per-profile personality modes. It's the "input" half of Sonna's voice I/O loop — cloning and TTS are still the "output" half.
The flow
Hold the push-to-talk chord anywhere on your machine. A small pill fades in over your current app.
The pill shows Recording with a live waveform and an elapsed-time
counter. Speak naturally — you don't have to wait for anything.
On release, the pill flips to Transcribing, then Refining if
auto-refine is on, then disappears.
If auto-paste is enabled and Sonna has Accessibility permission, the transcript pastes into the text field you had focused when you started talking — not wherever focus drifted while you were speaking.
Either way, every capture also appears in the Captures tab with the original audio and the transcript paired together. See Captures for what you can do with them after the fact.
Push-to-talk and toggle modes
Sonna ships two chord behaviors out of the box:
| Mode | Default (macOS) | Default (Windows) | Behavior |
|---|---|---|---|
| Push-to-talk | Right ⌘ + Right ⌥ | Right Ctrl + Right Shift | Recording stops when you release the chord. |
| Toggle-to-talk | Push-to-talk + Space | Push-to-talk + Space | Recording keeps going until you tap the chord again. |
Holding PTT and tapping Space mid-hold upgrades a hold into a toggled
session without a gap in the audio. This is the single most useful detail of
the chord system — short bursts feel fast, long-form narration feels
hands-free, and there's no decision up front about which mode you wanted.
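The hold-to-toggle upgrade can be pictured as a three-state machine. The sketch below is illustrative only: the event names (`ptt_down`, `ptt_up`, `toggle_chord`) and the `ChordSession` class are hypothetical and are not taken from Sonna's source.

```python
from enum import Enum, auto


class Mode(Enum):
    IDLE = auto()
    HOLD = auto()      # push-to-talk: recording while the chord is held
    TOGGLED = auto()   # toggle-to-talk: recording until the toggle chord is hit again


class ChordSession:
    """Tracks one dictation session. Event names are hypothetical."""

    def __init__(self) -> None:
        self.mode = Mode.IDLE

    @property
    def recording(self) -> bool:
        return self.mode is not Mode.IDLE

    def on_event(self, event: str) -> None:
        if self.mode is Mode.IDLE and event == "ptt_down":
            self.mode = Mode.HOLD        # start recording
        elif self.mode is Mode.HOLD and event == "toggle_chord":
            self.mode = Mode.TOGGLED     # upgrade in place; recording never stops
        elif self.mode is Mode.HOLD and event == "ptt_up":
            self.mode = Mode.IDLE        # plain push-to-talk ends on release
        elif self.mode is Mode.TOGGLED and event == "toggle_chord":
            self.mode = Mode.IDLE        # hitting the toggle chord again stops it
        # Any other combination (for example ptt_up while toggled) is ignored.
```

Because the upgrade is a state change rather than a stop followed by a start, the session stays in a recording state throughout, which is what keeps the audio free of gaps in this model.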
The on-screen pill
While you're dictating, a floating pill appears over the current app. It walks through the states of the capture cycle and shows live signals for each:
| State | What it shows |
|---|---|
| Recording | Live waveform + elapsed time. |
| Transcribing | Thinking waveform while Whisper runs. |
| Refining | Same thinking waveform while the LLM cleans up the transcript (only if auto-refine is on). |
| Error | Red tint. Click the pill to copy the error to your clipboard. Auto-dismisses. |
The pill is transparent, always-on-top, and pre-created hidden at app start — so it appears instantly when you hit the chord, with no window flash.
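As a rough model of the capture cycle in the table above, the transitions could be written as a single function. This is a sketch under assumptions: `PillState` and `next_state` are hypothetical names, and the real pill may track more detail than this.

```python
from enum import Enum
from typing import Optional


class PillState(Enum):
    RECORDING = "Recording"
    TRANSCRIBING = "Transcribing"
    REFINING = "Refining"
    ERROR = "Error"


def next_state(
    current: PillState, auto_refine: bool, failed: bool = False
) -> Optional[PillState]:
    """Return the state that follows `current`, or None when the pill hides."""
    if failed:
        return PillState.ERROR
    if current is PillState.RECORDING:
        return PillState.TRANSCRIBING
    if current is PillState.TRANSCRIBING and auto_refine:
        return PillState.REFINING
    # After Transcribing without refine, after Refining, or once an
    # Error has auto-dismissed, the pill disappears.
    return None
```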
Customizing the chord
Open Settings → Captures → Dictation to change either chord.
- Left vs right modifier badges. When you hold keys into the chord picker, Sonna records whether each modifier is the left or right variant. That means you can bind to just the right ⌥ while leaving the left ⌥ alone, which is useful if you want dictation on one hand and keep your other-hand shortcuts intact.
- Chord defaults are picked to stay out of your way. On macOS, the defaults deliberately avoid left-hand `Cmd+Option` chords so `Cmd+Option+I` (devtools), `Cmd+Option+Esc` (force quit), and `Cmd+Option+Space` (Spotlight) all remain yours. On Windows, the defaults route around AltGr collisions on German / French / Spanish layouts where `Ctrl+Alt` synthesizes AltGr.
- Live reload. Changing a chord in Settings takes effect immediately: no restart, no tab reload.
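To illustrate why recording the side of each modifier matters, here is a minimal sketch of side-aware chord matching. The key labels (`MetaRight`, `AltLeft`, and so on) are hypothetical, and the exact-match rule is an assumption rather than Sonna's documented behavior.

```python
from typing import FrozenSet, Iterable

# Key names are hypothetical labels; real key codes depend on the hotkey backend.
Chord = FrozenSet[str]

# The macOS push-to-talk default from the table above: Right Cmd + Right Option.
MAC_PUSH_TO_TALK: Chord = frozenset({"MetaRight", "AltRight"})


def chord_matches(chord: Chord, pressed: Iterable[str]) -> bool:
    """True only when exactly the chord's keys are down, sides included."""
    return frozenset(pressed) == chord
```

With this rule, holding the left-hand Cmd and Option keys never matches the right-hand chord, so left-hand shortcuts pass through untouched.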
Auto-paste into the focused app
Once transcription finishes, Sonna can synthesize a native paste into whatever text field had focus when you started the chord. Your clipboard is saved before and restored after, so nothing you had copied goes missing.
| Platform | Mechanism |
|---|---|
| macOS | CGEventPost at the HID tap with a full ⌘V key sequence, preceded by reactivating the original app via NSRunningApplication. |
| Windows | SendInput with correct scan codes, plus a SetForegroundWindow + AttachThreadInput handshake to defeat foreground-lock when pasting into a window that wasn't frontmost at chord-start. |
Focus is snapshotted at chord-start. The paste targets the original field even if focus drifts during transcribe / refine — that's the "pastes where you were talking from, not where you're looking now" behavior.
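The ordering of the paste steps is easier to see in code. The sketch below uses a fake desktop object in place of the real OS calls listed in the table; `FakeDesktop`, `paste_transcript`, and their methods are all hypothetical stand-ins, not Sonna's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FakeDesktop:
    """Stand-in for the OS; every method here is hypothetical."""
    focused: str = "editor"
    clipboard: Optional[str] = None
    pasted: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def activate(self, target: str) -> None:
        self.focused = target

    def send_paste(self) -> None:
        # Records which app received the paste and what was on the clipboard.
        self.pasted.append((self.focused, self.clipboard))


def paste_transcript(desktop: FakeDesktop, target: str, transcript: str) -> None:
    saved = desktop.clipboard            # 1. save whatever the user had copied
    try:
        desktop.clipboard = transcript   # 2. put the transcript on the clipboard
        desktop.activate(target)         # 3. return to the app from chord-start
        desktop.send_paste()             # 4. synthesize the paste keystroke
    finally:
        desktop.clipboard = saved        # 5. restore the user's clipboard
```

In this model, `target` is captured when the chord starts and passed in later, so the paste reaches the original app even if `desktop.focused` changed while transcription ran.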
Auto-paste is optional. If Accessibility permission isn't granted (macOS), or you prefer to keep synthetic input off, dictation still runs — transcripts land in the Captures tab and you can copy them manually. The setting lives inline next to the Accessibility prompt in Settings → Captures → Dictation, not as a global banner.
Refinement
If auto-refine is on, a local LLM cleans up the raw Whisper transcript before it's pasted. The goal is to remove verbal clutter without rewriting what you actually said.
What refinement typically fixes:
- Filler words (`um`, `uh`, `like` used as pauses, `you know`)
- Self-corrections: the LLM keeps the final version and drops earlier attempts (`could you uh run the migration real quick, and then, yeah, check the logs` → `Could you run the migration, then check the logs?`)
- Basic punctuation and capitalization
- Whisper loop hallucinations — Sonna strips repeated tokens (six or more identical tokens in a row, case-insensitive) before the LLM sees the transcript, so a small refinement model can't echo them back
What refinement deliberately preserves:
- Technical terms and code identifiers (`npm install`, `handleSubmit`)
- Legitimate repetition (`no, no, no, no, no` has fewer than six identical tokens, so it survives)
- Your intent: refinement is cleanup, not rewriting
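The repeated-token filter described above can be approximated in a few lines. This is a sketch under two assumptions that the text does not confirm: tokens are split on whitespace, and a run of six or more is collapsed to a single token rather than removed outright. The function name `strip_token_loops` is hypothetical.

```python
from itertools import groupby


def strip_token_loops(text: str, threshold: int = 6) -> str:
    """Collapse runs of `threshold` or more identical tokens to a single token.

    Comparison is case-insensitive; shorter runs are left untouched.
    """
    kept = []
    for _, group in groupby(text.split(), key=str.lower):
        run = list(group)
        if len(run) >= threshold:
            kept.append(run[0])   # keep one copy of a looping token
        else:
            kept.extend(run)      # legitimate repetition survives
    return " ".join(kept)
```

Under these assumptions, `the the The the THE the end` reduces to `the end`, while a run of five identical tokens passes through unchanged.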
Refinement flags are snapshotted per capture, so you can re-refine the same raw transcript later with different flags without losing the original.
The refinement model picker (Settings → Captures → Refinement) offers three bundled Qwen3 sizes:
| Model | Size | Best for |
|---|---|---|
| Qwen3 0.6B | ~400 MB | Default. Very fast, good for casual dictation. |
| Qwen3 1.7B | ~1.1 GB | Sweet spot when transcripts contain code identifiers. |
| Qwen3 4B | ~2.5 GB | Full quality, slowest. |
This is the same local LLM used by the per-profile personality modes — one LLM in the app, not two. See Voice Personalities.
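One way to picture the per-capture flag snapshot mentioned earlier in this section is a record that keeps the raw transcript immutable and appends each refinement pass alongside a copy of its flags. The class names, the flag name `remove_fillers`, and the `refiner` callback are hypothetical; Sonna's actual storage format is not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Refinement:
    flags: Dict[str, bool]   # copy of the flags in effect for this pass
    text: str


@dataclass
class Capture:
    raw_transcript: str      # never overwritten by refinement
    refinements: List[Refinement] = field(default_factory=list)

    def refine(
        self,
        flags: Dict[str, bool],
        refiner: Callable[[str, Dict[str, bool]], str],
    ) -> Refinement:
        # dict(flags) snapshots the flags so later edits to the caller's
        # settings do not change what this pass is recorded as having used.
        result = Refinement(flags=dict(flags), text=refiner(self.raw_transcript, flags))
        self.refinements.append(result)
        return result
```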
Platform notes
macOS
- Accessibility permission is required for auto-paste. The prompt lives inline next to the toggle in Settings → Captures → Dictation, with a deep link to System Settings → Privacy & Security → Accessibility.
- TSM crash mitigation. The global hotkey listener runs on a background thread with `set_is_main_thread(false)` to sidestep a known macOS 14+ crash in the `rdev` library. If you hit an unexpected dictation failure on macOS, check the logs for TSM-related messages.
Windows
- UAC / UIPI caveat. Synthetic paste into an elevated window from a non-elevated Sonna is blocked by Windows itself. Run Sonna elevated if you regularly dictate into elevated apps (e.g. an elevated terminal or Task Manager).
- Right-hand default chord (`Ctrl+Shift`) avoids AltGr collisions on keyboard layouts where `Ctrl+Alt` is treated as AltGr (German, French, Spanish, some others).
Linux
- Not yet in this release. The Rust shim ships the macOS and Windows paths in 0.5.0. Linux `uinput` / AT-SPI support and the Wayland paste story are tracked in `docs/plans/VOICE_IO.md`.
When auto-paste skips itself
A few cases where Sonna deliberately does not synthesize a paste:
- Focus was inside Sonna when the chord started. The transcript goes to the Captures tab so a dictation-into-Sonna round-trip doesn't accidentally paste into the generate box.
- No text focus detected. The transcript still lands in the Captures tab; copy it from there with one click.
- Accessibility permission not granted on macOS. Same — Captures tab only.
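The skip cases above amount to a short chain of guards. The following sketch encodes them as a single function; `FocusSnapshot`, `skip_reason`, and the parameter names are hypothetical, and the order of the checks is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FocusSnapshot:
    app: str               # app that owned focus at chord-start
    has_text_field: bool   # whether a text field was focused


def skip_reason(
    snapshot: FocusSnapshot,
    auto_paste_enabled: bool,
    platform: str,
    accessibility_granted: bool,
) -> Optional[str]:
    """Return why auto-paste is skipped, or None if the paste should go ahead."""
    if not auto_paste_enabled:
        return "auto-paste disabled"
    if platform == "macos" and not accessibility_granted:
        return "accessibility permission not granted"
    if snapshot.app == "Sonna":
        return "focus was inside Sonna"
    if not snapshot.has_text_field:
        return "no text focus detected"
    return None
```

In every skipped case the transcript is still available in the Captures tab, so a non-`None` result only means the synthetic paste is not attempted.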