Creating Voice Profiles
How to create voice profiles, both cloning-based and preset-based
Overview
A voice profile is a saved voice you can reuse across generations, stories, and the API. As of 0.4, Sonna profiles come in two flavors that map to two different ways of getting a voice:
| Profile type | What it stores | Use when… |
|---|---|---|
| Cloned | One or more reference audio samples + a voice embedding | You want to replicate a specific person's voice |
| Preset | A reference to a pre-built voice in a specific engine | You want a curated, production-ready voice with no audio prep |
Both types live in the same Profiles tab and behave the same way at generation time — pick the type that matches your goal and follow the workflow below.
Not sure which to use? Cloning gives you a specific voice but needs clean audio. A preset gives you a good voice instantly, but you don't get to choose who it sounds like.
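One way to picture the two profile types is as two small record types that share a name and an engine but store different voice data. The sketch below is illustrative only: the class names, field names, and the `example_voice` identifier are assumptions made for this example, not Sonna's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ClonedProfile:
    """Hypothetical shape: reference samples plus a derived voice embedding."""
    name: str
    engine: str  # a cloning engine, e.g. "Qwen3-TTS"
    sample_paths: List[str] = field(default_factory=list)
    embedding: List[float] = field(default_factory=list)


@dataclass
class PresetProfile:
    """Hypothetical shape: a pointer to a pre-built voice in one engine."""
    name: str
    engine: str    # a preset engine, e.g. "Kokoro"
    voice_id: str  # placeholder identifier, not a real catalog ID


Profile = Union[ClonedProfile, PresetProfile]


def needs_audio(profile: Profile) -> bool:
    """Only cloned profiles require reference audio."""
    return isinstance(profile, ClonedProfile)
```

In this model, `needs_audio` is the only place the two types differ at creation time, which mirrors the split between Workflow A and Workflow B.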
Workflow A — Cloned Profiles
Use this when you want to replicate a specific person's voice from a recording.
1. Prepare audio: 10-30 seconds of clear speech with minimal background noise. See Voice Cloning for the engine catalog.
2. Create the profile: go to Profiles → + New Profile and choose a cloning engine (Qwen3-TTS, Chatterbox Multilingual, Chatterbox Turbo, LuxTTS, or TADA).
3. Add a sample: drag in an audio file, or record directly with the in-app recorder.
4. Test: use the profile to generate a test phrase. If quality is poor, add more samples.
Audio Requirements (Cloning Only)
| Requirement | Target | Details |
|---|---|---|
| Duration | 10-30 seconds | Shorter clips give poor quality; longer clips are unnecessary |
| Clarity | Clear speech | No background noise, music, or overlapping voices |
| Quality | High fidelity | 44.1 kHz or 48 kHz sample rate, minimal compression |
| Content | Natural speech | Conversational tone, complete sentences |
File Formats
Supported formats:
- WAV (recommended): lossless quality
- MP3: acceptable when lightly compressed (high bitrate)
- M4A: acceptable
- FLAC: lossless alternative
Use WAV for best results. Avoid heavily compressed formats.
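Before uploading a WAV sample, you can check it against the duration and sample-rate guidance with a few lines of Python. This is a minimal sketch using only the standard-library `wave` module; the function names and warning messages are made up for this example and are not part of Sonna.

```python
import wave
from typing import List

MIN_SECONDS, MAX_SECONDS = 10.0, 30.0
RECOMMENDED_RATES = (44100, 48000)


def check_properties(duration_s: float, sample_rate: int) -> List[str]:
    """Return warnings; an empty list means the clip matches the guidance."""
    warnings = []
    if duration_s < MIN_SECONDS:
        warnings.append(f"too short: {duration_s:.1f}s (aim for 10-30s)")
    elif duration_s > MAX_SECONDS:
        warnings.append(f"longer than needed: {duration_s:.1f}s (aim for 10-30s)")
    if sample_rate not in RECOMMENDED_RATES:
        warnings.append(f"sample rate is {sample_rate} Hz (44100 or 48000 recommended)")
    return warnings


def check_wav(path: str) -> List[str]:
    """Read duration and sample rate from a WAV header and check them."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return check_properties(duration, rate)
```

Calling `check_wav("sample.wav")` on a clean 20-second, 48 kHz recording should return an empty list. The check covers WAV only; other formats would need a decoding library.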
Recording Tips
Quiet Space
- Record in a quiet room
- Turn off fans, AC, appliances
- Close windows to reduce outside noise
- Use soft furnishings to reduce echo
Microphone Placement
- 6-12 inches from mouth
- Slight angle to reduce plosives (p, b, t)
- Use a pop filter if available
- Maintain consistent distance
Recording Settings
- 44.1 kHz or 48 kHz sample rate
- 16-bit or 24-bit depth
- Mono is fine (stereo will be converted)
- Avoid automatic gain control
Speaking Style
- Natural pace — Don't rush or speak too slowly
- Clear articulation — Pronounce words clearly
- Consistent volume — Maintain steady loudness
- Normal tone — Speak as you normally would
- Complete sentences — Avoid fragments or "ums"
Multiple Samples
Adding multiple samples can significantly improve quality:
- Robustness: the model learns a more complete representation of the voice
- Versatility: different speaking styles are handled better
- Quality: fewer artifacts and more natural output
- Consistency: more reliable results across different texts
Consider adding samples with:
- Different tones — casual, formal, excited, calm
- Different content — narratives, questions, statements
- Different recording conditions — studio quality, room acoustics
All samples should be from the same speaker. Mixing voices will produce poor results.
Processing Existing Audio
If you have existing audio (podcasts, videos, etc.):
1. Find clean segments: look for passages with only the target speaker, no background music, and minimal noise.
2. Edit in a tool such as Audacity or Adobe Audition: cut clean 10-30 second segments, remove silence at the start and end, and normalize the volume.
3. Export: save as a high-quality WAV file.
For light background noise, use Audacity's noise reduction (gentle settings — over-processing introduces artifacts).
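If you prefer to script the trimming and normalizing steps, the idea can be sketched on raw 16-bit PCM sample values. This is an illustration of the two operations, not what Audacity or Sonna does internally; the silence threshold and the target peak (roughly -1 dBFS for 16-bit audio) are assumptions you would tune for your material.

```python
from typing import List


def trim_silence(samples: List[int], threshold: int = 500) -> List[int]:
    """Drop leading and trailing samples quieter than the threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def normalize_peak(samples: List[int], target_peak: int = 29204) -> List[int]:
    """Scale samples so the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    # Clamp to the signed 16-bit range in case of rounding at the edges.
    return [max(-32768, min(32767, round(s * gain))) for s in samples]
```

Trimming before normalizing keeps a stray click in the silent lead-in from setting the peak level.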
Testing & Iteration
After creating a cloned profile:
1. Generate a simple phrase, such as "Hello, this is a test of my voice profile."
2. Listen for natural tone, clear pronunciation, proper prosody, and a lack of artifacts.
3. If quality is poor, add more samples, try different source audio, or check the quality of the existing samples.
Common Issues
Robotic Voice
Cause: Samples are low quality or too short
Fix: Use longer, higher-quality samples
Wrong Tone
Cause: The tone of the samples doesn't match the desired output
Fix: Record samples in the style you want to generate
Artifacts/Glitches
Cause: Background noise or other audio defects in the samples
Fix: Clean up the samples or re-record in a quieter environment
Workflow B — Preset Profiles
Use this when you want a ready-made voice without recording anything. Available engines: Kokoro 82M (50 voices) and Qwen CustomVoice (9 voices). See Preset Voices for the full catalog.
1. Create the profile: go to Profiles → + New Profile and choose Kokoro or Qwen CustomVoice as the engine.
2. Pick a voice: the engine's voice catalog appears. Click any voice to preview it.
3. Name the profile. No audio sample is required.
4. Generate: the profile is ready immediately. Use it in the floating generate box or on the Generate page.
Preset profiles are locked to their source engine. Switching to a different engine in the floating generate box greys out the profile, since the voice only exists in that engine. Clicking a greyed profile auto-switches the engine back.
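The engine-locking rule can be expressed as two small functions. This is a sketch of the behavior described above, not Sonna's code; the `PresetProfile` class, both function names, and the `example_voice` identifier are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetProfile:
    name: str
    engine: str    # the only engine in which this voice exists
    voice_id: str  # placeholder identifier, not a real catalog ID


def is_greyed_out(profile: PresetProfile, selected_engine: str) -> bool:
    """A preset is unavailable whenever a different engine is selected."""
    return profile.engine != selected_engine


def engine_after_click(profile: PresetProfile, selected_engine: str) -> str:
    """Clicking a greyed-out preset switches back to its source engine."""
    if is_greyed_out(profile, selected_engine):
        return profile.engine
    return selected_engine
```

For a Kokoro preset, `engine_after_click(profile, "Qwen CustomVoice")` returns `"Kokoro"`, matching the auto-switch behavior.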
Qwen CustomVoice + Instruct
Preset voices in Qwen CustomVoice support delivery instructions — natural-language style control over tone, pace, and emotion. The floating generate box shows a slider icon next to the generate button when a Qwen CustomVoice profile is selected; click it to reveal the instruct textarea.
See Preset Voices → Using Instruct Mode for examples.
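If you drive generation from a script, the instruct text can be treated as an optional field that only applies to Qwen CustomVoice. The payload shape below is purely hypothetical: the `engine`, `text`, and `instruct` keys and the `build_request` helper are assumptions for illustration, and silently dropping the instruction for other engines is a design choice of this sketch rather than documented behavior. Check the API reference for the real request format.

```python
from typing import Dict, Optional

# Engines that accept delivery instructions, per the section above.
INSTRUCT_ENGINES = {"Qwen CustomVoice"}


def build_request(engine: str, text: str,
                  instruct: Optional[str] = None) -> Dict[str, str]:
    """Build a hypothetical generation payload, adding instruct only where supported."""
    request = {"engine": engine, "text": text}
    if instruct and engine in INSTRUCT_ENGINES:
        request["instruct"] = instruct
    return request
```

A call such as `build_request("Qwen CustomVoice", "Welcome back.", "warm, slow, slightly amused")` keeps the instruction, while the same call with `"Kokoro"` omits it.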
Advanced Tips
Celebrity / Character Voices (Cloning)
For cloning public figures or characters:
- Legal considerations: make sure you have the rights to the voice, or that your use clearly qualifies as fair use
- Source quality: find high-quality interview audio or clean clips
- Consistency: use clips in which the speaker talks in a similar style
- Multiple samples: especially important for recognizable voices
Accent & Dialect (Cloning)
Cloning models preserve accent and dialect:
- British English samples generate British English output
- Southern accent samples produce Southern accent output
- Regional pronunciations are maintained
Emotion Transfer (Cloning)
The emotional tone of samples affects generation:
- Energetic samples → energetic output
- Calm samples → calm output
- Mix samples for a more versatile profile
For Qwen CustomVoice presets, use the instruct field instead of relying on sample emotion — that's exactly what it controls.
Managing Profiles
Organization
- Descriptive names — "John Smith - Professional Narrator"
- Add descriptions — Note recording conditions, use cases, or which preset voice
- Language tags — Mark the primary language
- Archive unused — Keep profile list manageable
Export / Import
- Export profiles to share or back up
- Import from colleagues or teammates
- Cloned profiles export with their voice embeddings (not the original audio)
- Preset profiles export as engine + voice ID metadata only — the importer must have that engine's model installed
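The difference between the two export shapes can be sketched as follows. The dictionary keys, the JSON encoding, and both function names are assumptions for illustration; the actual export file format is not described here. The import check only models the preset requirement, because no equivalent requirement is stated for cloned profiles.

```python
import json
from typing import Any, Dict, Set


def export_profile(profile: Dict[str, Any]) -> str:
    """Serialize a profile, leaving out original audio for cloned profiles."""
    payload = {"type": profile["type"], "name": profile["name"],
               "engine": profile["engine"]}
    if profile["type"] == "cloned":
        # Embedding only; the reference audio files are not included.
        payload["embedding"] = profile["embedding"]
    else:
        # Presets carry just the engine and voice ID metadata.
        payload["voice_id"] = profile["voice_id"]
    return json.dumps(payload)


def preset_engine_available(exported: str, installed_engines: Set[str]) -> bool:
    """For presets, the importer must already have the source engine installed."""
    payload = json.loads(exported)
    if payload["type"] != "preset":
        return True  # not modeled in this sketch
    return payload["engine"] in installed_engines
```

Because a preset export contains no voice data of its own, the engine check is what decides whether the imported profile can produce audio.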