Voice over your video
Add narration with ElevenLabs, OpenAI TTS, Deepgram, or Cartesia.
Four providers, one workflow. Pick the right one:
| Provider | Strength | Pricing tier |
| --- | --- | --- |
| ElevenLabs | Highest quality + voice cloning | $$ |
| OpenAI TTS-1 | Fast, reliable, six built-in voices | $ |
| Deepgram Aura 2 | Lowest latency, cheapest | $ |
| Cartesia Sonic 2 | Ultra-low latency, large voice library | $ |
Agents: discover valid voice ids first
Call list_voices before generate_voice. It returns every built-in
voice with its voiceModel and exact voiceId (e.g. openai-tts-1 → nova,
deepgram-aura-2 → aura-2-thalia-en). Voice ids are provider-specific:
passing an id in the wrong scheme (e.g. aura to Deepgram) falls back safely
to that provider's default voice, but only the exact id gives you the voice
you want.
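
The check described above can be sketched as a small helper that an agent runs on the list_voices result before calling generate_voice. This is a minimal sketch, not the product's client: the list-of-dicts payload shape and the `resolve_voice_id` helper are assumptions made for illustration. Only the voiceModel and voiceId field names and the two example ids come from this page.

```python
from typing import Optional

# Hypothetical shape of a list_voices result; the real payload may differ.
VOICES = [
    {"voiceModel": "openai-tts-1", "voiceId": "nova"},
    {"voiceModel": "deepgram-aura-2", "voiceId": "aura-2-thalia-en"},
]


def resolve_voice_id(voices: list, voice_model: str, wanted: str) -> Optional[str]:
    """Return `wanted` only if it was listed for `voice_model`, else None."""
    # Collect the ids that belong to the requested provider model.
    valid = {v["voiceId"] for v in voices if v["voiceModel"] == voice_model}
    return wanted if wanted in valid else None


# An id from the wrong provider is rejected up front, so the agent can pick
# a listed id instead of silently getting the provider's default voice.
print(resolve_voice_id(VOICES, "deepgram-aura-2", "aura-2-thalia-en"))
print(resolve_voice_id(VOICES, "deepgram-aura-2", "nova"))
```

Returning None rather than raising leaves the choice to the caller: retry with a listed id, or accept the provider's default voice.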
Pull your voice library
1. Add the provider's key in Settings → Keys.
2. Visit /voices. Hit Sync on the provider's row. Your library
lands in the grid.
3. Click ▶ on any voice to preview.
Generate audio for a shot
In the shot inspector, expand Avatar. The audio row has two tabs:
- Upload: drop in an mp3/wav you already have.
- From script: paste text, pick a voice, and hit Generate. The selected
  provider synthesizes the audio (via your key on BYOK, or Varosity Credits);
  the audio is then uploaded to Storage and attached to the shot, all in one
  click.
Voice cloning (ElevenLabs only)
The in-app cloning UI at /voices is coming in v2. Until then, clone the voice
in ElevenLabs's own UI, and /voices will sync the cloned voice on the next
refresh.
Tips
- Keep narration scripts under 30 seconds per shot. Longer reads stretch
  past video durations and clip awkwardly.
- For dialogue (lip-sync to the on-screen face), use **Veo 3.1 + Full
  avatar mode** instead of TTS over a separate clip. Veo synthesizes
  the lip motion from the audio.
- TTS audio attached to a shot becomes its primary audio track in the
  stitch. Background music ducks under it automatically; see the music
  guide.
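
The 30-second guideline in the first tip can be checked roughly before generating audio. The sketch below estimates read time from word count. The 150 words-per-minute rate is an assumed average speaking pace, not a figure from any provider, and both helper names are hypothetical. Actual duration varies by voice and provider, so treat the result as a pre-flight estimate only.

```python
def estimate_seconds(script: str, words_per_minute: float = 150.0) -> float:
    """Estimate narration length from word count (assumed speaking rate)."""
    words = len(script.split())
    return words * 60.0 / words_per_minute


def fits_shot(script: str, limit_seconds: float = 30.0,
              words_per_minute: float = 150.0) -> bool:
    """True if the estimated read time stays within the per-shot limit."""
    return estimate_seconds(script, words_per_minute) <= limit_seconds


script = "word " * 75  # 75 words at the assumed rate is about 30 seconds
print(estimate_seconds(script), fits_shot(script))
```

If a script fails the check, splitting it across two shots is usually easier than speeding up the voice.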