Voice over your video

Add narration with ElevenLabs, OpenAI TTS, Deepgram, or Cartesia.

Four providers, one workflow. Pick the right one:

Provider	Strength	Pricing tier
ElevenLabs	Highest quality + voice cloning	$$
OpenAI TTS-1	Fast, reliable, six built-in voices	$
Deepgram Aura 2	Lowest latency, cheapest	$
Cartesia Sonic 2	Ultra-low latency, large voice library	$

Agents: discover valid voice ids first

Call list_voices before generate_voice. It returns every built-in voice with its voiceModel and exact voiceId (e.g. openai-tts-1 → nova, deepgram-aura-2 → aura-2-thalia-en). Voice ids are provider-specific: passing an id in the wrong scheme (e.g. aura to Deepgram) safely falls back to that provider's default voice, so the call succeeds but may not use the voice you intended. Pass the exact id to get the voice you want.
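
As a rough illustration of the check described above, the sketch below validates a requested voice id against a list_voices-style result before it would be passed to generate_voice. The `Voice` class, the `pick_voice_id` helper, and the shape of `sample_voices` are hypothetical stand-ins; the real response format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    # Hypothetical record mirroring the voiceModel / voiceId pair.
    voice_model: str  # e.g. "openai-tts-1"
    voice_id: str     # e.g. "nova"


def pick_voice_id(voices, voice_model, wanted):
    """Return `wanted` if it is a listed id for `voice_model`, else None."""
    valid = {v.voice_id for v in voices if v.voice_model == voice_model}
    return wanted if wanted in valid else None


# Stand-in for a list_voices result (assumed shape, for illustration only).
sample_voices = [
    Voice("openai-tts-1", "nova"),
    Voice("deepgram-aura-2", "aura-2-thalia-en"),
]

# An exact id for the chosen model is accepted.
chosen = pick_voice_id(sample_voices, "deepgram-aura-2", "aura-2-thalia-en")

# An id from another provider's scheme is rejected here, instead of
# silently falling back to the provider default at generation time.
mismatch = pick_voice_id(sample_voices, "deepgram-aura-2", "nova")
```

Returning None on a mismatch lets an agent notice the problem and choose a listed id, rather than relying on the default-voice fallback.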

Pull your voice library

1. Add the provider's key in Settings → Keys.
2. Visit /voices and hit Sync on the provider's row. Your library lands in the grid.
3. Click ▶ on any voice to preview.

Generate audio for a shot

In the shot inspector, expand Avatar. The audio row has two tabs:

Upload: drop an mp3/wav you already have.
From script: paste text, pick a voice, hit Generate. The selected provider synthesizes the audio (via your key on BYOK, or with Varosity Credits); the audio is then uploaded to Storage and attached to the shot, all in one click.

Voice cloning (ElevenLabs only)

Cloning from /voices is coming soon, planned for v2. Until then, clone the voice in ElevenLabs's UI, and /voices will sync the cloned voice on the next refresh.

Tips

Keep narration scripts under 30 seconds per shot. Longer reads stretch past video durations and clip awkwardly.
For dialogue (lip-sync to the on-screen face), use **Veo 3.1 + Full avatar mode** instead of TTS over a separate clip. Veo synthesizes the lip motion from the audio.
TTS audio attached to a shot becomes its primary audio track in the stitch. Background music ducks under it automatically; see the music guide.
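
To check a script against the 30-second tip before generating, a word-count estimate is usually close enough. The sketch below assumes an average speaking rate of 150 words per minute. That figure is an assumption, not a product setting, and actual duration varies by voice and provider. The helper names are hypothetical.

```python
import re

WORDS_PER_MINUTE = 150  # assumed average speaking rate; tune per voice
MAX_SECONDS = 30        # per-shot limit suggested in the tips above


def estimate_seconds(script, wpm=WORDS_PER_MINUTE):
    """Estimate spoken duration from the whitespace-separated word count."""
    words = len(re.findall(r"\S+", script))
    return words * 60.0 / wpm


def fits_shot(script, max_seconds=MAX_SECONDS):
    """True if the estimated read time is within the per-shot limit."""
    return estimate_seconds(script) <= max_seconds


short_script = "Welcome to the product tour."
long_script = " ".join(["word"] * 120)  # 120 words, about 48 s at 150 wpm

short_ok = fits_shot(short_script)
long_ok = fits_shot(long_script)
```

At the assumed rate, 30 seconds is roughly 75 words, so a script much longer than that is a candidate for splitting across shots.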