# Varosity — full content dump

Source: https://varosity.ai/llms-full.txt

Spec: https://llmstxt.org/

This file is the consolidated positioning, model registry, example workflows, and guides for Varosity. Generated from docs/guides/*.md and lib/providers/registry.ts on each build.

## Positioning

> Your agent. Your accounts. Every model.

Varosity is the API agents call when they need to generate video, voice, music, or images. The differentiator is BYOK (bring your own provider keys) at zero markup, combined with MCP-native access and multi-vendor aggregation across forty-plus frontier models.

### What Varosity does

Varosity routes agent-initiated generation requests across providers including fal.ai, Runway, Luma, ElevenLabs, Suno, OpenAI, Replicate, Pika, Hailuo, and others. The customer can use their own provider accounts (zero markup, billed directly by the provider) or Varosity credits (no provider accounts needed; a small service fee applies). The MCP server, REST API, and CLI all use the same `vsk_` API key.

### What makes Varosity different

1. BYOK at zero markup. Bring fal, Runway, ElevenLabs, or any supported provider account. Varosity does not take a percentage. Competitors either don't support BYOK or charge 5% on it.
2. Agent-runtime-agnostic. Works with Claude Desktop, Claude Code, OpenClaw, Hermes, Cowork, Cursor, Windsurf, Codex, ChatGPT Apps, and any MCP-compatible runtime. The protocol is the contract.
3. Multi-vendor aggregation. Forty-plus frontier models across video, voice, music, and image. Single-vendor MCP servers (Higgsfield, MiniMax, Pictory) are walled gardens; Varosity routes across all of them.
4. Multi-shot stitching as a primitive. Chain different models into one MP4 render — Veo for one shot, Kling for the next, Runway for the closer — without managing ffmpeg yourself.
5. Skills for coding agents. The Varosity skills package installs into Claude Code, Cursor, OpenCode, Codex, and others via `npx -y skills add varosity-ai/varosity`.
The skill teaches the agent when to reach for Varosity and how to use it correctly.

## API reference summary

Base URL: https://varosity.ai/api/v1

Auth: Authorization: Bearer vsk_...

OpenAPI 3.1: https://varosity.ai/api/openapi.json

MCP endpoint: https://varosity.ai/api/mcp (Streamable HTTP, JSON-RPC 2.0)

Key routes:

- POST /v1/images — generate an image (synchronous; returns { imageUrl } directly, no polling). Body { prompt, model?, aspect_ratio? }. Default model flux-1-schnell; also nano-banana, imagen-4, flux-1.1-pro, dall-e-3, ideogram-v3. Requires the generate:image scope.
- POST /v1/video/generate — generate video from prompt or reference image
- POST /v1/tts — generate voice (TTS). Body { text, voiceId? } → { audioUrl }. (alias of /tts)
- POST /v1/music — generate music. Body { modelId, prompt, durationSec } → { audioUrl }. (alias of /music)
- POST /v1/projects/{id}/stitch — stitch a project's rendered shots into one MP4
- GET /v1/models — list available models with capabilities and pricing
- POST /v1/route — recommend the best model for a shot description (Smart Route)
- GET /v1/jobs/{id} — poll job status (queued → running → succeeded/failed)
- POST /v1/webhooks — register webhook for job completion events

Job shape:

{ "job_id": "vjob_...", "status": "queued|running|succeeded|failed|canceled", "model": "...", "estimated_cost_cents": 240, "poll_url": "..." }

Error shape:

{ "code": "...", "message": "...", "recovery": "retry|rephrase|switch_model|configure_provider" }

## Example workflows

### Brand campaign video

1. Agent receives brief: '15-second brand video for a luxury watch launch'
2. Calls suggest_model → returns kling-3.0-pro (cinematic, image-to-video)
3. Calls generate_image (Flux) for reference keyframe
4. Calls generate_video (Kling 3.0 Pro, 6s) × 2 shots
5. Calls stitch → final 12s MP4 returned

### Product reel

1. Operator uploads product photos
2. Agent calls pick_reference_images → selects best three
3. Calls generate_video (Veo 3.1) per reference, 4s each
4. Calls generate_voice (ElevenLabs) for voiceover
5. Stitches video + audio → 12s MP4

### News-to-shorts

1. Agent receives RSS item or URL
2. Calls generate_storyboard → 6 shots planned
3. Calls generate_video in parallel (Seedance 4.5)
4. Calls generate_music (Suno) for background
5. Renders and publishes via brand agent gate

## Skills

Install: `npx -y skills add varosity-ai/varosity -a -y`

Verified agents: claude-code, claude-desktop, hermes, cowork

- varosity-multi-shot-consistency: https://varosity.ai/docs/multi-shot-consistency
  Lock reference image in Phase 1, vary model/prompt per shot in Phase 2. Music videos, product reels, brand campaigns.
- varosity-mcp-agent-integration: https://varosity.ai/api/v1/skills/varosity-mcp-agent-integration
  Full integration reference: auth, all 35 tools, connection examples.
- varosity-video-orchestration: https://varosity.ai/api/v1/skills/varosity-video-orchestration
  Director-mode workflow: storyboard → keyframes → renders → approval → stitch.
- varosity-agent-video-sdk: https://varosity.ai/api/v1/skills/varosity-agent-video-sdk
  SDK-level reference for custom video agents.
- custom-video-generation: https://varosity.ai/api/v1/skills/custom-video-generation
  Single-video pipeline: reference image pre-flight → render → delivery.

## Smart Route — picking the right model

Call `POST /v1/route` with a shot description to get a ranked recommendation filtered to your configured providers.
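
A minimal sketch of building that Smart Route request in Python. It assumes the shot text goes in a `description` body field, which is a hypothetical name (the real request schema is in the OpenAPI spec), and that the route sits under the base URL from the API reference summary. The function only builds the request object; nothing is sent.

```python
import json
import urllib.request

# Base URL as listed in the API reference summary.
BASE_URL = "https://varosity.ai/api/v1"


def build_route_request(api_key: str, description: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/route request.

    The `description` field name is an assumption for illustration;
    check the OpenAPI spec for the actual request body.
    """
    body = json.dumps({"description": description}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/route",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder key and shot description, for illustration only.
req = build_route_request("vsk_example", "slow drone push-in over a coastal cliff")
```

Passing `req` to `urllib.request.urlopen` would send it. On failure, the `recovery` field of the documented error shape suggests whether to retry, rephrase, switch models, or configure a provider.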
Quick rules for common shot types:

- Dialogue / talking head / lip-sync → veo-3.1, seedance-4.5, omnihuman, muapi-latentsync, heygen-avatar-4, heygen-photo-avatar, heygen-video-translate
- Camera control / motion brush → ws-runway-gen4
- Physics / complex simulation → hailuo-02, hunyuan-video, ws-hailuo-02
- Talking avatar from photo → omnihuman, heygen-avatar-4, heygen-photo-avatar, heygen-cinematic, heygen-video-agent, heygen-video-translate, d-id-talks
- Fast / draft iteration → imagen-4-fast, ltx-video
- Typography / text-in-image → ideogram-v3, dall-e-3
- Budget / lowest cost → kling-3.0, muapi-latentsync, d-id-talks
- Video-to-video / editing → muapi-latentsync, heygen-video-translate, ws-runway-gen4

See also: routing.json (machine-readable), /docs/picking-the-right-model (human-readable)

## Model registry

### veo-3.1
- name: Veo 3.1
- vendor: fal
- kind: video
- status: stable
- max duration: 8s
- price: ~$0.15/s
- native audio: yes
- routing tags: lip-sync, audio-native, cinematic, talking-heads
- strengths: best lip sync, native 4K, synced audio, talking heads

### kling-3.0
- name: Kling 3.0 Pro
- vendor: fal
- kind: video
- status: stable
- max duration: 10s
- price: ~$0.1/s
- native audio: yes
- routing tags: cinematic, audio-native, budget, multi-shot
- strengths: multi-shot consistency, cinematic motion, dialogue close-ups, 4K

### seedance-4.5
- name: Seedance 4.5
- vendor: fal
- kind: video
- status: stable
- max duration: 12s
- price: ~$0.14/s
- native audio: yes
- routing tags: audio-native, lip-sync, multi-shot
- strengths: unified audio-video generation, multi-shot from one prompt, phoneme-level lip-sync

### sora-2
- name: Sora 2
- vendor: openai
- kind: video
- status: stable
- unavailable: Sora 2 requires OpenAI Sora API access (not included in standard API keys). Contact OpenAI for access.
- max duration: 20s
- price: ~$0.15/s
- native audio: yes
- routing tags: cinematic, physics, premium
- strengths: cinematic quality, prompt adherence, complex physics

### runway-gen-4.5
- name: Runway Gen-4.5
- vendor: runway
- kind: video
- status: stable
- unavailable: Runway direct integration is not yet wired. Use ws-runway-gen4 (WaveSpeed) for Runway Gen-4 today.
- max duration: 16s
- price: ~$0.12/s
- native audio: no
- routing tags: camera-control, video-to-video, motion-brush
- strengths: motion brush, camera control, video-to-video, director-grade editing

### omnihuman
- name: OmniHuman 1.5
- vendor: fal
- kind: video
- status: stable
- max duration: 60s
- price: ~$0.08/s
- native audio: no
- routing tags: talking-avatar, lip-sync
- strengths: talking avatars, from single photo, lip-sync from audio, drop-on-any-shot avatar layer
- weaknesses: needs photo + audio, fixed framing, no scene control
- docs: https://fal.ai/models/fal-ai/bytedance/omnihuman/v1.5

Drives a talking avatar from one photo + an audio clip. Drop on any shot in a brand agent's storyboard.

### kling-3.0-replicate
- name: Kling 3.0 Pro (Replicate)
- vendor: replicate
- kind: video
- status: stable
- max duration: 10s
- price: ~$0.09/s
- native audio: yes
- strengths: multi-shot consistency, cinematic motion, cheaper than direct
- weaknesses: queue latency varies, fewer aspect ratios
- docs: https://replicate.com/kwaivgi/kling-v3-video

### wan-2.5
- name: Wan 2.5
- vendor: replicate
- kind: video
- status: beta
- unavailable: wan-video/wan-2.5 was removed from Replicate. Use ws-pika-2.2 or kling-3.0-replicate.
- max duration: 8s
- price: ~$0.07/s
- native audio: no
- strengths: realistic motion, fast, great for product shots
- weaknesses: less cinematic than Kling, limited camera control
- docs: https://replicate.com/wan-video/wan-2.5

### pika-2.2
- name: Pika 2.2
- vendor: replicate
- kind: video
- status: beta
- unavailable: pika-labs/pika-2.2 was removed from Replicate.
Use ws-pika-2.2 (WaveSpeed) or pika-2.5.
- max duration: 10s
- price: ~$0.08/s
- native audio: no
- strengths: creative transitions, ingredient blends, stylized motion
- weaknesses: less photoreal, occasional flicker
- docs: https://replicate.com/pika-labs/pika-2.2

### veo-3.1-direct
- name: Veo 3.1 (Google direct)
- vendor: google
- kind: video
- status: beta
- unavailable: Direct Google Veo routing is not yet wired. Use veo-3.1 (FAL route) instead.
- max duration: 8s
- price: ~$0.14/s
- native audio: yes
- strengths: best lip sync, native 4K, synced audio, lower hop latency
- weaknesses: regional availability varies
- docs: https://ai.google.dev/gemini-api/docs/video

### luma-ray-3
- name: Luma Ray 3
- vendor: luma
- kind: video
- status: beta
- unavailable: Direct Luma key is not authenticated — use ws-luma-ray-2 (Luma Ray 2 via WaveSpeed) on Credits. Re-enable once LUMA_PLATFORM_KEY is valid.
- max duration: 10s
- price: ~$0.14/s
- native audio: no
- strengths: natural motion, fluid camera moves, cinematic B-roll
- weaknesses: lip sync weaker than Veo, shorter durations
- docs: https://docs.lumalabs.ai/

### pika-2.5
- name: Pika 2.5
- vendor: fal
- kind: video
- status: beta
- max duration: 10s
- price: ~$0.08/s
- native audio: no
- strengths: stylized motion, character consistency, fast
- weaknesses: less photoreal than Kling
- docs: https://pika.art/api

### hailuo-02
- name: MiniMax Hailuo 02
- vendor: hailuo
- kind: video
- status: beta
- max duration: 10s
- price: ~$0.11/s
- native audio: no
- routing tags: physics, complex-motion
- strengths: physics realism, complex motion, long prompts
- weaknesses: international API region latency
- docs: https://www.minimaxi.com/en/platform

### runway-gen-4.5-direct
- name: Runway Gen-4.5 (direct)
- vendor: runway
- kind: video
- status: beta
- unavailable: Runway direct integration is not yet wired. Use ws-runway-gen4 (WaveSpeed) for Runway Gen-4 today.
- max duration: 16s
- price: ~$0.12/s
- native audio: no
- routing tags: camera-control, video-to-video, motion-brush
- strengths: motion brush, camera control, video-to-video, director-grade editing
- weaknesses: no native audio, queue latency on free tier
- docs: https://docs.dev.runwayml.com/

### elevenlabs-tts
- name: ElevenLabs Multilingual v2
- vendor: elevenlabs
- kind: voice
- status: stable
- max duration: 600s
- price: ~$0.005/s
- native audio: yes
- strengths: 29 languages, emotion control, voice cloning
- weaknesses: streaming latency vs Cartesia
- docs: https://elevenlabs.io/docs/api-reference/text-to-speech

### elevenlabs-tts-v3
- name: ElevenLabs v3 (Alpha)
- vendor: elevenlabs
- kind: voice
- status: beta
- max duration: 600s
- price: ~$0.008/s
- native audio: yes
- strengths: dialogue style, highest expressiveness
- weaknesses: alpha — quality varies
- docs: https://elevenlabs.io/docs/models

### elevenlabs-sts
- name: ElevenLabs Speech-to-Speech
- vendor: elevenlabs
- kind: voice
- status: stable
- max duration: 600s
- price: ~$0.002/s
- native audio: yes
- strengths: voice conversion, preserves delivery/performance
- docs: https://elevenlabs.io/docs/api-reference/speech-to-speech

Voice changer — re-voice a source recording in a target voice. Backs POST /api/v1/speech-to-speech; billed per second of produced audio.

### elevenlabs-dubbing
- name: ElevenLabs Dubbing
- vendor: elevenlabs
- kind: voice
- status: stable
- max duration: 2700s
- price: ~$0.0084/s
- native audio: yes
- strengths: translate + dub to 30+ languages, speaker-aware
- docs: https://elevenlabs.io/docs/api-reference/dubbing

Async dubbing — translate and re-voice existing audio into another language. Backs POST /api/v1/dub; billed per second of produced audio (output ≈ source duration).
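
Because dubbing is billed per second of produced audio, a rough cost estimate is the per-second price times the clip length. The sketch below assumes the listed ~$0.0084/s figure applies as-is. Registry prices are approximate, so treat the result as a ballpark and not a quote; the helper name is hypothetical.

```python
def estimate_cost_cents(price_per_second_usd: float, duration_seconds: float) -> int:
    """Approximate cost in whole cents: per-second price times duration."""
    return round(price_per_second_usd * duration_seconds * 100)


# A 10-minute (600 s) dub at the listed ~$0.0084/s rate.
dub_cents = estimate_cost_cents(0.0084, 600)
print(dub_cents)  # 504, i.e. about $5.04
```

The same arithmetic applies to any per-second entry in the registry, for example an 8-second clip at ~$0.15/s.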
### elevenlabs-scribe
- name: ElevenLabs Scribe (Speech-to-Text)
- vendor: elevenlabs
- kind: voice
- status: stable
- max duration: 7200s
- price: ~$0.0000611/s
- native audio: yes
- strengths: transcription, speaker diarization, word-level timestamps
- docs: https://elevenlabs.io/docs/capabilities/speech-to-text

Speech-to-text + diarization primitive. Backs POST /api/v1/transcribe; billed per second of transcribed audio.

### elevenlabs-music
- name: ElevenLabs Music
- vendor: elevenlabs
- kind: music
- status: beta
- max duration: 300s
- price: ~$0.02/s
- native audio: yes
- strengths: commercially licensed training data, low-latency, fast
- weaknesses: fewer genres than Suno, no separated stems
- docs: https://elevenlabs.io/docs/api-reference/music

### minimax-music
- name: MiniMax Music
- vendor: fal
- kind: music
- status: stable
- unavailable: fal-ai/minimax-music is a music-continuation model — it requires a reference_audio_url (a song with music + vocals), not pure text-to-music. Use Lyria 2 (instrumental) or Suno v4 (vocals) for text-to-music.
- max duration: 60s
- price: ~$0.01/s
- native audio: yes
- strengths: cheap, API-first, solid quality
- weaknesses: requires a reference song, not pure text-to-music
- docs: https://fal.ai/models/fal-ai/minimax-music

### lyria-2
- name: Google Lyria 2
- vendor: fal
- kind: music
- status: beta
- max duration: 120s
- price: ~$0.015/s
- native audio: yes
- strengths: instrumental detail, long-form composition
- weaknesses: no vocals
- docs: https://fal.ai/models/fal-ai/lyria2

### flux-1-schnell
- name: FLUX.1 [schnell]
- vendor: fal
- kind: image
- status: stable
- max duration: 0s
- price: ~$0.003/s
- native audio: no
- strengths: fast (1–3s), good prompt adherence, low cost
- weaknesses: no negative prompts, limited style control
- docs: https://fal.ai/models/fal-ai/flux/schnell

### flux-1.1-pro
- name: FLUX 1.1 Pro
- vendor: replicate
- kind: image
- status: stable
- max duration: 0s
- price: ~$0.04/s
- native audio: no
- strengths: highest-quality Flux, strong prompt adherence, fine detail
- weaknesses: ~5–10s per image, more expensive than schnell
- docs: https://replicate.com/black-forest-labs/flux-1.1-pro

### aurora
- name: Aurora
- vendor: xai
- kind: image
- status: stable
- unavailable: xAI Aurora is not yet wired with a platform key. Add your own xAI key via BYOK to enable, or use flux-1.1-pro / nano-banana / imagen-4 today.
- max duration: 0s
- price: ~$0.07/s
- native audio: no
- strengths: photorealism, native aspect ratio support, creative compositions, fast
- weaknesses: newer API — model catalog still growing
- docs: https://docs.x.ai/docs/guides/image-generations

### imagen-4
- name: Imagen 4
- vendor: google
- kind: image
- status: beta
- max duration: 0s
- price: ~$0.04/s
- native audio: no
- strengths: photorealism, fine detail, accurate anatomy, scene coherence
- weaknesses: slower than Imagen 4 Fast (~10s), base64 only — no hosted URL
- docs: https://ai.google.dev/gemini-api/docs/imagen

### imagen-4-fast
- name: Imagen 4 Fast
- vendor: google
- kind: image
- status: beta
- max duration: 0s
- price: ~$0.02/s
- native audio: no
- routing tags: fast
- strengths: fast (~3–5s), photorealism, good for drafts + iterations
- weaknesses: slightly lower detail than Imagen 4
- docs: https://ai.google.dev/gemini-api/docs/imagen

### ideogram-v3
- name: Ideogram V3
- vendor: fal
- kind: image
- status: stable
- max duration: 0s
- price: ~$0.04/s
- native audio: no
- routing tags: typography
- strengths: best text-in-image, legible typography, poster / cover art
- weaknesses: aspect ratios limited to fixed presets
- docs: https://fal.ai/models/fal-ai/ideogram/v3

### recraft-v3
- name: Recraft V3
- vendor: fal
- kind: image
- status: stable
- max duration: 0s
- price: ~$0.04/s
- native audio: no
- strengths: long detailed prompts, brand-consistent style, vector-style output
- weaknesses: slower than Flux Schnell (~10–20s)
- docs: https://fal.ai/models/fal-ai/recraft-v3

### dall-e-3
- name: DALL-E 3
- vendor: openai
- kind: image
- status: stable
- max duration: 0s
- price: ~$0.04/s
- native audio: no
- routing tags: typography, creative
- strengths: strong text-in-image, creative compositions, photorealism
- weaknesses: square-leaning aspect ratios, rate-limited per OpenAI plan
- docs: https://platform.openai.com/docs/api-reference/images

### nano-banana
- name: Nano Banana
- vendor: replicate
- kind: image
- status: beta
- max duration: 0s
- price: ~$0.005/s
- native audio: no
- strengths: very fast, good identity preservation, cheap, image editing
- weaknesses: less photoreal than Flux Pro, occasional limb glitches
- docs: https://replicate.com/google/nano-banana

### suno-v4
- name: Suno v4
- vendor: muapi
- kind: music
- status: beta
- max duration: 240s
- price: ~$0.025/s
- native audio: yes
- strengths: best vocal generation, genre + mood tags, full song structure
- weaknesses: review licensing before commercial use
- docs: https://muapi.ai/docs/music-and-speech

Full vocal music generation via muapi — no provider key required, billed in Varosity Credits.

### muapi-flux-dev
- name: Flux Dev
- vendor: muapi
- kind: image
- status: stable
- max duration: 0s
- price: ~$0.015/s
- native audio: no
- strengths: 12B parameter model, strong prompt adherence, fast guided distillation
- weaknesses: slower than Flux Schnell, review licensing for commercial use
- docs: https://muapi.ai/docs/flux-dev

12B rectified flow transformer — higher quality than Schnell, faster than Flux Pro. No provider key needed.

### muapi-wan-effects
- name: WAN Video Effects
- vendor: muapi
- kind: video
- status: stable
- max duration: 10s
- price: ~$0.06/s
- native audio: no
- strengths: named effect catalog (Cakeify, Squish, VHS, Samurai…), frame consistency, platform-funded
- weaknesses: short clips only (≤10s), artistic use — not photorealistic
- docs: https://muapi.ai/docs/ai-video-effects

Apply Cakeify, VHS, Samurai, Film Noir and 20+ other AI effects to images. Billed in Varosity Credits.
### muapi-latentsync
- name: LatentSync Lip-Sync
- vendor: muapi
- kind: video
- status: stable
- max duration: 60s
- price: ~$0.05/s
- native audio: yes
- routing tags: lip-sync, video-to-video, budget
- strengths: smooth temporal consistency, fast inference, any video + audio
- weaknesses: needs pre-existing video, no scene generation
- docs: https://muapi.ai/docs/music-and-speech

Sync lip movements to any audio track on an existing video. Billed in Varosity Credits.

### muapi-wan-t2v
- name: WAN 2.1 Text-to-Video
- vendor: muapi
- kind: video
- status: stable
- max duration: 10s
- price: ~$0.03/s
- native audio: no
- strengths: platform-funded (no BYOK), up to 720p / high quality, reliable fallback
- weaknesses: $0.30/video flat rate, 5 or 10s clips only
- docs: https://muapi.ai/docs

WAN 2.1 text-to-video via muapi — platform-funded fallback when WaveSpeed is unavailable. No provider key needed.

### sd3.5-medium
- name: SD3.5 Medium
- vendor: stability
- kind: image
- status: stable
- unavailable: Stability SD3.5 is not yet wired with a platform key. Add your own Stability key via BYOK to enable, or use flux-1-schnell / nano-banana today.
- max duration: 0s
- price: ~$0.035/s
- native audio: no
- strengths: fast (3–5s), good prompt adherence, lower cost than Large
- weaknesses: lower detail ceiling than SD3.5 Large
- docs: https://platform.stability.ai/docs/api-reference#tag/Generate

### sd3.5-large
- name: SD3.5 Large
- vendor: stability
- kind: image
- status: stable
- unavailable: Stability SD3.5 is not yet wired with a platform key. Add your own Stability key via BYOK to enable, or use flux-1.1-pro / imagen-4 today.
- max duration: 0s
- price: ~$0.065/s
- native audio: no
- strengths: highest Stability quality, fine detail, photorealism
- weaknesses: slower than Medium (~8–15s)
- docs: https://platform.stability.ai/docs/api-reference#tag/Generate

### openai-tts-1
- name: OpenAI TTS-1
- vendor: openai
- kind: voice
- status: stable
- max duration: 600s
- price: ~$0.004/s
- native audio: yes
- strengths: fast, 6 voices, low latency
- weaknesses: slightly lower quality than TTS-1 HD
- docs: https://platform.openai.com/docs/api-reference/audio/createSpeech

### openai-tts-1-hd
- name: OpenAI TTS-1 HD
- vendor: openai
- kind: voice
- status: stable
- max duration: 600s
- price: ~$0.008/s
- native audio: yes
- strengths: highest OpenAI voice quality, 6 voices, natural prosody
- weaknesses: 2× cost of TTS-1
- docs: https://platform.openai.com/docs/api-reference/audio/createSpeech

### deepgram-aura-2
- name: Deepgram Aura 2
- vendor: deepgram
- kind: voice
- status: stable
- max duration: 600s
- price: ~$0.002/s
- native audio: yes
- strengths: ultra-low latency, natural prosody, cheap ($0.030/1K chars)
- weaknesses: English-only in Aura 2, fewer voice options than ElevenLabs
- docs: https://developers.deepgram.com/docs/tts-rest

### cartesia-sonic-2
- name: Cartesia Sonic 2
- vendor: cartesia
- kind: voice
- status: stable
- max duration: 600s
- price: ~$0.006/s
- native audio: yes
- strengths: ultra-low latency (~90ms), natural prosody, large public voice library
- weaknesses: voice ids are library-specific — call list_voices
- docs: https://docs.cartesia.ai/api-reference

### fish-audio-tts
- name: Fish Audio
- vendor: fish-audio
- kind: voice
- status: stable
- max duration: 600s
- price: ~$0.004/s
- native audio: yes
- strengths: multilingual, cheap, large community voice library
- weaknesses: voice ids are library-specific — call list_voices
- docs: https://docs.fish.audio/

### heygen-avatar-4
- name: HeyGen Avatar (Digital Twin)
- vendor: heygen
- kind: video
- status: stable
- max duration: 300s
- price: ~$0.08/s
- native audio: yes
- routing tags: talking-avatar, lip-sync, premium
- strengths: studio-grade lip sync, Avatar IV / Avatar V motion engines, voice emotion + speed control, captions & custom backgrounds
- weaknesses: requires a pre-trained avatar look id, fixed framing
- docs: https://developers.heygen.com/generate-avatar-video

HeyGen's flagship talking-avatar (Digital Twin) on the v3 API — the highest-quality lip-sync available. Drives a pre-trained avatar look with a script (or your own audio), with the Avatar IV motion engine, voice emotion/speed/pitch, captions, and custom backgrounds.

### heygen-photo-avatar
- name: HeyGen Photo Avatar (Avatar IV)
- vendor: heygen
- kind: video
- status: stable
- max duration: 300s
- price: ~$0.08/s
- native audio: yes
- routing tags: talking-avatar, lip-sync, photo-to-video
- strengths: animate ANY photo as the speaker, Avatar IV motion engine, motion prompt + expressiveness control
- weaknesses: needs a clear front-facing photo, less consistent than a trained Digital Twin
- docs: https://developers.heygen.com/photo-avatar

Turn a single photo into a talking presenter with HeyGen's Avatar IV engine. Pass a photo URL + script and HeyGen animates the face with natural motion — no pre-training required. Control motion via a prompt and expressiveness level.
### heygen-cinematic
- name: HeyGen Cinematic Avatar
- vendor: heygen
- kind: video
- status: beta
- max duration: 15s
- price: ~$0.1/s
- native audio: yes
- routing tags: talking-avatar, cinematic
- strengths: prompt-driven cinematic shots, blends 1–3 avatar looks into a scene, reference videos/images for style
- weaknesses: 4–15s per clip, less control than a scripted avatar video
- docs: https://developers.heygen.com/cinematic-avatar

Prompt-driven cinematic shots featuring your avatar — describe a scene (camera, setting, action) and HeyGen renders a documentary-style clip from 1–3 avatar looks, optionally guided by reference clips/images.

### heygen-video-agent
- name: HeyGen Video Agent
- vendor: heygen
- kind: video
- status: beta
- max duration: 600s
- price: ~$0.12/s
- native audio: yes
- routing tags: talking-avatar, agent
- strengths: prompt → finished video, agent writes script, picks avatar & scenes, accepts reference files
- weaknesses: least granular control, longer render
- docs: https://developers.heygen.com/docs/video-agent

HeyGen's flagship agent: give it a prompt and it produces a complete video end-to-end — writing the script, choosing the avatar, and composing scenes. Attach reference files (images, docs) to ground the output.

### heygen-video-translate
- name: HeyGen Video Translate
- vendor: heygen
- kind: video
- status: stable
- max duration: 600s
- price: ~$0.1/s
- native audio: yes
- routing tags: talking-avatar, lip-sync, translation, video-to-video
- strengths: multilingual lip-sync dubbing, preserves original speaker appearance, supports 40+ languages, speed or precision mode
- weaknesses: requires source video with clear speech, slower than avatar generation
- docs: https://developers.heygen.com/docs/video-translate

Translate and dub an existing video into any of 40+ languages with perfectly lip-synced audio. Ideal for brands producing multilingual content from a single master video.
Speed or precision mode, optional captions and speech enhancement.

### d-id-talks
- name: D-ID AI Presenter
- vendor: d-id
- kind: video
- status: stable
- max duration: 300s
- price: ~$0.05/s
- native audio: yes
- routing tags: talking-avatar, budget
- strengths: talking avatars from any photo, text-to-presenter, fast render
- weaknesses: fixed framing, requires source image for best results
- docs: https://docs.d-id.com/reference/createtalk

### hunyuan-video
- name: Hunyuan Video
- vendor: fal
- kind: video
- status: stable
- max duration: 10s
- price: ~$0.09/s
- native audio: no
- routing tags: physics, cinematic
- strengths: open-source quality, long coherent motion, strong physics
- weaknesses: slow render time (60–120s), limited camera controls
- docs: https://fal.ai/models/fal-ai/hunyuanvideo

### ltx-video
- name: LTX Video
- vendor: fal
- kind: video
- status: stable
- max duration: 5s
- price: ~$0.04/s
- native audio: no
- routing tags: fast
- strengths: fastest open video model (<5s), image-to-video, good for iteration
- weaknesses: lower detail than Kling/Veo, shorter max duration
- docs: https://fal.ai/models/fal-ai/ltx-video

### ws-luma-ray-2
- name: Luma Ray 2
- vendor: wavespeed
- kind: video
- status: stable
- max duration: 10s
- price: ~$0.08/s
- native audio: no
- strengths: fluid motion, cinematic quality, strong prompt adherence
- weaknesses: higher cost than Pika
- docs: https://wavespeed.ai/docs/docs-api/luma/luma-ray-2-t2v

Luma Ray 2 text-to-video via WaveSpeed. Fluid, cinematic motion. Billed in Varosity Credits.

### ws-pika-2.2
- name: Pika 2.2
- vendor: wavespeed
- kind: video
- status: stable
- max duration: 10s
- price: ~$0.04/s
- native audio: no
- strengths: fast generation, stylized output, good character consistency
- weaknesses: shorter max duration
- docs: https://wavespeed.ai/docs/docs-api/pika/pika-v2.2-t2v

Pika 2.2 text-to-video via WaveSpeed. Fast and stylized. Billed in Varosity Credits.
### ws-hailuo-02
- name: Hailuo 02
- vendor: wavespeed
- kind: video
- status: stable
- max duration: 6s
- price: ~$0.08/s
- native audio: no
- routing tags: physics, complex-motion
- strengths: physics realism, complex motion, high resolution
- weaknesses: fixed 6s duration
- docs: https://wavespeed.ai/docs/docs-api/minimax/minimax-hailuo-02-pro

Hailuo 02 Pro via WaveSpeed. Best-in-class physics and complex motion. Billed in Varosity Credits.

### ws-runway-gen4
- name: Runway Gen 4
- vendor: wavespeed
- kind: video
- status: stable
- max duration: 10s
- price: ~$0.01/s
- native audio: no
- routing tags: camera-control, video-to-video
- strengths: camera control, cinematic motion brush, video-to-video
- weaknesses: requires reference image for best results
- docs: https://wavespeed.ai/docs/docs-api/runwayml/runwayml-gen4-turbo

Runway Gen 4 Turbo via WaveSpeed. Precise camera control and motion brush. Billed in Varosity Credits.

## Guides

### 5-minute first render

> Sign in, add a key, render a clip — end-to-end in five minutes.

# 5-minute first render

You'll spend most of this on getting a fal.ai key. Once that lands, the actual render is one prompt away.

## 1. Get a fal.ai key (2 min)

1. Open [fal.ai/dashboard/keys](https://fal.ai/dashboard/keys), sign in, click **Create key**. Copy the value (shown once).
2. Add **\$5–\$10** of credit at [fal.ai/dashboard/billing](https://fal.ai/dashboard/billing). That's enough for ~30 short shots.

## 2. Add the key in Varosity (30 sec)

Settings → **Keys** → paste into the **fal.ai** row → **Test** → **Save**.

## 3. Create your first shot (1 min)

Dashboard → **+ New project** → in the empty filmstrip, **+ Add your first shot**. Pick a model (Kling 3.0 Pro is a great default), write a prompt:

> *"Wide cinematic shot of a coastal cliff at golden hour, slow drone push-in,
> ocean spray catching the light"*

Set duration to 5 seconds. Click **Render this shot**.

## 4. Watch it render (~1 min)

The shot card flips to "rendering" with the soft particle field. Polling runs at the studio level, so you can switch to other shots, close the inspector, even switch tabs — the render lands when it lands.

## 5. Done

Click ▶ to play inline. Want to chain another shot? Hit **+ Insert after** between cards. Want a single MP4? **Render final** in the top right.

## What to try next

- [Multi-shot storyboard](/docs/multi-shot-storyboard) — chain different models per shot
- [Picking the right model](/docs/picking-the-right-model) — strengths and weaknesses cheat sheet
- [Voice over your video](/docs/voice-over-your-video) — add narration with ElevenLabs

### Agent mode (Claude / Cursor / MCP)

> Drive Varosity from any MCP host: 11 tools, bearer auth, JSON-RPC over Streamable HTTP.

# Agent mode

Varosity ships an MCP server at `/api/mcp`. Any MCP host (Claude Desktop, Cursor, custom clients via the official SDK) can drive every important capability without a browser.

## Quick setup

1. Issue a token at `/app/keys/api-keys` (label it for the host — "Claude Desktop on macbook").
2. In your MCP host's config, add:

   ```json
   {
     "mcpServers": {
       "varosity": {
         "url": "https://varosity.ai/api/mcp",
         "transport": "streamable-http",
         "headers": { "Authorization": "Bearer vsk_" }
       }
     }
   }
   ```

   For Claude Desktop, the file lives at:
   `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS)

3. Restart the host.
4. The agent now has access to the Varosity tool surface.

## Tools exposed

The MCP surface is **curated**, not a wrapper around the entire API.
11 tools, deliberately:

| Tool             | What it does                                      |
|------------------|---------------------------------------------------|
| `list_models`    | Filter the registry by kind (video/voice/music)   |
| `list_voices`    | Your saved voice library                          |
| `list_projects`  | Recent projects                                   |
| `get_project`    | Single project + shot list                        |
| `create_project` | New empty project                                 |
| `generate_video` | Submit a video render job; returns jobId          |
| `generate_voice` | Synthesize speech via ElevenLabs                  |
| `generate_music` | Generate a music track                            |
| `suggest_model`  | Smart Route — rank models for a shot description  |
| `get_job`        | Status + output URL                               |
| `render_project` | Stitch project to one MP4                         |

## Hermes Skills for Multi-Shot Production

For sophisticated multi-shot workflows, load the **varosity-multi-shot-consistency** Hermes skill:

```json
{
  "skill": "varosity-multi-shot-consistency",
  "applies_to": ["music videos", "product reels", "brand campaigns", "interview sequences"],
  "enforces": "Locked reference image across all shots",
  "guides": "Two-phase workflow: Phase 1 (lock reference) → Phase 2 (generate shots in parallel)"
}
```

See [Multi-shot consistency with locked references](/docs/multi-shot-consistency) for full details.

## Other agent-readable surfaces

- [llms.txt](https://varosity.ai/llms.txt) — site overview
- [llms-full.txt](https://varosity.ai/llms-full.txt) — registry + guides + skills
- [agents.json](https://varosity.ai/.well-known/agents.json) — capability manifest
- [openapi.json](https://varosity.ai/api/openapi.json) — full REST spec

## A typical agent flow

> User: "Make a 15-second product reel for our new espresso machine. 3 shots."

1. Agent calls `suggest_model` for the establishing wide → ranks Kling 3.0.
2. Agent calls `create_project({title: "espresso reel"})` → projectId.
3. Agent calls `generate_video` 3 times across Kling/Veo/Seedance, returns jobIds.
4. Polls `get_job` until each lands.
5. Optional: `generate_music` for backing track, `generate_voice` for VO.
6. `render_project` for the final stitch. 7. Returns the MP4 URL to the user. All of that, no browser. ## Limits + safeguards - **Bearer tokens carry the user's full permissions.** Treat them like passwords. Revoke at `/app/keys/api-keys`. - **Every tool runs through RLS** — an agent with your token can only see/touch your data, never another user's. - **Rate limits** match the underlying providers'. The MCP server adds no synthetic limits in v2; budget caps on `/app/keys` apply. ### Avatar on any background > Composite a talking head on a Veo / Kling / Runway background. # Avatar on any background (the HeyGen-killer) HeyGen ships avatars on a fixed pipeline. Varosity composites a talking head on **any** of the frontier video models. Three modes: ## Mode: Off Ignore the attached avatar even if one is set. Useful when iterating on the BG without reburning OmniHuman credits. ## Mode: Full frame OmniHuman replaces the shot entirely. Good for plain talking-heads. Required: avatar photo + audio clip (or TTS script). ## Mode: Overlay (PiP) The killer feature. 1. Background renders via your chosen text-to-video model (Veo for lip sync if dialogue, Kling for cinematic, Seedance for audio-native…). 2. OmniHuman renders the talking head from a photo + audio in parallel. 3. Once both finish, Varosity runs `ffmpeg.wasm` in your browser to overlay the avatar in the chosen corner at the chosen size. 4. The composite replaces the shot's render_url. Stitches into the final MP4 like any other shot. ## How to set it up 1. Upload a high-resolution front-facing photo at `/avatars`. 2. In the shot inspector, expand **Avatar**. 3. Pick the avatar. 4. Either upload an audio clip OR switch to **From script** and paste text — ElevenLabs synthesizes (BYOK). 5. Switch **Layer mode** to **Overlay**. 6. Choose corner + size in the position picker. 7. Hit **Render this shot**. Two jobs go out; the compositor fires when both land. ## Tips - **Overlay corner** matters more than you think. 
Bottom-right reads as presenter; top-right as branding/sponsorship; top-left as primary caller. Pick deliberately. - **Avatar size 25–30%** is the sweet spot for most shots. Smaller reads as ambient; larger steals focus from the BG. - **Audio length sets shot duration in Full mode.** OmniHuman's output is bound by the audio length; ignore the duration slider. ### Background music + auto-ducking > Add a soundtrack to a project; voice ducks the music automatically. # Background music + auto-ducking ## Pick a model Three options on the `/music` page: | Provider | When to use | |--------------------|----------------------------------------------| | **ElevenLabs Music** | Commercial work — licensed training data. | | MiniMax | Cheap, fast, good for quick comps. | | Lyria 2 | Long-form instrumental composition. | Suno is feature-flagged off by default — there's no public API and the licensing is unclear. ElevenLabs Music is the safe bet for paid work. ## Generate a track `/music` → composer → write a vibe (e.g. *"cinematic synth, slow build, hopeful, sparse percussion"*) → set duration → **Generate**. Sync providers (ElevenLabs) drop the track immediately. Async providers (fal-routed) queue and poll; the track appears in the library when ready. ## Attach to a project Open the project → **Music** panel (top of the inspector when no shot is selected) → pick a track from the library → save. The stitch pipeline mixes the track under your video automatically. ## Auto-ducking When the project has voice/dialogue audio (TTS attached to any shot, or a Voice node), the music **ducks to -12dB** during voice and rides full volume between. Implemented with `ffmpeg`'s `sidechaincompress` filter. Toggle off in the project Music panel if you want flat music. ## Tips - **Match track duration to total shot length.** A 30s reel needs a 30s track — overflow clips at the end of the final stitch. 
- **Genre tags help.** Adding "no vocals" or "instrumental only" in the prompt prevents lyrics that fight your voiceover. - **Test the duck.** Render final once with ducking enabled, once with it off; pick whichever the video calls for. ### BYOK setup per provider > Where to get each provider's key + how to add it to Varosity. # BYOK setup per provider You bring the keys; Varosity routes the calls. Every key is encrypted at rest with AES-256-GCM. Plaintext never leaves the server, never gets logged, never appears in API responses. ## fal.ai (recommended first) Unlocks: Veo 3.1, Kling 3.0 Pro, Seedance 4.5, OmniHuman 1.5, Lyria 2 (instrumental music). 1. [fal.ai/dashboard/keys](https://fal.ai/dashboard/keys) → **Create key**. 2. Add credits at [fal.ai/dashboard/billing](https://fal.ai/dashboard/billing). 3. Settings → Keys → fal.ai row → paste → **Test** → **Save**. ## Replicate Unlocks: Kling 3.0 (alt route), Wan 2.5, Pika 2.2. 1. [replicate.com/account/api-tokens](https://replicate.com/account/api-tokens) → **Create token**. 2. Settings → Keys → Replicate → paste → Save. ## ElevenLabs Unlocks: TTS (every voice in your library), voice cloning, ElevenLabs Music. 1. [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys) → **Create**. 2. Make sure your subscription tier allows API access (Starter and up). 3. Settings → Keys → ElevenLabs → paste → Save. ## OpenAI (Sora 2) Unlocks: Sora 2 — *currently waitlist-only API.* Add the key now; Varosity flips the model from "unavailable" to live the moment OpenAI opens API access. 1. [platform.openai.com/api-keys](https://platform.openai.com/api-keys) → **Create**. 2. Settings → Keys → OpenAI → paste → Save. ## Runway Unlocks: Gen-4.5 direct (motion brush, camera control). 1. [docs.dev.runwayml.com](https://docs.dev.runwayml.com/) → API key in your Runway account settings. 2. Settings → Keys → Runway → paste → Save. 
## Google AI Studio (Veo direct) Unlocks: Veo 3.1 via the Gemini API (lower latency than fal route). 1. [aistudio.google.com/app/apikey](https://aistudio.google.com/app/apikey) → **Create API key**. 2. Settings → Keys → Google → paste → Save. ## Fish Audio Unlocks: cheaper multilingual TTS. 1. [fish.audio/go-api](https://fish.audio/go-api) → API key. 2. Settings → Keys → Fish Audio → paste → Save. ## Cartesia Unlocks: ultra-low-latency TTS for real-time use. 1. [play.cartesia.ai/keys](https://play.cartesia.ai/keys) → **Create**. 2. Settings → Keys → Cartesia → paste → Save. ## Test before you save Every row has a **Test** button. It hits a lightweight endpoint at the provider, surfaces success/failure with the message. Save only after green. ## Rotate / revoke Click the row → **Rotate** prompts for a new key (replaces in place) or **Revoke** deletes it. Either action is immediate. ### Chaining clips across models > Use the right model for each shot type, then stitch them into one MP4. # Chaining clips across models The pitch: no single model is best at everything. Veo 3.1 owns lip-sync, Kling 3.0 owns cinematic motion, Seedance owns audio-native generation, Runway owns camera control. Varosity lets you mix-and-match per shot. ## A simple 4-shot example | Shot | Type | Best model | Why | |------|-----------------|--------------------|--------------------------------| | 01 | Wide establishing | Kling 3.0 Pro | cinematic motion, cheap | | 02 | Talking close-up | Veo 3.1 | best lip-sync, native audio | | 03 | Action beat | Seedance 4.5 | audio-native, multi-shot intent| | 04 | Hero close-up | Runway Gen-4.5 | camera control + motion brush | Set up: 1. Create a project. 2. Click **+ Add your first shot**, pick Kling, write the wide. 3. **+ Insert after**, pick Veo for the dialogue, paste your line. 4. Repeat for shots 03 and 04. 5. Hit **Render this shot + all downstream** to fan out renders. 6. Once everything's done, **Render final** stitches into one MP4. 
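The four-shot table above maps naturally onto one request body per shot. The sketch below only builds those bodies and does not send them. The field names `prompt`, `modelId`, and `durationSec` follow the video example in the provider options guide; the `runway-gen-4.5` id, the prompts, and the durations are placeholders for illustration, so check `GET /v1/models` for the real ids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    label: str
    model_id: str
    prompt: str
    duration_sec: int


# Ids for Kling, Veo, and Seedance appear in the provider options guide.
# "runway-gen-4.5" is a hypothetical id used only for this sketch.
SHOTS = [
    Shot("01 wide establishing", "kling-3.0", "wide shot of a harbor at dawn, slow push-in", 5),
    Shot("02 talking close-up", "veo-3.1", "medium close-up of a captain speaking to camera", 5),
    Shot("03 action beat", "seedance-4.5", "crew hauling a net onto the deck, hand-held", 5),
    Shot("04 hero close-up", "runway-gen-4.5", "close-up of a brass compass, slow dolly in", 5),
]


def build_requests(shots):
    """Return one JSON-ready request body per shot, in storyboard order."""
    return [
        {"prompt": s.prompt, "modelId": s.model_id, "durationSec": s.duration_sec}
        for s in shots
    ]


if __name__ == "__main__":
    for body in build_requests(SHOTS):
        print(body["modelId"], "->", body["prompt"])
```

Keeping the plan as data makes it easy to swap a model for a single shot without touching the others, which is the point of chaining.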
## When you don't know which model to pick Click the **Smart Route** chip in the inspector. It sends your shot description to a routing LLM (OpenRouter / Claude Haiku) which ranks the top 3 candidates with reasoning. The chip pulls exclusively from your configured providers — it'll never recommend a model you can't use. ## Cost discipline The cost line in the inspector estimates per-shot at the registry's `pricePerSec`. The dashboard's Renders tab shows the total spent per final cut. There's no markup — what fal/Runway/etc. bill is what you pay. ## See also - [Picking the right model](/docs/picking-the-right-model) - [Prompting cheatsheet](/docs/prompting-cheatsheet) - [Background music and ducking](/docs/background-music-and-ducking) ### Director Mode > Plan, visualise, and approve video campaigns before a single frame renders. # Director Mode Director Mode is Varosity's pre-production layer. Instead of jumping straight into generation, you give a brand agent a brief, let the Director plan a full shoot (premise, visual style, characters, locations, shots, audio, continuity), and review the result as a storyboard before approving a render. ## Key concepts | Concept | What it is | |---|---| | **Storyboard** | A document that holds the brief, the Director's 8-section plan, and the visual output (grid or keyframes). | | **Grid mode** | One composite image showing all shots on a single sheet. Fast to generate; good for composition and layout review. | | **Keyframes mode** | One still per shot. Slower but gives you the exact frame that will be used as the seed for video generation. | | **Gate** | The approval step. You approve/revise/reject from the Storyboard Canvas or the Operator iOS app. | ## Workflow ``` Brief → Director plan → Grid or Keyframes → Approve → Render → Studio ``` 1. **Create a brand agent** at `/app/agents/new`. Set the identity, voice, visual system, and cadence. 2. 
**Open a project** and click the **Storyboard** tab in the top bar (or the Director CTA if you're starting from an empty project).
3. **Write a brief** — one or two sentences is enough. The Director fills in the rest.
4. **Review the plan** — expand any of the 8 sections (Premise, Visual Style, Characters, Locations, Shots, Audio, Continuity, Mode Recommendation) and edit directly.
5. **Generate a visual** — click **Generate grid** or switch to Keyframes mode and generate per-shot stills.
6. **Long-press a keyframe** on iOS (or hover + click the ↺ button on web) to regenerate just that shot.
7. **Approve** from the action row. The Director materialises the plan into a Studio project and kicks off generation.

## Brand agent setup

Director Mode is linked to a brand agent. The agent provides:

- **Identity** — name, slug, tagline, audience, mission. The Director uses these as background context for every plan.
- **Visual system** — default image model, aesthetic descriptors. These flow into keyframe generation prompts.
- **Cadence & approval policy** — how often the agent runs autonomously and whether a human gate is required.
- **Publishing** — Postiz endpoint and token for automated posting after approval.

## REST API

All storyboard operations are available via the REST API. Authenticate with a `vsk_*` bearer token.

```bash
# Create a storyboard
curl -X POST https://app.varosity.com/api/v1/storyboards \
  -H "Authorization: Bearer vsk_..." \
  -H "Content-Type: application/json" \
  -d '{"brand_id": "...", "brief": {"prompt": "Summer product launch"}}'

# Generate keyframes
curl -X POST https://app.varosity.com/api/v1/storyboards/{id}/keyframes \
  -H "Authorization: Bearer vsk_..."

# Approve + render
curl -X POST https://app.varosity.com/api/v1/storyboards/{id}/render \
  -H "Authorization: Bearer vsk_..."
```

See the full [OpenAPI spec](/api/openapi.json) or the [Agent Guide](/agent-guide) for MCP-based workflows.
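For agents that do not shell out to curl, the same three calls can be prepared with the Python standard library. This is a minimal sketch that builds the requests without sending them; the helper names are made up for illustration, and the response shape of the create call is not documented in this guide, so reading the storyboard id from it is left to the caller (see the OpenAPI spec).

```python
import json
import urllib.request

# Same host and paths as the curl examples in this section.
BASE_URL = "https://app.varosity.com/api/v1"


def _post(path, api_key, body=None):
    """Build (but do not send) an authenticated POST request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(f"{BASE_URL}{path}", data=data, method="POST")
    req.add_header("Authorization", f"Bearer {api_key}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def create_storyboard_request(api_key, brand_id, prompt):
    """Request that creates a storyboard from a one-line brief."""
    return _post("/storyboards", api_key, {"brand_id": brand_id, "brief": {"prompt": prompt}})


def keyframes_request(api_key, storyboard_id):
    """Request that generates keyframes for an existing storyboard."""
    return _post(f"/storyboards/{storyboard_id}/keyframes", api_key)


def render_request(api_key, storyboard_id):
    """Request that approves the storyboard and starts the render."""
    return _post(f"/storyboards/{storyboard_id}/render", api_key)
```

Each returned object can be passed to `urllib.request.urlopen` when the agent is ready to make the call.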
## Operator iOS The Operator app shows storyboards under each brand. Tap a storyboard to see the grid or keyframes strip. Long-press any keyframe to queue a regeneration. Approve, revise, or reject from the detail screen — the same gate decision hits the REST API with an idempotency key so offline taps don't double-apply. ## Tier requirements Director Mode is available on all plans during the early-access period. It is gated behind the `ff_v4_director` feature flag. Dogfood users are opted in via the `V4_DOGFOOD_USER_IDS` environment variable; everyone else gets access when the flag flips globally at the v4.0 release. ## Tips - **Mode Recommendation** is auto-populated by the Director's heuristic router. Grid is suggested for lifestyle/product campaigns; Keyframes for anything with recurring talent, day-to-night progressions, or cinematic continuity. - If a keyframe prompt isn't landing, expand the **Shots** section and edit the `keyframe_hint` field for that shot before regenerating. - You can save a storyboard as a template (Action row → Save as template) to reuse the plan structure across future campaigns. ### Managing renders + costs > Where renders live, how to download, and how to keep BYOK costs under control. # Managing renders + costs Every stitched MP4 lives at `/renders` with a download link, the duration, and the per-render BYOK cost in dollars. There's no markup — what fal/ElevenLabs/etc. bill is what you see. ## Per-render cost Each shot's `cost_cents` is stamped at completion based on the registry's `pricePerSec` × duration. The `/renders` table sums them per project for the total. > Heads-up: provider pricing changes. The registry is updated in > migrations; if your in-flight bill diverges from what we display, > the provider's number wins. ## Per-month cost cap Settings → Keys → each provider row has a **Budget cap** in dollars. When the month-to-date spend hits the cap, all further generations for that provider 402 with `BudgetExceededError`. 
Reset on the 1st. Set it to whatever you can lose without thinking. Default: $0 (no cap). Recommended: $25 to start. ## Cleanup Renders are kept indefinitely. Delete from the table when you don't need them. Storage charges are negligible on the BYOK plan; we don't charge for storage in v2. ## Why a render failed The shot card shows the provider's error message inline (e.g. `fal balance exhausted`, `Invalid prompt: must be < 2000 chars`). For deeper failures, check the inspector's error banner — it surfaces the full provider response. ### Multi-shot consistency with locked references > Enforce visual continuity across multi-shot video sequences. Lock one reference image, vary model and prompt per shot. # Multi-shot consistency with locked references When producing multi-shot video sequences with Varosity, **visual consistency across shots is critical**. Without a locked reference image, each shot will diverge visually, creating jarring cuts and broken narrative flow. This guide enforces a mandatory two-phase workflow: 1. **Reference Image Selection** — you explicitly pick a locked reference from candidates 2. **Reference Binding** — all shots use the same reference URL; only model and prompt vary per shot ## When This Matters - **Music videos** — same performer, multiple angles - **Product reels** — same product, different lighting/framing - **Brand content** — same subject, multiple perspectives - **Interview sequences** — same person, varied shot types - **Travel montages** — same location context, varied framing - **Any production where visual continuity is non-negotiable** ## Why Not Auto-Select? Because consistency is a creative decision. Auto-selection removes your judgment about the aesthetic baseline. This workflow gates that choice explicitly — you lock the reference, not the algorithm. --- ## Phase 1: Reference Image Selection **Goal:** Lock down ONE reference image URL that will apply to all shots in the production. ### Steps 1. 
**You provide a reference or we generate candidates** - If you supply a reference image directly → we lock it immediately - If not → proceed to step 2 2. **Generate candidate reference images** - Varosity generates 3–4 reference options - Each shows your subject under different lighting/aesthetics - Examples: "bright stage", "moody amber", "neutral", "cinematic" 3. **You select the reference** - View all candidates side by side - Pick the ONE that locks the visual tone for your entire video - We **do not auto-select** — your choice matters 4. **Reference is locked** - Your selection gets stored with a timestamp - Tagged as "LOCKED_REFERENCE" - This URL is now immutable for this production ### Example Phase 1 ``` You: "Create a 4-shot country music performance video. Same performer, different angles." Varosity: "Let me generate reference options for your performer aesthetic." [Displays 4 reference images] Option 1: Bright, well-lit stage (clinical feel) Option 2: Moody amber lighting (intimate feel) ← Your choice Option 3: Neutral professional backdrop Option 4: Dramatic side-lit (artistic feel) You: "Option 2 — the moody amber lighting feels right." Varosity: "✓ Locked reference: Option 2. All shots will now use this reference. Moving to Phase 2: generating individual shots." ``` --- ## Phase 2: Locked Reference Binding **Goal:** Generate all shots with the same `reference_image_url`, varying only model and prompt. **Immutable:** `reference_image_url` (locked from Phase 1) **Variable:** `model`, `prompt`, `shot_type` ### For Each Shot 1. **Varosity suggests the optimal model** for your shot description - "Wide establishing shot" → Kling 3.0 (environmental context) - "Close-up detail" → Veo 3.1 (fine detail + motion) - "Micro-movements" → Seedance 4.5 (smooth, fluid motion) 2. **You confirm the model** (or we auto-pick the top recommendation) 3. 
**We generate the shot** with: - Your shot-specific prompt (angle, framing, action) - The LOCKED reference from Phase 1 - The chosen model 4. **All shots submit in parallel** - Don't wait for shot 1 to finish before submitting shot 2 - Faster turnaround ### Example Phase 2: Country Music Video ``` Shot 1: Wide establishing (kling-3.0) Prompt: "wide stage view of country performer at microphone, intimate venue, moody amber lighting" Reference: [LOCKED from Phase 1] Shot 2: Close-up face during chorus (veo-3.1) Prompt: "close-up face during passionate chorus, warm lighting, emotional expression" Reference: [LOCKED from Phase 1] Shot 3: Hands on instrument (seedance-4.5) Prompt: "detailed hands on guitar strings, fingerstyle technique, intimate lighting" Reference: [LOCKED from Phase 1] Shot 4: Performer + crowd (kling-3.0) Prompt: "performer and front row audience engaged, warm moody venue lighting" Reference: [LOCKED from Phase 1] All 4 shots submitted in parallel. All use the same reference. Only model and prompt vary. Result: Visually consistent, performance-optimized video. ``` --- ## Why This Matters: Visual Continuity **Without locked references:** - Shot 1 rendered by Veo might show moody amber lighting - Shot 2 rendered by Kling might show bright cool lighting - Shot 3 rendered by Seedance might show neutral tones - Result: jarring cuts, broken narrative flow, viewer disorientation **With locked references:** - All 4 shots reference the same aesthetic baseline - Models can optimize for detail/motion *within* that constraint - Result: unified look, professional feel, narrative coherence --- ## Reference Image FAQ ### "Can I change the reference mid-production?" No. Changing the reference mid-production breaks continuity. If you want a different aesthetic, start a new production with a new locked reference. ### "What if I don't like any of the reference candidates?" Request regeneration. 
Varosity will generate 3–4 new options with different prompts or variations. Keep iterating until you find one that resonates.

### "Can multiple productions share the same reference?"

Yes, but it is not recommended. Each production should lock its own reference to ensure full control over aesthetics and avoid unexpected cross-production drift.

### "What happens if the reference image expires?"

Varosity maintains reference URLs. If a reference does expire or fail, you'll see an error and can regenerate a new set of candidates and re-lock.

### "Do all shots have to match the reference exactly?"

No. The reference *guides* consistency; it does not enforce pixel-perfect matching. Models still vary by prompt — a close-up will look different from a wide shot. The reference ensures they feel like they're in the same visual world.

---

## Common Pitfalls

### Pitfall 1: Not Blocking Until Selection

**Mistake:** Agent auto-selects the "best" reference without waiting for your input.

**Fix:** Always wait for explicit user confirmation before locking. Consistency is a creative choice.

### Pitfall 2: Reference Mismatch in Late Shots

**Mistake:** Shot 3 accidentally uses a different reference URL than Shots 1–2.

**Fix:** Verify all shots list the same reference before submission. Most agents show you the locked reference URL for confirmation.

### Pitfall 3: Sequential Generation Instead of Parallel

**Mistake:** Agent waits for Shot 1 to complete before submitting Shot 2.

**Fix:** All shots should submit in parallel for faster turnaround. Reference locking enables this.

### Pitfall 4: Identical Prompts for All Shots

**Mistake:** All shots use the same prompt (e.g., "wide stage view").

**Fix:** Vary the prompt per shot (wide → close-up → detail) while keeping the reference locked. Variety in framing + consistency in aesthetic.

### Pitfall 5: Forgetting to Verify Job Status

**Mistake:** Agent assumes all shots succeeded and stitches without checking.
**Fix:** Poll job status for each shot before final render. One failed shot breaks the sequence. --- ## Workflow Guardrails | Rule | Status | Violation | Fix | |------|--------|-----------|-----| | Lock reference in Phase 1 | MUST | Attempt to generate without Phase 1 lock | Agent FAILS. Repeats Phase 1. | | Use same reference URL for all shots | MUST | Any shot uses different URL | Agent FAILS. Regenerates with correct reference. | | Block until user selects | MUST | Auto-select without confirmation | Agent FAILS. Prompts for explicit selection. | | Suggest model per shot | SHOULD | Generate without model suggestion | Agent WARNS. Consider calling suggest_model() for optimization. | | Vary prompts per shot | SHOULD | All shots use identical prompt | Agent WARNS. May reduce visual variety. | | Generate in parallel | SHOULD | Sequential generation | Agent WARNS. Parallel generation is faster. | | Poll job status | SHOULD | Assume success without checking | Agent WARNS. Some shots may have failed silently. | --- ## Integration: Using This Skill with Your Agent ### If You're Using Claude Desktop The `varosity-multi-shot-consistency` Hermes skill is available. Ask your agent: > "Create a 3-shot product reel using the multi-shot consistency skill. Lock a reference first." 
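Two of the rules in the Workflow Guardrails table can be checked mechanically before any shot is submitted: every shot must carry the locked reference (MUST), and prompts should differ between shots (SHOULD). The sketch below is a hypothetical pre-flight check, not part of the skill itself. It assumes each shot is a plain dict using the REST field names this guide lists (`reference_image_url`, `prompt`, `model`).

```python
def check_guardrails(shots, locked_reference_url):
    """Return (errors, warnings) for a list of shot dicts.

    Errors correspond to MUST rules, warnings to SHOULD rules.
    """
    errors, warnings = [], []

    # MUST: Phase 1 has to produce a locked reference before generation.
    if not locked_reference_url:
        errors.append("no locked reference: complete Phase 1 first")

    # MUST: every shot uses the same locked reference URL.
    for i, shot in enumerate(shots, start=1):
        if shot.get("reference_image_url") != locked_reference_url:
            errors.append(f"shot {i} does not use the locked reference")

    # SHOULD: prompts vary per shot (wide, close-up, detail, ...).
    prompts = [shot.get("prompt") for shot in shots]
    if len(shots) > 1 and len(set(prompts)) == 1:
        warnings.append("all shots share one prompt; vary framing per shot")

    return errors, warnings
```

An agent would refuse to submit while `errors` is non-empty and surface `warnings` to the user without blocking.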
### If You're Using MCP Directly The skill orchestrates these Varosity MCP tools: - `mcp_varosity_pick_reference_images` — Phase 1: candidate generation - `mcp_varosity_suggest_model` — Phase 2: per-shot model ranking - `mcp_varosity_generate_video` — Phase 2: shot generation with locked reference - `mcp_varosity_get_job` — Poll job status - `mcp_varosity_render_project` — Final stitch ### If You're Using the REST API Call `/v1/video/generate` with: - `reference_image_url`: your locked reference (same for all shots) - `prompt`: shot-specific prompt - `model`: chosen model --- ## Next Steps - **See the full Hermes skill:** [varosity-multi-shot-consistency on GitHub](https://github.com/nous-research/hermes-skills/tree/main/mlops/varosity-multi-shot-consistency) - **Try an example:** Ask your agent to "Create a 4-shot music video using multi-shot consistency" - **Read more guides:** [Agent mode for MCP integration](/docs/agent-mode) · [Picking the right model](/docs/picking-the-right-model) ### Picking the right model > Strengths and weaknesses cheat sheet across every video, voice, and music model. # Picking the right model This is also rendered live in the **Smart Route** chip in the inspector based on your shot description. The cheat sheet below is the static map. ## Video models ### Veo 3.1 - **Pick when**: dialogue close-ups, lip-sync matters, you want native audio. - **Skip when**: you need camera control or video-to-video. - **Cost**: ~\$0.15/s. ### Kling 3.0 Pro - **Pick when**: cinematic motion, multi-shot consistency, mid-budget cinematic action. - **Skip when**: lip-sync matters more than motion. - **Cost**: ~\$0.10/s direct, ~\$0.09/s via Replicate. ### Seedance 4.5 - **Pick when**: you want the audio + video generated together, multi-shot from one prompt. - **Skip when**: you need fine motion control. - **Cost**: ~\$0.14/s. ### Runway Gen-4.5 - **Pick when**: motion brush, camera path control, video-to-video editing. 
- **Skip when**: you want native audio (Runway is video-only). - **Cost**: ~\$0.12/s. ### Wan 2.5 (Replicate) - **Pick when**: realistic motion (product shots), fast turnaround. - **Skip when**: you want stylized cinematography. - **Cost**: ~\$0.07/s. ### Pika 2.2 (Replicate) - **Pick when**: creative transitions, ingredient blends, stylized motion. - **Skip when**: you need photoreal output. - **Cost**: ~\$0.08/s. ### OmniHuman 1.5 - **Pick when**: talking-head avatar from a single photo. - **Skip when**: you don't have an audio clip. - **Cost**: ~\$0.08/s. ### Sora 2 - **Pick when**: most cinematic shot, complex physics — *if you have access*. - **Skip when**: API is waitlist-only (still as of 2026-04). ## Voice models > Call `list_voices` for the exact voice ids per provider — voice ids are > provider-specific (Deepgram wants `aura-2-thalia-en`, OpenAI wants `nova`, > etc.). Passing the wrong scheme falls back to that provider's default voice. ### ElevenLabs Multilingual v2 - **Pick when**: highest-quality voiceover, voice cloning needed. - **Cost**: ~\$0.005/s. ### ElevenLabs v3 (Alpha) - **Pick when**: dialogue / two-speaker scripts, max expressiveness. - **Skip when**: alpha quality variance is a deal-breaker. ### OpenAI TTS-1 / TTS-1 HD - **Pick when**: fast, reliable narration; six built-in voices (alloy, nova, …). - **Cost**: ~\$0.004/s (TTS-1), ~\$0.008/s (HD). ### Deepgram Aura 2 - **Pick when**: lowest latency and cheapest; conversational English voices. - **Cost**: ~\$0.002/s. ### Cartesia Sonic 2 - **Pick when**: ultra-low latency with a large public voice library. - **Cost**: ~\$0.006/s. ## Music models ### ElevenLabs Music - **Pick when**: commercial work, licensed audio required. - **Cost**: ~\$0.02/s. ### Suno v4 - **Pick when**: full songs with vocals; genre/mood tags. - **Cost**: ~\$0.025/s. ### Lyria 2 - **Pick when**: instrumental composition, longer-form tracks. - **Cost**: ~\$0.015/s. 
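The per-second prices in this cheat sheet are approximate, but they are enough for a rough budget before rendering. A minimal sketch, assuming cost is simply price per second times duration (the same rule the renders guide describes for `cost_cents`); the function name and the sample reel are illustrative, not part of Varosity.

```python
from decimal import Decimal

# Approximate USD per second, copied from the cheat sheet.
PRICE_PER_SEC = {
    "Veo 3.1": Decimal("0.15"),
    "Kling 3.0 Pro": Decimal("0.10"),
    "Seedance 4.5": Decimal("0.14"),
    "Runway Gen-4.5": Decimal("0.12"),
    "ElevenLabs Multilingual v2": Decimal("0.005"),
    "ElevenLabs Music": Decimal("0.02"),
    "Lyria 2": Decimal("0.015"),
}


def estimate_usd(items):
    """Sum price-per-second x duration over (model_name, seconds) pairs."""
    return sum(
        (PRICE_PER_SEC[name] * seconds for name, seconds in items),
        Decimal("0"),
    )


# Four 5-second shots plus a 20-second backing track.
REEL = [
    ("Kling 3.0 Pro", 5),
    ("Veo 3.1", 5),
    ("Seedance 4.5", 5),
    ("Runway Gen-4.5", 5),
    ("ElevenLabs Music", 20),
]

if __name__ == "__main__":
    print(estimate_usd(REEL))  # prints 2.95
```

`Decimal` keeps the sum exact, which matters when many small per-second charges are added together.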
> **Note:** MiniMax Music is *not* a text-to-music model — it's a continuation > model that requires a reference song. For text-to-music use Suno (vocals) or > Lyria 2 (instrumental). ## Use the Smart Route In doubt? Open the inspector → **Smart Route** chip → describe the shot in plain English → it ranks the top 3 with reasoning. Pulls only from your configured providers. ### Prompting cheatsheet > Camera, lighting, and motion language that lands consistently across models. # Prompting cheatsheet Frontier video models read cinematography vocabulary. Use it. ## Shot framing - **Wide / establishing** — full environment, character small - **Full** — character head-to-toe - **Medium** — character waist-up - **Medium close-up** — chest-up; the conversational frame - **Close-up** — face fills frame - **Extreme close-up** — single feature (eye, hand, object) - **Over-the-shoulder** — POV-adjacent for dialogue - **Bird's eye / overhead** — top-down - **Low angle** — looking up at subject (heroic / threatening) ## Camera moves - **Static / locked** — no movement - **Push-in / dolly in** — moving toward subject - **Pull-out / dolly out** — moving away - **Pan** — horizontal pivot - **Tilt** — vertical pivot - **Tracking / follow** — moving alongside subject - **Crane / boom** — vertical lift - **Whip pan** — fast pan with motion blur - **Dutch angle** — tilted horizon (tension) ## Lighting - **Golden hour** — warm low sun, long shadows - **Blue hour** — twilight, cool palette - **Rembrandt** — single key light, triangle on the cheek - **Practical** — visible in-frame light source (lamp, screen) - **High key** — bright, low contrast (commercial) - **Low key** — dark, high contrast (drama) - **Backlit / silhouette** — light behind subject - **Soft window light** — diffuse natural ## Motion + pacing - **Slow-motion** / **120fps** — extreme slow-mo - **Time-lapse** — accelerated time - **Hand-held** — slight shake, organic feel - **Steady-cam** — smooth despite movement - 
**Static then push** — first second still, then push-in ## Texture / look - **35mm film** / **16mm film** — grain, character - **Anamorphic** — 2.39:1 with horizontal flares - **Cinematic, shallow depth of field** - **High contrast, color graded** ## Avoiding common failure modes - **Don't over-specify motion in 5s shots.** Models compress motion; asking for "running through a forest, jumping over a log, then climbing a tree" in 5s yields a blur. Pick one motion beat. - **Subject FIRST, then style.** Models read the first 30 tokens most closely. Lead with what's in frame. - **One scene per shot.** Cuts within a shot are unreliable across all models. Use the storyboard for cuts. ## Templates that work ### Talking head > *Medium close-up of [person], [emotion]. [Lighting]. Slight slow zoom-in. Cinematic, shallow depth of field, 35mm.* ### Product reveal > *Overhead top-down shot of [product] on [surface], [lighting]. Slow downward dolly. Minimalist editorial.* ### Action beat > *[Subject] [doing one action] in [environment], [time of day]. Tracking shot. Hand-held energy.* ### B-roll > *[Setting] at [time], [lighting], [weather]. Slow [motion]. Quiet, observational.* ### Provider options (advanced knobs) > Every per-model providerOptions field across image, video, and voice — exact names, allowed values, and examples. # Provider options (advanced knobs) Every generation endpoint accepts an optional **`providerOptions`** object that is forwarded verbatim to the underlying model. It is how you reach a provider's *full* capability — resolution, negative prompts, voice emotion, output format, and more — beyond the common fields (`prompt`, `aspectRatio`, `durationSec`). Rules of the road: - **Optional.** Omit `providerOptions` entirely and you get sensible defaults. - **Per-provider.** Each model reads only the fields it understands. Fields a model doesn't support are ignored (or rejected by that provider with a clear error) — they never silently corrupt a request. 
- **Forwarded on three endpoints:** `POST /api/v1/images`, `POST /api/v1/video/generate`, and `POST /api/tts` (voice also takes top-level `format` and `speed`). `POST /api/music` uses its own fields (`tags`, `instrumental`, `lyrics`, `title`) instead.

All field names below are confirmed against each provider's live API.

---

## Images — `POST /api/v1/images`

```jsonc
{
  "prompt": "a glossy red paperclip on white, studio light",
  "modelId": "flux-1.1-pro",
  "aspectRatio": "1:1",
  "providerOptions": { "outputFormat": "jpg", "outputQuality": 80 }
}
```

### OpenAI — `dall-e-3` (gpt-image-1)

| Field | Type / values | Notes |
|---|---|---|
| `quality` | `"low"` \| `"medium"` \| `"high"` \| `"auto"` | render quality |
| `background` | `"transparent"` \| `"opaque"` \| `"auto"` | transparent needs png/webp |
| `outputFormat` | `"png"` \| `"jpeg"` \| `"webp"` | |
| `outputCompression` | `0`–`100` | jpeg/webp only |
| `moderation` | `"low"` \| `"auto"` | |

### fal — `flux-1-schnell`, `ideogram-v3`, `recraft-v3`

| Field | Applies to | Type / values |
|---|---|---|
| `numInferenceSteps` | flux | integer (schnell 1–12, optimal 4) |
| `enableSafetyChecker` | flux | boolean |
| `guidanceScale` | flux (image-to-image edit) | number |
| `style` | ideogram, recraft | string (recraft: `"realistic_image"`, `"digital_illustration"`, `"vector_illustration"`, …) |
| `negativePrompt` | ideogram | string (flux/recraft have no negative prompt) |
| `renderingSpeed` | ideogram | `"TURBO"` \| `"BALANCED"` \| `"QUALITY"` |
| `expandPrompt` | ideogram | boolean |
| `strength` | any (image-to-image edit) | `0`–`1`, how far to deviate from the source |

### Replicate — `flux-1.1-pro`

| Field | Type / values | Notes |
|---|---|---|
| `outputFormat` | `"png"` \| `"jpg"` \| `"webp"` | |
| `outputQuality` | `0`–`100` | jpg/webp |
| `safetyTolerance` | `1`–`6` | 1 = strict, 6 = lax |
| `promptUpsampling` | boolean | |
| `imagePrompt` | image URL | Flux Redux image-to-image (also falls back to `referenceImageUrl`) |
| `width`, `height` | `256`–`1440` | both required; overrides the aspect preset |

> **Note:** `flux-1.1-pro` has no `steps`/`guidance` fields (those belong to `flux-dev`).

### Google — `imagen-4`, `imagen-4-fast`

| Field | Type / values |
|---|---|
| `personGeneration` | `"dont_allow"` \| `"allow_adult"` \| `"allow_all"` |

> **Note:** Imagen 4 does **not** support `negativePrompt`, `seed`, or `addWatermark`. Sending them to the provider returns HTTP 400, so they are dropped from the request before it is forwarded.

---

## Video — `POST /api/v1/video/generate`

```jsonc
{
  "prompt": "a red paper airplane glides over a city, smooth aerial motion",
  "modelId": "seedance-4.5",
  "aspectRatio": "16:9",
  "durationSec": 5,
  "providerOptions": { "resolution": "1080p", "cameraFixed": true }
}
```

### fal — `veo-3.1`, `kling-3.0`, `seedance-4.5`

| Field | Applies to | Type / values |
|---|---|---|
| `resolution` | veo, seedance | veo: `"720p"` \| `"1080p"` \| `"4k"`; seedance: `"480p"` \| `"720p"` \| `"1080p"` |
| `safetyTolerance` | veo | `"1"`–`"6"` (string) |
| `autoFix` | veo | boolean (rewrite prompt to pass moderation) |
| `generateAudio` | veo, kling | boolean — native audio, **default true**; set `false` for silent |
| `cfgScale` | kling | number (guidance) |
| `shotType` | kling | `"customize"` \| `"intelligent"` |
| `cameraFixed` | seedance | boolean (locked camera) |
| `enableSafetyChecker` | seedance | boolean |

### WaveSpeed (Varosity Credits) — `ws-pika-2.2`, `ws-luma-ray-2`, `ws-hailuo-02`, `ws-runway-gen4`

| Field | Applies to | Type / values |
|---|---|---|
| `seed` | all | number |
| `negativePrompt` | all | string |
| `enablePromptExpansion` | hailuo | boolean |
| `loop` | luma | boolean |
| `size` | pika/luma (text-to-video) | `"1280*720"` \| `"720*1280"` |
| `image` | pika/luma/hailuo | image URL — auto-routes to image-to-video |

> Passing a `referenceImageUrl` on pika/luma/hailuo also triggers image-to-video automatically.
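
A minimal sketch of how a client might pre-check `providerOptions` for the WaveSpeed models before calling `POST /api/v1/video/generate`. The field sets are copied from the WaveSpeed table above. The helper names `build_wavespeed_request` and `expects_image_to_video` are hypothetical, and placing `referenceImageUrl` at the top level of the body is an assumption. Routing itself happens on the server; this only predicts it from the documented rule.

```python
from typing import Any, Dict, Optional

# Fields documented in the WaveSpeed table, keyed by model id.
WAVESPEED_FIELDS = {
    "ws-pika-2.2": {"seed", "negativePrompt", "size", "image"},
    "ws-luma-ray-2": {"seed", "negativePrompt", "loop", "size", "image"},
    "ws-hailuo-02": {"seed", "negativePrompt", "enablePromptExpansion", "image"},
    "ws-runway-gen4": {"seed", "negativePrompt"},
}

# Models the guide says switch to image-to-video when given an image.
IMAGE_TO_VIDEO_MODELS = {"ws-pika-2.2", "ws-luma-ray-2", "ws-hailuo-02"}


def build_wavespeed_request(
    model_id: str,
    prompt: str,
    provider_options: Optional[Dict[str, Any]] = None,
    reference_image_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a request body, rejecting fields the table does not list (hypothetical helper)."""
    options = dict(provider_options or {})
    unknown = sorted(set(options) - WAVESPEED_FIELDS[model_id])
    if unknown:
        raise ValueError(f"{model_id} does not document: {', '.join(unknown)}")
    body: Dict[str, Any] = {"prompt": prompt, "modelId": model_id}
    if options:
        body["providerOptions"] = options
    if reference_image_url is not None:
        # Assumption: referenceImageUrl is a top-level body field.
        body["referenceImageUrl"] = reference_image_url
    return body


def expects_image_to_video(body: Dict[str, Any]) -> bool:
    """Predict image-to-video routing from the documented rule (not a server guarantee)."""
    has_image = "image" in body.get("providerOptions", {}) or "referenceImageUrl" in body
    return has_image and body["modelId"] in IMAGE_TO_VIDEO_MODELS
```

Rejecting unknown fields on the client is a design choice for this sketch: the guide says unsupported fields are ignored or rejected by the provider, so failing early only makes typos easier to spot.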
### muapi (Varosity Credits) — `muapi-wan-t2v`, `muapi-wan-effects`

| Field | Applies to | Type / values |
|---|---|---|
| `resolution` | both | `"480p"` \| `"720p"` |
| `quality` | both | `"medium"` \| `"high"` |
| `negativePrompt` | wan-t2v | string |
| `effect` | wan-effects | effect name, e.g. `"Cakeify"`, `"Inflate"`, `"VHS Footage"`, `"Samurai It"`, `"Film Noir"` (validated against muapi's catalog — unknown names error) |

### Replicate — `kling-3.0-replicate`

| Field | Type / values |
|---|---|
| `generateAudio` | boolean (default true; `false` = silent) |
| `endImage` | image URL (last-frame keyframe) |

### Hailuo direct — `hailuo-02` *(gated; Credits route uses `ws-hailuo-02`)*

| Field | Type / values |
|---|---|
| `resolution` | `"768P"` \| `"1080P"` (1080P is 6s-only) |
| `promptOptimizer` | boolean |

---

## Voice / TTS — `POST /api/tts`

Voice takes two top-level fields plus `providerOptions`:

```jsonc
{
  "text": "Hello from Varosity.",
  "voiceModel": "cartesia-sonic-2",
  "format": "wav",   // mp3 | wav | ogg | opus | aac | flac | pcm
  "speed": 1.1,      // OpenAI 0.25–4.0; ElevenLabs also reads it; other providers ignore it
  "providerOptions": { "language": "en", "emotion": ["positivity:high"] }
}
```

### OpenAI — `openai-tts-1`, `openai-tts-1-hd`

Uses the top-level `speed` (0.25–4.0) and `format` (mp3/opus/aac/flac/wav/pcm). No extra `providerOptions`.

### Cartesia — `cartesia-sonic-2`

| Field | Type / values |
|---|---|
| `language` | `"en"`, `"es"`, `"fr"`, … (sonic-2 is multilingual) |
| `speed` | `"slowest"` \| `"slow"` \| `"normal"` \| `"fast"` \| `"fastest"` |
| `emotion` | string[] e.g. `["positivity:high", "curiosity:low"]` |
| `container` | `"mp3"` \| `"wav"` \| `"raw"` |
| `sampleRate` | number (default 44100) |
| `bitRate` | number (mp3, default 128000) |
| `encoding` | `"pcm_s16le"`, `"pcm_f32le"`, … (wav/raw) |

### Fish Audio — `fish-audio-tts`

| Field | Type / values |
|---|---|
| `model` | `"speech-1.5"` \| `"speech-1.6"` \| `"s1"` (default `speech-1.6`) |
| `mp3Bitrate` | `64` \| `128` \| `192` |
| `normalize` | boolean |
| `latency` | `"normal"` \| `"balanced"` |
| `speed` | number (prosody, 1.0 = normal) |
| `volume` | number (prosody) |
| `temperature`, `topP` | number |

### Deepgram — `deepgram-aura-2`

| Field | Type / values |
|---|---|
| `encoding` | `"mp3"` \| `"linear16"` \| `"flac"` \| `"opus"` \| `"aac"` \| `"mulaw"` \| `"alaw"` |
| `sampleRate` | number (required for linear16) |
| `bitRate` | number (mp3/opus/aac) |
| `container` | `"none"` \| `"wav"` \| `"ogg"` |

> Tip: passing top-level `"format": "wav"` is enough — Deepgram maps it to `linear16` in a WAV container for you.

### ElevenLabs — `elevenlabs-tts`

Uses top-level `speed`, `format`, plus `stability`, `similarity`, `style`, `speakerBoost` on the request body.

---

## Music — `POST /api/music`

Music does not use `providerOptions`. Use the dedicated fields: `tags` (genre/mood), `instrumental` (boolean), `lyrics` (custom-lyrics mode), and `title`.

### Save as workflow template

> Turn a project into a reusable template; instantiate from UI, CLI, or MCP.

# Save as workflow template

Built a project structure you'll want to reuse? Save it as a workflow template. The graph (shots, models, durations, music attachment, voice attachment) gets snapshotted; outputs are NOT included.

## From the studio

Open a project → top-right overflow menu → **Save as template**.

- **Name** — the human-friendly title (e.g. "Founder podcast promo")
- **Description** — what this template is for
- **Inputs** — variables the user fills in when instantiating (e.g.
`host_name`, `episode_title`, `cta_url`)
- **Public** — opt-in to share at `/workflows`

## Instantiating

### From the UI

Dashboard → **Start from template** → pick yours from the grid.

### From the CLI

```bash
varosity workflow run founder-podcast-promo \
  --input host_name="Jon Kludt" \
  --input episode_title="Building Varosity" \
  --out promo.mp4
```

### From MCP

```
{
  "tool": "run_workflow",
  "args": {
    "slug": "founder-podcast-promo",
    "inputs": { "host_name": "Jon Kludt", "episode_title": "Building Varosity" }
  }
}
```

## Why this matters for agents

A workflow template turns a multi-step Varosity flow into a one-line call from any agent. Combined with the MCP server, an autonomous ops agent can produce a full marketing reel from a topic spec without ever opening the studio.

## Versioning

Templates are immutable once published — edits create a new version (`founder-podcast-promo@2`). Old runs continue to work against the version they were started with.

### Using the CLI

> Install @varosity/cli; render videos and music from the terminal.

# Using the CLI

The CLI is for the cases where the browser is too slow or too closed — agentic flows, headless rendering, automation pipelines.

## Install

```bash
npm install -g @varosity/cli
varosity --version
```

## Pair with your account

```bash
varosity login
# → opens https://varosity.ai/app/keys/api-keys
# → paste your vsk_… token when prompted
```

The token is stored at `~/.varosity/config.json` (mode 0600). Never commit this file.

## Common commands

### Render one video

```bash
varosity generate video kling-3.0-replicate \
  --prompt "wide cinematic establishing shot of a desert at golden hour" \
  --duration 6 \
  --out shot1.mp4
```

The CLI submits, polls every few seconds, and downloads when done.

### Generate a voiceover

```bash
varosity voices list --json | jq -r '.[].providerVoiceId' | head -3
# → 21m00Tcm4TlvDq8ikWAM
# → ErXwobaYiN019PkySvjV
# → EXAVITQu4vr4xnSDxMaL

varosity generate voice \
  --text "Welcome back. Here's what's new this week." \
  --voice 21m00Tcm4TlvDq8ikWAM \
  --out vo.mp3
```

### Background music

```bash
varosity generate music elevenlabs-music \
  --prompt "cinematic synth, slow build, hopeful" \
  --duration 30 \
  --out bg.mp3
```

### Smart Route

```bash
varosity route suggest --shot "talking head close-up with synced lips" --json
```

### Watch a long render

```bash
varosity generate video veo-3.1 --prompt "..." --duration 8
# → returns jobId
varosity job wait <jobId>
# blocks until done, prints progress, downloads if --out was set
```

## Agent mode

`--json` on any command makes the output stable and machine-parseable. For Claude Code / Cursor / any MCP host, run:

```bash
varosity mcp
```

…which prints the JSON snippet to drop into Claude Desktop's `claude_desktop_config.json`. The MCP endpoint exposes 11 tools (see the agent-mode guide).

## Override the API base

```bash
VAROSITY_API_BASE=https://staging.varosity.ai varosity models list
```

Useful for staging environments or self-hosted instances.

### Varosity Agent Video SDK

> Production-grade video generation for any agent. One API key, all platforms, no external dependencies.

# Varosity Agent Video SDK

**Production-grade video generation for any agent.**

Design philosophy: **Agent-first, platform-agnostic, dependency-free.** Any agent can integrate this SDK and start generating cinema-quality videos in minutes — no external workflow required.

---

## Core Promise

```
Any Agent + Varosity API Key + This Skill = Production Video
```

Generate professional videos for:

- Marketing campaigns
- Real estate property tours
- E-commerce product showcases
- Educational content
- Finance/startup pitches
- Travel content
- Social media (Instagram, TikTok, YouTube)
- Custom branding

---

## What This SDK Does

### 1. Unified Platform Access

```
Single Varosity API Key → All Platforms
├── Video: Kling 3.0, Veo 3.1, Runway, Seedance
├── Music: Suno AI
├── Voice: ElevenLabs
├── Rendering: Varosity Composite Engine
└── Billing: Single Credits pool
```

### 2. Splash Frame Consistency

```
Generate reference image (visual bible)
        ↓
All video shots inherit same aesthetic
        ↓
Result: Visually cohesive production
```

### 3. Intelligent Compositing

```
Multiple shots (parallel generation)
        ↓
Audio sync (music + voice-over)
        ↓
Auto-transitions & effects
        ↓
Single production-ready MP4
```

### 4. Optional Branding

```
YOUR branding: Logo, colors, tagline, location
    OR
No branding: Plain professional output
    OR
Custom: Inject anything you want
```

---

## Universal Architecture

The SDK is **decomposed into modules** so agents can use what they need:

```python
from varosity_sdk import VarosityAgent

# Minimal usage
agent = VarosityAgent(api_token="vsk_...")
video = agent.generate_simple_video(
    concept="Product showcase",
    duration_sec=30
)

# OR full orchestration
video = agent.orchestrate_production(
    project_title="Real Estate Tour",
    shots=[...],
    audio={...},
    branding={...},
    end_card={...}
)

# OR just parts
splash = agent.generate_splash_frame(mood="luxury")
shots = agent.generate_video_shots(descriptions=[...])
# duration_sec has no default in the API reference, so pass it explicitly
music = agent.generate_music(style="ambient", duration_sec=30)
# ... compose yourself
```

**Agents choose their level of complexity.**

---

## Use Case Examples

### Marketing Agency Agent

```python
for product in products:
    video = agent.generate_product_video(
        product_name=product["name"],
        product_description=product["description"],
        images=[...],
        branding={
            "logo": "agency_logo.png",
            "colors": ["#FF6B35", "#004E89"],
            "tagline": "Crafted for You"
        }
    )
# Result: Production-ready MP4s for Instagram/TikTok
```

### Real Estate Agent

```python
video = agent.generate_property_tour(
    property_address="123 Ocean View Drive",
    property_images=[exterior, living_room, kitchen, bedrooms],
    highlights=["Ocean views", "Chef's kitchen", "Smart home", "Pool & spa"],
    price="$2.5M",
    branding={
        "agent_name": "Sarah Chen",
        "brokerage": "Luxury Homes Co.",
        "contact": "sarah@luxuryhomes.com"
    }
)
# Result: Ready for MLS listing, Instagram Stories, YouTube Shorts
```

### E-Commerce Agent

```python
for sku in inventory:
    video = agent.generate_product_showcase(
        product=sku,
        angles=["front", "side", "detail", "in-use"],
        music_style="upbeat_modern",
        end_card={"call_to_action": "Shop Now", "link": sku.product_url}
    )
# Result: Video catalog ready for website, ads, social media
```

### Finance / Startup Agent

```python
video = agent.generate_pitch_video(
    company_name="TechStartup",
    problem_statement="Healthcare data fragmentation",
    solution="AI-powered unification platform",
    slides=[...],
    narration="Our platform solves...",
    branding={"colors": ["#0A1428", "#06A564"]}
)
# Result: 2-minute explainer ready for investor meetings
```

### Educational Content Agent

```python
video = agent.generate_educational_video(
    topic="Python Web Development",
    sections=[
        {"title": "Setup", "duration": 3},
        {"title": "Basic Concepts", "duration": 5},
        {"title": "Live Demo", "duration": 8},
        {"title": "Best Practices", "duration": 4}
    ],
    code_snippets=[...],
    course_branding={"logo": "academy_logo.png", "watermark": "CoursePlatform.com"}
)
# Result: Professional course video ready for YouTube/Udemy
```

### Travel Content Agent

```python
video = agent.generate_travel_montage(
    destination="Iceland",
    highlights=["Waterfalls", "Northern Lights", "Local Culture"],
    travel_photos=[...],
    music={"style": "cinematic_adventure", "mood": "inspiring"},
    branding={"travel_blog_name": "World Wanderer", "social_handles": "@wanderer"}
)
# Result: TikTok/Instagram Reels/YouTube Shorts ready
```

---

## API Reference

### Basic Generation

```python
from typing import Dict, List, Optional


class VarosityAgent:
    def generate_simple_video(
        self,
        concept: str,
        duration_sec: int = 30,
        style: str = "cinematic"
    ) -> "ProductionResult":
        """
        Simplest interface. AI figures out everything.

        Args:
            concept: "Product showcase", "Property tour", etc.
            duration_sec: Target video length
            style: "cinematic", "professional", "casual", "energetic"

        Returns:
            ProductionResult with video_url, local_path, specs
        """

    def generate_video_shots(
        self,
        descriptions: List[str],
        platform_selection: str = "auto"
    ) -> List[str]:
        """Generate individual shots without orchestration."""

    def generate_music(
        self,
        style: str,
        duration_sec: int,
        mood: str = "neutral"
    ) -> str:
        """Generate original music via Suno."""

    def generate_voice(
        self,
        text: str,
        voice_id: str = "default",
        pace: str = "natural"
    ) -> str:
        """Generate narration via ElevenLabs."""

    def orchestrate_production(
        self,
        project_title: str,
        shots: List[Dict],
        audio: Optional[Dict] = None,
        branding: Optional[Dict] = None,
        end_card: Optional[Dict] = None
    ) -> "ProductionResult":
        """Full 8-stage pipeline."""
```

### Branding (Optional)

```python
branding_none = None  # Plain professional video

branding_simple = {
    "logo": "my_logo.png",
    "tagline": "Your tagline here"
}

branding_full = {
    "logo": "logo.png",
    "tagline": "Your tagline",
    "colors": {
        "primary": "#FF6B35",
        "secondary": "#004E89",
        "accent": "#FFFFFF"
    },
    "location_text": "Optional city/website",
    "style": "modern"  # or "classic", "minimal", "premium"
}
```

### No External Dependencies

```python
# Just generate and download — no third-party integrations required
result = agent.orchestrate_production(
    project_title="My Video",
    shots=[...]
)
print(f"Video saved to: {result.local_path}")
# Agent does whatever it wants with the file
```

---

## Integration Guide

Works with **any agent** — Hermes, OpenClaw, LangChain, CrewAI, AutoGen, n8n, custom Python scripts, or anything else that can run Python and make HTTP requests.

### Step 1: Install

**Any agent / generic:**

```bash
pip install varosity-sdk
# or copy varosity_sdk.py directly into your agent's working directory
```

**Hermes agents:**

```bash
hermes skills install varosity-agent-video-sdk
# or manually place in ~/.hermes/skills/mlops/
```

**OpenClaw / other agent frameworks:** drop `varosity_sdk.py` into your skills/tools directory and import it like any other module.

### Step 2: Get Your API Key

Sign up at [varosity.ai](https://varosity.ai), generate an API key (`vsk_…`) from `/app/keys`, then:

```bash
export VAROSITY_API_KEY="vsk_..."
```

### Step 3: Use in Your Agent

```python
import os

from varosity_sdk import VarosityAgent


# Written as a method; place it on your own agent class.
def generate_content(self, request):
    varosity = VarosityAgent(api_token=os.getenv("VAROSITY_API_KEY"))
    video = varosity.orchestrate_production(
        project_title=request["title"],
        shots=request["shots"],
        audio=request.get("audio"),
        branding=request.get("branding")
    )
    return {
        "status": "success",
        "video_url": video.final_url,
        "local_path": video.local_path,
        "duration": video.duration_sec
    }
```

### Step 4: Scale to Any Use Case

```python
# Social media agent
video = varosity.generate_simple_video(concept="Daily motivation quote", style="inspirational")

# Real estate agent
video = varosity.generate_property_tour(address="123 Main St", images=[...])

# E-commerce agent
video = varosity.generate_product_showcase(product=sku, angles=["front", "side", "in-use"])

# Finance agent
video = varosity.generate_pitch_video(company=startup, slides=[...])
```

---

## Competitive Positioning

| Feature | OpenArt Smart Shot | Varosity Agent SDK |
|---------|-------------------|--------------------|
| **Setup** | Web UI required | Any agent (Hermes, OpenClaw, LangChain, etc.) |
| **Audio** | None | Music + voice built-in |
| **Branding** | Not supported | Full customization or none |
| **Control** | Low (automated) | High (modular) |
| **Platforms** | Single (Seedance) | Kling · Veo · Runway · Seedance |
| **Composability** | No (monolithic) | Yes (use any part) |
| **Universal** | Web platform only | Any agent can use |
| **Use Cases** | Video generation | Any workflow |

---

## Roadmap

```
v2.1  Template library (marketing, real estate, e-commerce, finance)
v2.2  Local caching — splash frame cache, credit optimization
v2.3  Batch processing — parallel jobs, queue management, cost analytics
v2.4  Real-time monitoring — job status API, credit tracking dashboard
```

---

## Files Structure

```
~/.hermes/skills/mlops/varosity-agent-video-sdk/
├── SKILL.md (this file)
├── scripts/
│   ├── varosity_sdk.py (main SDK)
│   ├── orchestrator.py (8-stage pipeline)
│   ├── platform_selection.py (smart model routing)
│   └── branding_engine.py (optional branding)
├── references/
│   ├── api-reference.md
│   ├── use-cases.md
│   ├── troubleshooting.md
│   └── best-practices.md
├── templates/
│   ├── simple-video.json
│   ├── product-showcase.json
│   ├── property-tour.json
│   ├── pitch-video.json
│   ├── educational.json
│   ├── travel-montage.json
│   └── custom-branding.json
└── examples/
    ├── marketing_agent.py
    ├── real_estate_agent.py
    ├── ecommerce_agent.py
    ├── finance_agent.py
    ├── education_agent.py
    └── social_media_agent.py
```

---

**One API key. All platforms. Any agent. Pure video production.**

✅ Marketing agents — campaign videos
✅ Real estate agents — property tours
✅ E-commerce agents — product showcases
✅ Finance agents — pitch animations
✅ Social media agents — TikTok/Reels
✅ Education agents — course tutorials
✅ Travel agents — destination montages
✅ Custom agents — whatever they need

### Voice over your video

> Add narration with ElevenLabs, OpenAI TTS, Deepgram, or Cartesia.

# Voice over your video

Four providers, one workflow. Pick the right one:

| Provider         | Strength                               | Pricing tier |
|------------------|----------------------------------------|--------------|
| ElevenLabs       | Highest quality + voice cloning        | $$           |
| OpenAI TTS-1     | Fast, reliable, six built-in voices    | $            |
| Deepgram Aura 2  | Lowest latency, cheapest               | $            |
| Cartesia Sonic 2 | Ultra-low latency, large voice library | $            |

## Agents: discover valid voice ids first

Call **`list_voices`** before `generate_voice`. It returns every built-in voice with its `voiceModel` + exact `voiceId` (e.g. `openai-tts-1` → `nova`, `deepgram-aura-2` → `aura-2-thalia-en`). Voice ids are **provider-specific** — passing the wrong scheme (e.g. `aura` to Deepgram) safely falls back to that provider's default voice, but using the exact id gives you the voice you want.

## Pull your voice library

1. Add the provider's key in **Settings → Keys**.
2. Visit `/voices`. Hit **Sync** on the provider's row. Your library lands in the grid.
3. Click ▶ on any voice to preview.

## Generate audio for a shot

In the shot inspector, expand **Avatar**. The audio row has two tabs:

- **Upload** — drop an mp3/wav you already have.
- **From script** — paste text, pick a voice, hit Generate. The selected provider synthesizes the audio (via your key on BYOK, or Varosity Credits), the file uploads to Storage, and it attaches to the shot in one click.

## Voice cloning (ElevenLabs only)

In-app cloning on `/voices` is planned for v2. Until then, clone the voice in the ElevenLabs UI and `/voices` will sync the cloned voice on next refresh.
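
Returning to the discovery step earlier in this guide: because a mismatched voice id falls back to the provider's default instead of failing, an agent may want to check the id itself before generating. The sketch below assumes `list_voices` yields records with `voiceModel` and `voiceId` keys, as the guide describes; the helper names `resolve_voice` and `build_tts_body`, and the sample list, are hypothetical.

```python
from typing import Dict, List

# Hypothetical sample of what list_voices might return, using the two
# voiceModel -> voiceId pairs named in this guide.
SAMPLE_VOICES: List[Dict[str, str]] = [
    {"voiceModel": "openai-tts-1", "voiceId": "nova"},
    {"voiceModel": "deepgram-aura-2", "voiceId": "aura-2-thalia-en"},
]


def resolve_voice(voices: List[Dict[str, str]], voice_model: str, voice_id: str) -> Dict[str, str]:
    """Return the model/id pair only if list_voices actually advertised it."""
    for voice in voices:
        if voice["voiceModel"] == voice_model and voice["voiceId"] == voice_id:
            return {"voiceModel": voice_model, "voiceId": voice_id}
    valid = sorted(v["voiceId"] for v in voices if v["voiceModel"] == voice_model)
    raise ValueError(f"{voice_id!r} is not a {voice_model} voice; known ids: {valid}")


def build_tts_body(text: str, voice: Dict[str, str]) -> Dict[str, str]:
    """Assemble a TTS request body from text plus a validated voice."""
    return {"text": text, **voice}


body = build_tts_body(
    "Welcome back.",
    resolve_voice(SAMPLE_VOICES, "deepgram-aura-2", "aura-2-thalia-en"),
)
```

Passing `"aura"` for `deepgram-aura-2` raises `ValueError` here, which surfaces the mistake that the API would otherwise absorb by using the default voice.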
## Tips

- Keep narration scripts under 30 seconds per shot. Longer reads stretch past video durations and clip awkwardly.
- For dialogue (lip-sync to the on-screen face), use **Veo 3.1 + Full avatar mode** instead of TTS over a separate clip. Veo synthesizes the lip motion from the audio.
- TTS audio attached to a shot becomes its primary audio track in the stitch. Background music ducks under it automatically — see the music guide.
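
To apply the first tip before spending a generation, a script's length can be estimated from its word count. This is a rough sketch: the 150 words-per-minute rate is an assumed rule of thumb, not a Varosity or provider figure, and the function names are hypothetical. The `speed` argument stands in for a numeric speed multiplier such as the top-level `speed` field.

```python
WORDS_PER_MINUTE = 150.0  # assumed average speaking rate; tune per voice


def estimate_narration_seconds(script: str, speed: float = 1.0) -> float:
    """Estimate spoken duration from word count and a speed multiplier."""
    words = len(script.split())
    return words / (WORDS_PER_MINUTE * speed) * 60.0


def fits_shot(script: str, shot_sec: float, speed: float = 1.0) -> bool:
    """True when the estimated read time fits inside the shot."""
    return estimate_narration_seconds(script, speed) <= shot_sec


script = "Welcome back. Here's what's new this week."
print(round(estimate_narration_seconds(script), 1), fits_shot(script, shot_sec=6))
```

Real voices vary in pace, so treat the estimate as a pre-flight check and confirm against the generated audio's actual length.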