From Single Clips to Full Sequences: The 5 Paradigms of Multi-Shot Generation
Every AI video tool generates great 10-second clips. String twelve of them together and you get a slideshow. The gap between "one good shot" and "twelve shots that feel like a film" is where the entire multi-shot generation field lives, and it's split into five fundamentally different approaches. Each makes different tradeoffs on consistency, flexibility, compute cost, and output quality.
If you're building a multi-shot pipeline, you need to pick a paradigm. Here's how to choose.
1. Stitching: generate separately, smooth the seams
Generate each shot independently from its own prompt, then apply cross-shot smoothing at the boundaries to reduce visual discontinuity.
VideoGen-of-Thought (arXiv 2412.02259, NeurIPS 2025 Workshop Oral) is the clearest implementation. Its pipeline runs script → keyframes → per-shot video → smoothing. Version 2 adds five-domain shot specification (character dynamics, background continuity, relationship evolution, camera, lighting) and reports 20.4% better within-shot face consistency, 17.4% better style consistency, 100% better cross-shot consistency, and 10x fewer manual adjustments than the MovieDreamer and DreamFactory baselines.
FilmWeaver (arXiv 2512.11274) extends stitching with cache-guided autoregressive diffusion — each shot conditions on a compressed cache of prior shots rather than generating blind.
Pick stitching when: you need to use different models for different shots (Runway for landscapes, Kling for characters, Pika for action), or you're working with existing single-shot models and can't retrain. Smoothing quality depends on your boundary mechanism, but because each shot is generated in isolation, there is a ceiling on how consistent the result can be.
Compute cost: Low. One generation per shot plus one smoothing pass.
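The stitching flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of VideoGen-of-Thought or FilmWeaver: frames are stand-in lists of floats, `generate_shot` is a hypothetical placeholder for any single-shot model, and the boundary smoothing is a plain linear cross-fade swapped in for whatever mechanism a real pipeline would use.

```python
from typing import Callable, List

Frame = List[float]  # stand-in for an image: a flat list of pixel values
Shot = List[Frame]


def crossfade(a: Shot, b: Shot, overlap: int) -> Shot:
    """Join two shots, blending the tail of `a` into the head of `b`."""
    overlap = max(0, min(overlap, len(a), len(b)))
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of shot b rises across the seam
        frame_a = a[len(a) - overlap + i]
        frame_b = b[i]
        blended.append([(1 - w) * pa + w * pb for pa, pb in zip(frame_a, frame_b)])
    return a[:len(a) - overlap] + blended + b[overlap:]


def stitch(prompts: List[str],
           generate_shot: Callable[[str], Shot],
           overlap: int = 2) -> Shot:
    """Generate every shot independently, then smooth each boundary."""
    sequence: Shot = []
    for prompt in prompts:
        shot = generate_shot(prompt)  # hypothetical model call; sees only its own prompt
        sequence = crossfade(sequence, shot, overlap) if sequence else list(shot)
    return sequence
```

Note that `generate_shot` never sees the other shots. That isolation is what keeps the cost low, and it is also why smoothing can only repair the seam, not a character whose appearance drifts between shots.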
2. Keyframe-based: anchor the endpoints, infill the motion
Generate a set of consistent keyframes — still images that define each shot's composition and character appearances — then generate video segments between them.
STAGE (arXiv 2512.12372, December 2025) predicts start-end frame pairs for each shot (their STEP2 approach), uses a multi-shot memory pack for long-range entity consistency, and applies dual-encoding for intra-shot coherence. They trained on ConStoryBoard, a large-scale dataset with fine-grained cinematic annotations.
DreamShot (arXiv 2604.17195, April 2026) uses video diffusion priors for the storyboard itself — not just images but video-aware keyframes. Their Role-Attention Consistency Loss constrains attention alignment during fine-tuning, so character identity is enforced at the architecture level.
IC-LoRA in Consistent Keyframe Synthesis (arXiv 2504.19894) fine-tunes FLUX to produce multiple images with consistent content from structured LLM-planned shot descriptions.
Pick keyframe-based when: character identity is your primary consistency concern and you can afford a fine-tuning step. Keyframes anchor identity at the endpoints; the infill model handles motion. The weakness: consistency violations can still creep in during infill because the video generation between keyframes operates in isolation.
Compute cost: Medium. Keyframe generation + per-shot video infill + optional fine-tuning upfront.
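The two-stage shape of the keyframe approach can be sketched as below. This is a rough sketch, not the STAGE or DreamShot implementation: `ShotPlan`, `make_keyframe`, and `infill` are hypothetical names, and strings stand in for images and video frames.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ShotPlan:
    description: str
    start_key: str  # stand-in for the shot's first keyframe image
    end_key: str    # stand-in for the shot's last keyframe image


def plan_keyframes(descriptions: List[str],
                   make_keyframe: Callable[[str, str], str]) -> List[ShotPlan]:
    """Stage 1: produce a start/end keyframe pair for every shot."""
    return [ShotPlan(d, make_keyframe(d, "start"), make_keyframe(d, "end"))
            for d in descriptions]


def infill_all(plans: List[ShotPlan],
               infill: Callable[[str, str, str], List[str]]) -> List[List[str]]:
    """Stage 2: generate each segment from its own keyframe pair only."""
    return [infill(p.start_key, p.end_key, p.description) for p in plans]
```

The second stage shows where the weakness comes from: each `infill` call receives only its own pair of keyframes, so anything not pinned down by those two frames is free to drift.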
3. Holistic: one model, multiple shots simultaneously
Train a single model to generate the entire multi-shot sequence at once, learning cross-shot consistency from film data.
HoloCine (arXiv 2510.20822, October 2025) is the most ambitious. It is built on Wan 2.2's 14B DiT backbone and trained on 400,000 multi-shot samples curated from films and TV, and it generates up to 13 shots and 60 seconds at cinematic quality. Its Sparse Inter-Shot Self-Attention keeps attention dense within shots (for motion coherence) and sparse across shots (for consistency without quadratic compute).
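To make the dense-within, sparse-across idea concrete, here is a toy attention mask. It is a sketch under one loud assumption: cross-shot attention is limited to the first `summary_tokens` tokens of every other shot. The description above does not say which cross-shot tokens HoloCine keeps, so treat that selection rule, and the function name, as illustrative only.

```python
from typing import List


def sparse_inter_shot_mask(shot_lengths: List[int],
                           summary_tokens: int = 1) -> List[List[bool]]:
    """Return mask[q][k] = True where query token q may attend to key token k."""
    shot_of: List[int] = []      # which shot each token belongs to
    pos_in_shot: List[int] = []  # token position inside its shot
    for shot_index, length in enumerate(shot_lengths):
        for pos in range(length):
            shot_of.append(shot_index)
            pos_in_shot.append(pos)

    n = len(shot_of)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            same_shot = shot_of[q] == shot_of[k]           # dense within a shot
            is_summary = pos_in_shot[k] < summary_tokens   # assumed sparse cross-shot link
            mask[q][k] = same_shot or is_summary
    return mask
```

With S shots of T tokens each, full attention touches (S·T)² pairs, while this mask keeps roughly S·T² within-shot pairs plus a small cross-shot term that grows with `summary_tokens`. That is the sense in which cross-shot consistency avoids the quadratic cost in total sequence length.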
Training required 128 NVIDIA H800 GPUs. The output is the best-looking multi-shot generation I've seen in the papers — because consistency isn't bolted on, it's learned from how real films are actually cut.
ShotStream (arXiv 2603.25746) makes holistic generation streaming — you can add shots incrementally rather than specifying all of them upfront. This enables interactive creative workflows where you generate shot 1, evaluate it, then generate shot 2 conditioned on the actual shot 1.
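The incremental workflow maps naturally onto a generator. The sketch below is an assumption-laden illustration, not ShotStream's interface: `generate` is a hypothetical callable that takes a prompt plus the shots produced so far, and strings stand in for video.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def stream_shots(prompts: Iterable[str],
                 generate: Callable[[str, Tuple[str, ...]], str]) -> Iterator[str]:
    """Yield shots one at a time, each conditioned on the shots actually produced."""
    history: List[str] = []
    for prompt in prompts:
        shot = generate(prompt, tuple(history))  # hypothetical model call
        history.append(shot)
        yield shot
```

Because the function yields after every shot, the caller can review shot 1 before asking for shot 2, and `prompts` can itself be a lazy iterable that decides the next prompt after seeing the previous result.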
Pick holistic when: you have the resources to train or fine-tune a large model and you want the most naturally cinematic output. The consistency is the best available because it's learned from real film data, not imposed through external constraints.
Don't pick it when: you need more than 13 shots, longer than 60 seconds, or the ability to swap generation models per shot. You're locked to one model's capabilities and aesthetic.
Compute cost: Very high for training (128 H800s). Low for inference (one forward pass for the whole sequence).
4. Agent-orchestrated: LLM plans, models execute
LLM agents wearing role-specific hats (director, cinematographer, screenwriter) plan the film's structure, shot list, and camera language. Specialized generation models execute each shot. Multi-agent collaboration patterns ensure quality through critique, debate, and validation loops.
This is the paradigm with the most papers: FilmAgent, Mind-of-Director, MovieAgent, Camera Artist, Co-Director, GenMAC, CineAGI, FilMaster. Article 1 in this series covers the collaboration patterns in detail.
Pick agent-orchestrated when: you need maximum flexibility — swap generation models, adjust the planning logic, add validation gates, customize the critic roles. This paradigm treats the planning and execution as independent layers, so you can upgrade either without rebuilding the other.
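The separation of planning and execution can be sketched as two small functions. Everything here is hypothetical scaffolding rather than the design of any system named above: `planner`, `critic`, and the `executors` registry are placeholder callables you would back with an LLM and with real generation models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Shot:
    description: str
    backend: str  # key into the executor registry


Planner = Callable[[str, List[str]], List[Shot]]  # (brief, critic notes) -> plan
Critic = Callable[[List[Shot]], List[str]]        # plan -> notes; empty means approved


def plan_with_critique(brief: str, planner: Planner, critic: Critic,
                       max_rounds: int = 3) -> List[Shot]:
    """Planning layer: revise the shot list until the critic has no notes."""
    plan = planner(brief, [])
    for _ in range(max_rounds):
        notes = critic(plan)
        if not notes:
            break
        plan = planner(brief, notes)
    return plan


def execute(plan: List[Shot],
            executors: Dict[str, Callable[[str], str]]) -> List[str]:
    """Execution layer: dispatch each shot to whichever model its plan names."""
    return [executors[shot.backend](shot.description) for shot in plan]
```

Swapping a generation model means changing one entry in `executors`; changing the planning logic or adding another critic touches only `plan_with_critique`. Each planner and critic call is also an LLM call, which is where the token cost comes from.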
Don't pick it when: your token budget is tight. Multi-agent collaboration costs 2-4x baseline tokens per stage, and a full pipeline with script development + blocking + camera planning + post-production audit can run to dozens of API calls per shot.
Compute cost: Medium-high. Dominated by LLM API costs (many calls) rather than GPU compute (one generation per shot).
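A back-of-envelope estimate shows how quickly the token bill grows. The helper below is a rough sketch: it assumes the 2-4x multiplier applies to every stage of every shot, and the baseline token count is a number you would measure yourself, not one reported anywhere above.

```python
from typing import Tuple


def estimate_llm_tokens(shots: int, stages: int, baseline_tokens: int,
                        multiplier: Tuple[float, float] = (2.0, 4.0)) -> Tuple[float, float]:
    """Return a (low, high) token estimate for a multi-agent pipeline."""
    if min(shots, stages, baseline_tokens) < 0:
        raise ValueError("inputs must be non-negative")
    base = shots * stages * baseline_tokens  # single-agent cost, one call per stage
    low, high = multiplier
    return base * low, base * high
```

For example, 12 shots across the four stages listed above (script development, blocking, camera planning, post-production audit) at a hypothetical 1,500 baseline tokens per stage gives 72,000 baseline tokens, so `estimate_llm_tokens(12, 4, 1500)` returns a range of 144,000 to 288,000 tokens.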
5. Transition-learned: teach the model how films cut
Train models specifically on cinematic transitions — how films cut between shots, how framing changes across edits, how shot-reverse-shot structures build dialogue rhythm.
CineTrans (arXiv 2508.11484) built Cine250K, a set of 250,000 shot transition annotations, and trained a masked diffusion model on film-style cuts. The model learns that a conversation uses shot-reverse-shot, an action sequence uses rapid cuts with increasing pace, and an establishing sequence uses a wide-to-medium-to-close progression.
ShotDirector (arXiv 2512.10286) built ShotWeaver40K with even richer editing pattern annotations. Their model learns controllable transitions — you can specify the transition type and the model generates accordingly.
Pick transition-learned when: your specific problem is ugly cuts between otherwise good shots. The "slideshow" problem (twelve good clips that feel like a PowerPoint when assembled) is what these approaches solve. They don't help with character consistency or camera language; they help with how shots connect.
Compute cost: Medium. Training on film transition datasets + inference.
The convergence
These paradigms aren't staying separate. VideoGen-of-Thought (stitching) added five-domain conditioning that looks like agent-orchestrated planning. DreamShot (keyframe) adopted video priors from holistic approaches. ShotStream (holistic) became interactive like agent-orchestrated systems. HoloCine (holistic) learns transitions natively from its film corpus.
The end state is probably: agent-orchestrated planning (from paradigm 4) dispatching to holistic generation models (paradigm 3) that have learned transitions from film data (paradigm 5), with memory banks maintaining entity consistency (from the consistency literature) and keyframe anchoring for long sequences (paradigm 2).
We're not there yet. But the pieces exist in different papers, waiting to be assembled. Whoever puts them together builds the tool that replaces the slideshow with cinema.