
Sequences · multi-shot video generation

MCTS for Shot Selection: Why Monte Carlo Tree Search Beats Single-Pass

Generate shot 1. Looks good. Generate shot 2. Looks good. Generate shot 3. Looks good. Assemble them. Shot 3's color grade clashes with shot 1. Shot 2's character is facing the wrong direction for the cut from shot 1 to work. The sequence fails even though every individual shot passes quality inspection.

This is the local vs global quality problem. Each shot is locally optimal — good in isolation — but the combination isn't globally optimal. The sequence quality depends on how shots interact, not just how each shot looks alone.

AniMaker (arXiv 2506.10540, June 2025) applies Monte Carlo Tree Search to this problem. Instead of accepting the first generation of each shot, generate multiple candidates per shot position and use MCTS to find the sequence that works best as a whole.

How MCTS applies to video

In game AI, MCTS explores a tree of possible moves, simulating many random playouts from each position to estimate which move leads to the best outcome several moves ahead. The search space grows as the branching factor (how many options per move) raised to the power of the depth (how many moves ahead).

For multi-shot video, the analogy maps cleanly. Each "move" is a shot. Each "option" is a generation candidate for that shot position. The "depth" is the number of shots in the sequence. The "playout" evaluates the entire sequence from that point forward.

Say you're generating a 6-shot scene with 3 candidates per shot. Naive approach: pick the best candidate for each position independently (18 generations, 6 independent selections). MCTS approach: explore the tree of possible sequences (3 options at each of 6 levels, so 3^6 = 729 possible sequences) and look for the path that scores best as a whole. You don't need to evaluate all 729, and that's the beauty of MCTS: the algorithm concentrates its search budget on promising branches and spends very little on unpromising ones.
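To make the arithmetic concrete, here is a small sketch of how the generation budget and the number of possible sequences scale. The function name `search_space` is my own, purely for illustration.

```python
def search_space(n_shots: int, n_candidates: int) -> tuple[int, int]:
    """Return (video generations needed, distinct sequences possible)."""
    generations = n_shots * n_candidates   # grows linearly with shot count
    sequences = n_candidates ** n_shots    # grows exponentially with shot count
    return generations, sequences


if __name__ == "__main__":
    for shots in (6, 12):
        gens, seqs = search_space(shots, 3)
        print(f"{shots} shots x 3 candidates: {gens} generations, {seqs} sequences")
```

The generation cost grows linearly while the number of sequences grows exponentially, which is why exhaustive evaluation stops being an option long before generation cost does.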

AniMaker's implementation uses their AniEval evaluation framework to score sequences. The scoring considers visual consistency across shots, narrative coherence, motion quality, and aesthetic appeal. The MCTS explores candidate sequences, evaluates them via AniEval, and converges on the sequence that maximizes the global score.
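As a rough illustration of the search loop, here is a minimal textbook MCTS (UCB1 selection, random playouts) over candidate indices. This is not AniMaker's implementation. `score_sequence` is a hypothetical stand-in for a sequence-level evaluator such as AniEval, and `toy_score` exists only so the sketch runs end to end.

```python
import math
import random


class Node:
    def __init__(self, prefix, parent=None):
        self.prefix = prefix    # candidate indices chosen for shots so far
        self.parent = parent
        self.children = {}      # candidate index -> Node
        self.visits = 0
        self.total = 0.0        # sum of playout scores through this node


def ucb1(child, parent_visits, c=1.4):
    # Average score plus an exploration bonus for rarely visited children.
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts_select_sequence(n_shots, n_candidates, score_sequence,
                         iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(())
    best_seq, best_score = None, float("-inf")
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and not a leaf.
        while len(node.prefix) < n_shots and len(node.children) == n_candidates:
            node = max(node.children.values(),
                       key=lambda ch: ucb1(ch, node.visits))
        # Expansion: add one untried candidate for the next shot position.
        if len(node.prefix) < n_shots:
            untried = [i for i in range(n_candidates) if i not in node.children]
            choice = rng.choice(untried)
            child = Node(node.prefix + (choice,), parent=node)
            node.children[choice] = child
            node = child
        # Playout: fill the remaining positions at random, score the whole sequence.
        remaining = n_shots - len(node.prefix)
        seq = node.prefix + tuple(rng.randrange(n_candidates)
                                  for _ in range(remaining))
        score = score_sequence(seq)
        if score > best_score:
            best_seq, best_score = seq, score
        # Backpropagation: credit every node on the path to the root.
        while node is not None:
            node.visits += 1
            node.total += score
            node = node.parent
    return best_seq, best_score


def toy_score(seq):
    # Hypothetical scorer: rewards adjacent shots that share a candidate index,
    # standing in for "these two shots cut together well". Needs >= 2 shots.
    matches = sum(1 for a, b in zip(seq, seq[1:]) if a == b)
    return matches / (len(seq) - 1)


if __name__ == "__main__":
    print(mcts_select_sequence(6, 3, toy_score, iterations=300))
```

In a real pipeline the expensive part is the scorer, so the iteration count is effectively your evaluation budget. The sketch returns the best full sequence seen during search rather than the most-visited path, which is a design choice of mine, not something taken from the paper.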

The cascade problem

This matters because of cascading dependencies. Shot 3's framing constrains what shot 4 can do. If shot 3 ends with the character on the left side of the frame, shot 4 needs to account for that — either continuing the left-side framing or crossing the axis deliberately. If you selected shot 3 based only on its individual quality, you might have picked a version whose ending position makes shot 4 much harder.

In a single-pass pipeline, this cascade is invisible until assembly. You generate each shot, each looks fine, then you see the sequence and realize shot 3 painted you into a corner. Regenerating shot 3 means re-evaluating shots 4, 5, and 6 because the cascade propagates forward.

MCTS handles this by evaluating sequences, not shots. The algorithm "sees" that a slightly worse shot 3 enables a much better shot 4-5-6 trajectory, and chooses accordingly. It sacrifices local quality for global quality — the right tradeoff for sequences.
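A toy example shows the tradeoff. The numbers below are invented: `QUALITY` holds per-shot scores for two candidates at each of three positions, and `TRANSITION` holds how well each pair of adjacent candidates cuts together. The sequence is small enough to enumerate exhaustively, which is what MCTS approximates when enumeration is too expensive.

```python
from itertools import product

# QUALITY[shot][candidate]: standalone quality (made-up values).
QUALITY = [[0.9, 0.8], [0.9, 0.85], [0.9, 0.8]]

# TRANSITION[i][a][b]: how well candidate a of shot i cuts into
# candidate b of shot i + 1 (made-up values).
TRANSITION = [
    [[0.2, 0.3], [0.9, 0.8]],
    [[0.3, 0.2], [0.9, 0.7]],
]


def sequence_score(seq):
    quality = sum(QUALITY[i][c] for i, c in enumerate(seq))
    flow = sum(TRANSITION[i][a][b]
               for i, (a, b) in enumerate(zip(seq, seq[1:])))
    return quality + flow


def greedy_pick():
    # Per-shot selection: best standalone candidate at every position.
    return tuple(max(range(len(opts)), key=opts.__getitem__)
                 for opts in QUALITY)


def global_pick():
    # Sequence-level selection: best total score over all combinations.
    all_sequences = product(*(range(len(opts)) for opts in QUALITY))
    return max(all_sequences, key=sequence_score)


if __name__ == "__main__":
    for name, seq in (("greedy", greedy_pick()), ("global", global_pick())):
        print(name, seq, round(sequence_score(seq), 2))
```

Greedy selection takes candidate 0 at every position because each one wins in isolation. The sequence-level pick is (1, 1, 0): it accepts slightly weaker first and second shots because they cut together far better.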

Co-Director's parallel approach

Co-Director (arXiv 2604.24842, Google, April 2026) applies Multi-Armed Bandit at a different level — creative direction rather than clip selection. MAB explores whether the sequence should be dramatic or comedic, fast-paced or contemplative, character-focused or environment-focused. It's explore/exploit at the narrative level.

MCTS and MAB address two different kinds of uncertainty. MAB handles "I don't know what kind of film to make" (creative direction uncertainty). MCTS handles "I know what kind of film to make but don't know which specific clips work best together" (execution uncertainty). You might use MAB to choose your narrative strategy, then MCTS within that strategy to select optimal clips.
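For readers who haven't met bandits before, here is a generic UCB1 sketch of explore/exploit over creative directions. It is not Co-Director's published method. The arm names and `reward_fn` (imagine a rating of a short preview rendered in that direction) are assumptions for illustration.

```python
import math


def ucb1_choose_direction(directions, reward_fn, rounds=300):
    """Return (direction with the best average reward, pull counts per direction)."""
    if rounds < len(directions):
        raise ValueError("need at least one round per direction")
    counts = {d: 0 for d in directions}
    totals = {d: 0.0 for d in directions}
    for t in range(1, rounds + 1):
        untried = [d for d in directions if counts[d] == 0]
        if untried:
            arm = untried[0]  # try every direction once before comparing
        else:
            # Exploit the best average, plus a bonus for under-sampled arms.
            arm = max(
                directions,
                key=lambda d: totals[d] / counts[d]
                + math.sqrt(2 * math.log(t) / counts[d]),
            )
        totals[arm] += reward_fn(arm)
        counts[arm] += 1
    best = max(directions, key=lambda d: totals[d] / counts[d])
    return best, counts


if __name__ == "__main__":
    # Hypothetical fixed ratings; a real reward would be noisy.
    ratings = {"dramatic": 0.9, "comedic": 0.5, "contemplative": 0.2}
    print(ucb1_choose_direction(list(ratings), ratings.__getitem__))
```

Once a direction wins, the shot-level search runs inside it. The bandit never looks at individual clips, which is the division of labor described above.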

The cost math

MCTS isn't cheap. Generating 3 candidates per shot for a 6-shot sequence means 18 video generations instead of 6 — 3x the compute budget. The tree search and evaluation add LLM API costs on top. For a 12-shot sequence with 3 candidates each, you're at 36 generations minimum.

Is 3x worth it? The answer depends on your regeneration rate. If you currently assemble sequences, find 40% have at least one shot that doesn't work in context, and regenerate those shots (which cascades into regenerating subsequent shots), your effective cost is already higher than 1x per shot. MCTS pays upfront to avoid the regeneration tax.

AniMaker doesn't publish exact ROI numbers, which frustrates me. The paper demonstrates quality improvements but doesn't quantify the compute-cost-to-quality tradeoff explicitly. My estimate based on the cascade math: if your assembly failure rate exceeds ~25%, MCTS with 3 candidates per shot saves compute overall because it avoids the regeneration cascade. Below 25%, single-pass with selective regeneration is cheaper.

When to use it

MCTS makes sense for long sequences where shot interactions matter more than individual shot quality. A 12-shot narrative where character movement, framing, and color grade need to flow across cuts. A product video where the shot-to-shot rhythm determines engagement.

MCTS doesn't make sense for independent clips (social media posts, standalone shots), short sequences (2-3 shots where you can just eyeball the assembly), or real-time generation (MCTS requires evaluating multiple candidates, which takes time).

For most practitioners, the practical entry point isn't full MCTS but a lighter version: generate 2-3 candidates per shot at the positions where context dependency is highest (the first shot after a location change, the first shot of a new scene, reaction shots that must match action shots). Use a VLM to evaluate which candidate creates the best sequence, not just the best individual shot. This captures most of MCTS's benefit at a fraction of the cost.
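A minimal sketch of that lighter version, assuming shots are referred to by string IDs and that `score_in_context` is a hypothetical callable wrapping whatever VLM you use to rate a candidate against the previously chosen shot. Both names are mine, not from any library.

```python
from typing import Callable, Dict, List, Optional


def select_with_context(
    base_shots: List[str],
    extra_candidates: Dict[int, List[str]],
    score_in_context: Callable[[Optional[str], str], float],
) -> List[str]:
    """Pick one shot per position, comparing alternatives only where they exist."""
    chosen: List[str] = []
    for position, default in enumerate(base_shots):
        # Most positions have a single option; high-dependency ones have more.
        options = [default] + extra_candidates.get(position, [])
        previous = chosen[-1] if chosen else None
        # Judge each option by how well it follows the shot already chosen.
        chosen.append(max(options, key=lambda c: score_in_context(previous, c)))
    return chosen


def toy_side_match(previous: Optional[str], candidate: str) -> float:
    # Toy scorer: IDs end in "_left" or "_right" to mark where the character
    # sits in frame; a cut scores 1.0 when the screen side is preserved.
    if previous is None:
        return 1.0
    return 1.0 if previous.split("_")[-1] == candidate.split("_")[-1] else 0.0


if __name__ == "__main__":
    shots = ["a_left", "b_right", "c_left"]
    print(select_with_context(shots, {1: ["b_left"]}, toy_side_match))
```

This is a single greedy left-to-right pass, so it only looks one cut back. That is the price of skipping the tree search, and it is usually acceptable when only a few positions have alternatives.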

The deeper principle: sequence quality is not the sum of shot qualities. Evaluating shots in context — how they connect, how they flow, how they cascade — catches failures that per-shot evaluation misses. Whether you use full MCTS or a lightweight version, the shift from "is this shot good?" to "does this shot make the sequence better?" changes what you ship.
