
Camera Templates Beat Prose: The Shot Vocabulary That Stops Framing Drift

Write "medium shot, low angle, warm key light" in ten consecutive prompts and you'll get ten different framings. The model interprets "medium" differently each time. "Low angle" might mean 15 degrees or 45. "Warm" could be golden hour or tungsten. You know this already if you've tried to maintain visual consistency across a multi-shot AI video. Every shot re-rolls the dice on what your camera language means.

This isn't a model quality problem. It's a specification problem. And a cluster of recent papers converges on the same fix: stop describing shots in prose and start specifying them in parameterized templates.

The 21-template library

Mind-of-Director (arXiv 2603.14790, Nan et al., March 2026) built a camera template library with 21 categories — 12 for single-person shots, 8 for two-person, 1 for group. But calling them "templates" undersells what they actually are.

Each template is a parameterized function T_k(Θ_k) with typed slots. A two-person over-the-shoulder template doesn't just say "over the shoulder." It specifies: which character is foreground, which is background, camera distance in scene units, vertical angle, horizontal offset, lens focal length equivalent, and whether the camera is static or tracking. The LLM's job isn't to describe the shot — it's to pick a template name and fill parameter values.

This distinction matters enormously. When the model writes "over-the-shoulder medium shot," every generation re-interprets that prose into different concrete values. When the model outputs TWOSTATIC_OTS(foreground=CharA, background=CharB, distance=2.4, angle=eye_level, offset=0.3), there's exactly one interpretation. The parameter slots constrain the space of possible shots without removing creative choice — you still pick which template and which parameters.
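To make the typed-slot idea concrete, here is a minimal sketch of what one such template could look like as a Python dataclass. The class name, the Angle enum, the defaults, and the validation rules are all assumptions for illustration; the paper's actual slot types and ranges may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Angle(Enum):
    # Assumed closed set of vertical angles; not taken from the paper.
    LOW = "low"
    EYE_LEVEL = "eye_level"
    HIGH = "high"


@dataclass(frozen=True)
class TwoStaticOTS:
    """Hypothetical typed version of a two-person over-the-shoulder template."""
    foreground: str
    background: str
    distance: float          # camera distance in scene units
    angle: Angle             # vertical angle, restricted to the enum above
    offset: float            # horizontal offset in scene units
    lens_mm: int = 50        # assumed default focal length equivalent
    tracking: bool = False   # static unless stated otherwise

    def __post_init__(self):
        # Reject values that prose would have silently accepted.
        if self.foreground == self.background:
            raise ValueError("foreground and background must differ")
        if self.distance <= 0:
            raise ValueError("distance must be positive")
        if not isinstance(self.angle, Angle):
            raise TypeError("angle must be an Angle member")

    def render(self) -> str:
        # Serialize back to the call-style string used in this article.
        return (
            f"TWOSTATIC_OTS(foreground={self.foreground}, "
            f"background={self.background}, distance={self.distance}, "
            f"angle={self.angle.value}, offset={self.offset})"
        )


shot = TwoStaticOTS("CharA", "CharB", distance=2.4,
                    angle=Angle.EYE_LEVEL, offset=0.3)
print(shot.render())
```

The point of the sketch is that an invalid shot fails at construction time instead of producing a plausible but different framing.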

The ablation is consistent with this. Camera shot accuracy went from 64.4% to 79.2% in the multi-agent system, and the template library is plausibly a major contributor. The paper doesn't isolate the template effect from the multi-agent effect (a weakness I wish they'd addressed), but the architecture makes the mechanism clear: templates eliminate the ambiguity that caused the single-agent system to drift.

The learned alternative: Cinematic Language Injection

Camera Artist (arXiv 2604.09195, April 2026) took a different approach to the same problem. Instead of hand-building a template library, they fine-tuned a small LLM on professional cinematography vocabulary — what they call Cinematic Language Injection.

The idea: take a generic shot description like "two people arguing in a kitchen" and transform it into film-specific language like "medium two-shot, slight Dutch angle, 35mm wide lens for spatial tension, practicals motivating key light, handheld micro-movements for instability." The fine-tuned model acts as a translator between what a non-cinematographer would write and what a DP would specify.

CLI sits between hand-built templates and free-form prose on a spectrum of control vs flexibility. Templates give you maximum constraint — the shot is specified by discrete parameters. CLI gives you directorial vocabulary without discrete parameters — it's still prose, but prose that's been trained to be specific and consistent. Free-form prompting gives you neither.

I find CLI more interesting than the paper's own evaluation suggests. The authors frame it as one component of their pipeline, but if you squint, it's solving the fundamental problem: AI video models respond to specific cinematographic vocabulary better than generic descriptions, and most users don't speak that vocabulary. A specialist translator bridges the gap.

The practical question is whether a fine-tuned CLI model generalizes across scene types. The paper shows it works for their test set, but doesn't probe edge cases — does it handle underwater shots? Zero-gravity? Non-narrative contexts? My guess is it's brittle at the boundaries, which is where templates have the advantage: if you've defined a template for the shot type, it works by construction.

RAG over film corpora: the data-driven path

FilMaster (arXiv 2506.18899, Huang et al., KwaiVGI/Kuaishou, June 2025) went further than either approach. Instead of hand-building templates or training a translation model, they built a RAG system over 440,000 real film clips. When the pipeline needs to decide how to shoot a scene, it retrieves clips from actual films that handled similar scenes — and uses those as camera language guidance.

This is a fundamentally different philosophy. Templates encode a cinematographer's knowledge as rules. CLI encodes it as a learned transformation. RAG encodes it as examples from the entire history of cinema. "How did films shoot an argument in a kitchen? Here are 47 examples from the corpus." The LLM then synthesizes camera language from those examples.

The coverage advantage is obvious — 440K clips contain more cinematic knowledge than any hand-built template library or fine-tuning dataset. The control disadvantage is also obvious — retrieval doesn't guarantee consistency. Two adjacent shots might retrieve examples from films with incompatible visual styles. FilMaster mitigates this with what they call "multi-shot synergized RAG" — retrieving examples that are consistent with previous shots in the sequence, not just the current shot in isolation — but it's still less deterministic than template parameters.
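One way to picture retrieval that accounts for earlier shots is a re-ranking step that mixes relevance to the current scene with similarity to the styles already chosen. The sketch below is not FilMaster's algorithm. The two-dimensional vectors, the clip ids, the retrieve function, and the style_weight value are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, clips, previous_styles, k=1, style_weight=0.5):
    """Rank clips by scene relevance plus consistency with earlier shots."""
    def score(clip):
        relevance = cosine(query_vec, clip["scene"])
        if previous_styles:
            consistency = sum(
                cosine(clip["style"], s) for s in previous_styles
            ) / len(previous_styles)
        else:
            consistency = 0.0  # first shot: nothing to stay consistent with
        return relevance + style_weight * consistency

    return sorted(clips, key=score, reverse=True)[:k]


# Toy corpus: two kitchen arguments shot in incompatible styles.
clips = [
    {"id": "sitcom_kitchen", "scene": [1.0, 0.0], "style": [0.0, 1.0]},
    {"id": "noir_kitchen", "scene": [0.9, 0.4], "style": [1.0, 0.0]},
]
query = [1.0, 0.0]

first = retrieve(query, clips, previous_styles=[])
later = retrieve(query, clips, previous_styles=[[1.0, 0.0]])
print(first[0]["id"], later[0]["id"])
```

In isolation the sitcom clip is the closer match, but once a noir-styled shot is already in the sequence the noir clip ranks first. The ranking is still a soft preference, which is why this path is less deterministic than template parameters.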

ShotVerse: fixing the coordinate problem

Even with good shot specifications, there's a subtler consistency problem. Each shot in a multi-shot sequence exists in its own camera space. "Camera 2 meters from the subject" means something different in shot 1 vs shot 5 if the subjects have moved or the environment has shifted.

ShotVerse (arXiv 2603.11421, Yang et al., March 2026) attacks this directly. They built an automated calibration pipeline that aligns disjoint single-shot camera trajectories into a unified global coordinate system. Their "Plan-then-Control" framework separates what the camera should do (VLM Planner) from how to execute it (Controller with camera adapter).

The insight: camera consistency isn't just about specifying similar parameters — it's about those parameters meaning the same thing across shots. A "medium shot at 2.4m" should produce the same framing regardless of which shot in the sequence it appears in. Without coordinate alignment, it won't.
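A small 2D sketch shows why identical local parameters can still produce different framings. It assumes each shot's local frame is already known as an origin plus a yaw rotation relative to a shared global frame. Estimating those transforms is what a calibration pipeline would do, and that part is not shown here. The function names and all numbers are hypothetical.

```python
import math


def to_global(local_xy, shot_origin, shot_yaw_deg):
    """Map a point from a shot's local frame into the shared global frame."""
    x, y = local_xy
    t = math.radians(shot_yaw_deg)
    gx = shot_origin[0] + x * math.cos(t) - y * math.sin(t)
    gy = shot_origin[1] + x * math.sin(t) + y * math.cos(t)
    return (gx, gy)


def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# The subject stays put in the global frame.
subject = (0.0, 0.0)

# Both shots specify the same camera position in their own local frame.
camera_local = (0.0, -2.4)

# Shot 1's frame coincides with the global frame; shot 5's frame has
# shifted 3 units along x and rotated 90 degrees (assumed values).
cam_shot1 = to_global(camera_local, shot_origin=(0.0, 0.0), shot_yaw_deg=0.0)
cam_shot5 = to_global(camera_local, shot_origin=(3.0, 0.0), shot_yaw_deg=90.0)

print(round(distance(cam_shot1, subject), 2))
print(round(distance(cam_shot5, subject), 2))
```

The same local specification ends up at different distances from the subject once both cameras are expressed in one frame, so the two "2.4" values would not frame the subject the same way.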

What this looks like in practice

Say you're generating a five-shot dialogue scene. Here's the difference between prose, templates, and CLI:

With prose: you write "medium shot of Alice talking" → "close-up of Bob reacting" → "wide shot of both" → "over-the-shoulder from Alice's POV" → "medium shot of Bob responding." Each prompt is re-interpreted independently. The medium shots probably don't match. The wide shot probably doesn't preserve the spatial relationship established in the first frame. The OTS might flip the axis.

With templates: you write SINGLE_STATIC_MED(subject=Alice, distance=2.0, angle=eye, lens=50mm) → SINGLE_STATIC_CU(subject=Bob, distance=1.2, angle=slight_low, lens=85mm) → TWOSTATIC_WIDE(left=Alice, right=Bob, distance=4.0, angle=eye, lens=24mm) → TWOSTATIC_OTS(foreground=Alice, background=Bob, distance=2.4, angle=eye, offset=0.3, lens=50mm) → SINGLE_STATIC_MED(subject=Bob, distance=2.0, angle=eye, lens=50mm). The parameters lock each shot. The last shot matches the first shot's framing because the distance, angle, and lens values are explicitly the same.
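Because the shots are data rather than prose, matching framings can be checked mechanically. The sketch below takes only the single-person shots from the sequence above and flags any reuse of a template whose framing values differ. The framing_drift helper and the choice of which keys count as framing are assumptions for illustration.

```python
# Keys assumed to define framing; the subject is deliberately excluded.
FRAMING_KEYS = ("distance", "angle", "lens")

shots = [
    ("SINGLE_STATIC_MED",
     {"subject": "Alice", "distance": 2.0, "angle": "eye", "lens": "50mm"}),
    ("SINGLE_STATIC_CU",
     {"subject": "Bob", "distance": 1.2, "angle": "slight_low", "lens": "85mm"}),
    ("SINGLE_STATIC_MED",
     {"subject": "Bob", "distance": 2.0, "angle": "eye", "lens": "50mm"}),
]


def framing(params):
    """Extract just the values that determine how the shot is framed."""
    return tuple(params.get(key) for key in FRAMING_KEYS)


def framing_drift(sequence):
    """Return index pairs where a template is reused with different framing."""
    first_seen = {}
    drift = []
    for i, (name, params) in enumerate(sequence):
        if name in first_seen:
            j = first_seen[name]
            if framing(sequence[j][1]) != framing(params):
                drift.append((j, i))
        else:
            first_seen[name] = i
    return drift


print(framing_drift(shots))  # the two medium shots share framing values

# Nudge the last medium shot's distance to simulate a drifted spec.
drifted = shots[:2] + [
    ("SINGLE_STATIC_MED",
     {"subject": "Bob", "distance": 2.6, "angle": "eye", "lens": "50mm"}),
]
print(framing_drift(drifted))
```

No equivalent check exists for the prose version, since "medium shot" carries no values to compare.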

With CLI: the fine-tuned model transforms your prose into specific cinematographic language. You still write natural descriptions, but the output is "50mm at eye level, character fills the middle third, shallow focus isolating subject from practicals in background." More specific than raw prose, less constrained than templates.

Which to use

If you're building a pipeline that runs autonomously, with no human checking each frame, templates give you the most deterministic output. The upfront cost is defining your template vocabulary (Mind-of-Director's 21 categories are a good starting set), but once it's defined, the shot specification itself can't drift.

If you're a filmmaker using AI tools interactively — generating shots and selecting the good ones — CLI gives you the vocabulary upgrade without the rigidity. You write what you mean, the translator makes it specific.

If you have access to a large film corpus and the engineering capacity to build a retrieval system — RAG gives you the widest range of cinematic references. But it's the most expensive to build and the hardest to make consistent.

The research also points to one more paper, at the far end of this spectrum, that's easy to miss: "Can video generation replace cinematographers?" (arXiv 2412.12223) built a full cinematic language dataset of shot framing, angles, and camera movements. They use it to fine-tune T2V models directly on cinematic patterns. If you're training or fine-tuning your own model, this dataset is the starting point for baking cinematic vocabulary into the model itself, rather than specifying it at prompt time.

The through-line across all of these: the problem isn't that AI video models can't produce good shots. It's that natural language is an ambiguous interface for specifying what you want. Every approach that reduces ambiguity — templates, learned translators, retrieval, fine-tuning — improves consistency. The cheapest entry point is a small template vocabulary in your YAML configs. Start there.
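As a starting point, a small vocabulary plus a strict parser is enough to reject anything that isn't a well-formed template call. The sketch below uses a plain Python dict shaped like what a config file might load into. The VOCAB contents, the parse_call helper, and the call syntax it accepts are assumptions modeled on the examples in this article, not an existing library.

```python
import re

# Hypothetical vocabulary: template name -> slot name -> converter.
VOCAB = {
    "SINGLE_STATIC_MED": {
        "subject": str, "distance": float, "angle": str, "lens": str,
    },
    "TWOSTATIC_OTS": {
        "foreground": str, "background": str, "distance": float,
        "angle": str, "offset": float,
    },
}

CALL_RE = re.compile(r"^\s*([A-Z_]+)\((.*)\)\s*$")


def parse_call(text):
    """Parse 'NAME(key=value, ...)' and validate it against VOCAB."""
    match = CALL_RE.match(text)
    if not match:
        raise ValueError(f"not a template call: {text!r}")
    name, arg_text = match.groups()
    if name not in VOCAB:
        raise ValueError(f"unknown template: {name}")
    slots = VOCAB[name]
    params = {}
    for part in filter(None, (p.strip() for p in arg_text.split(","))):
        key, sep, raw = part.partition("=")
        key = key.strip()
        if not sep or key not in slots:
            raise ValueError(f"unexpected slot {key!r} for {name}")
        # The converter raises ValueError on bad input, e.g. float("far").
        params[key] = slots[key](raw.strip())
    missing = set(slots) - set(params)
    if missing:
        raise ValueError(f"missing slots for {name}: {sorted(missing)}")
    return name, params


name, params = parse_call(
    "TWOSTATIC_OTS(foreground=CharA, background=CharB, "
    "distance=2.4, angle=eye_level, offset=0.3)"
)
print(name, params["distance"], params["offset"])
```

Free-form prose such as "over-the-shoulder medium shot" fails the first check and raises ValueError, so a pipeline can ask the model to try again instead of passing ambiguity downstream.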
