Scene-Level Style Lock: Why Your Establishing Shot Should Anchor Every Frame

Background consistency improves by 21.6% when you explicitly plan for it. Character consistency improves by 9.6%. Props by 7.6%. Those numbers are from CANVAS (arXiv 2604.13452, Mondal et al., April 2026), comparing the same generation models with and without explicit continuity planning.

The background number is the surprise. Most AI filmmakers fixate on character consistency, understandably, since a character changing face between shots is the most visible failure. But the background gain is more than double the character gain, which suggests backgrounds are where the most recoverable drift lives, and the fix is simpler. Your character's appearance is a complex, high-dimensional signal. Your background is largely static within a scene. Lock it once, reuse it everywhere.

The establishing shot as spatial anchor

Mind-of-Director (arXiv 2603.14790) generates what its authors call a 2D guidance image from each scene's description. This image encodes the spatial layout: where the furniture is, where the windows are, what the walls look like, how the objects are arranged. Every subsequent generation in that scene uses this guidance image as a spatial prior.

In their pipeline, this is fed into a Unity scene builder. In yours, it's simpler: your first wide shot — the establishing shot — becomes the style reference for every subsequent shot in that location. You already generate it. You already look at it. The missing step is feeding it back as a conditioning signal for the remaining shots.

When you generate a close-up of a character at a bar, the model reinvents the bar each time. The bottles are different. The lighting shifts. The wood grain changes. But if you pass the establishing wide shot of the bar as a style reference alongside the close-up prompt, the model anchors its interpretation of "bar" to the specific bar you already established. Same bottles. Same lighting. Same grain.

This is the scene-level equivalent of character reference conditioning. You already use reference images for faces. Use reference images for places.

InfinityStory's background pool

InfinityStory (arXiv 2603.03646) formalizes this with a reusable background asset pool. Their agentic story planning creates location-based background prompts — each location gets its own persistent background specification. When the story returns to a location (the apartment, the office, the street corner), the system retrieves the background from the pool rather than regenerating it.

The pipeline separates background generation from character generation. Backgrounds are generated once per location and cached. Characters are composited into the cached backgrounds. This separation prevents the generation model from re-interpreting the background every time it draws a character in it.
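The generate-once, reuse-on-return behavior can be sketched in a few lines. This is a minimal illustration of the caching idea only, not InfinityStory's actual implementation: `BackgroundPool`, `fake_generate`, and the file paths are hypothetical names, and the generator is a stand-in for whatever image model you call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class BackgroundPool:
    """Hypothetical cache: one background asset per location."""
    generate: Callable[[str], str]  # maps a location prompt to an asset path
    _cache: Dict[str, str] = field(default_factory=dict, init=False, repr=False)

    def get(self, location_id: str, prompt: str) -> str:
        # Generate only on the first visit; later visits reuse the cached asset.
        if location_id not in self._cache:
            self._cache[location_id] = self.generate(prompt)
        return self._cache[location_id]


calls: List[str] = []


def fake_generate(prompt: str) -> str:
    # Stand-in for a real image-generation call; records each invocation.
    calls.append(prompt)
    return f"backgrounds/{len(calls):03d}.png"


pool = BackgroundPool(generate=fake_generate)
first = pool.get("bar", "dimly lit bar, wooden countertop")
again = pool.get("bar", "dimly lit bar, wooden countertop")  # cache hit
street = pool.get("street", "rainy street corner at night")  # new location
```

The second request for the bar never reaches the generator, so the model has no opportunity to reinterpret the location.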

Their finding about multi-character injection is relevant here too. When injecting multiple reference images simultaneously, character identities blend — but backgrounds are more stable because they're lower-complexity signals. Separating background from character lets you stabilize the easy thing (background) independently from the hard thing (character identity with multiple subjects).

StoryBlender's global vs local separation

StoryBlender (arXiv 2604.03315) has the cleanest formal model for this: a continuity memory graph that explicitly separates global assets from shot-specific variables.

Global assets are things that persist across the entire story or scene: the bar's physical layout, the character's face, the color of the walls, the position of the window. Shot-specific variables are things that change per shot: the character's expression, the camera angle, the time of day (if it's shifting), which characters are present.

The graph structure prevents a subtle but common failure: the system treating a shot-specific variable as a global asset. If shot 3 has warm evening light and the system encodes "warm evening light" as a global property of the bar, then shot 4 (set the next morning) inherits the wrong lighting. The graph explicitly tags what's permanent and what's momentary.

For practical implementation, this means maintaining two data structures. A scene registry holds the permanent properties of each location (layout, decor, color palette, fixed props). A shot context holds the transient properties of the current shot (lighting, character expressions, camera position, temporary props). The generation prompt pulls from both, but only the scene registry persists across shots.
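One way to sketch the two structures, assuming plain text prompts: the class names, field names, the example character "Mara", and the prompt format below are all illustrative assumptions, not StoryBlender's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SceneEntry:
    """Permanent properties of a location; frozen so shots cannot mutate it."""
    layout: str
    palette: str
    fixed_props: Tuple[str, ...]


@dataclass
class ShotContext:
    """Transient properties; a fresh instance is built for every shot."""
    scene_id: str
    lighting: str
    camera: str
    expressions: Dict[str, str] = field(default_factory=dict)
    temporary_props: List[str] = field(default_factory=list)


def build_prompt(registry: Dict[str, SceneEntry], shot: ShotContext) -> str:
    # Pull permanent properties from the registry, transient ones from the shot.
    scene = registry[shot.scene_id]
    parts = [
        scene.layout,
        f"palette: {scene.palette}",
        "props: " + ", ".join(list(scene.fixed_props) + shot.temporary_props),
        f"lighting: {shot.lighting}",
        f"camera: {shot.camera}",
    ]
    parts += [f"{name} looks {mood}" for name, mood in shot.expressions.items()]
    return "; ".join(parts)


registry = {
    "bar": SceneEntry(
        layout="long bar along the left wall, window at the back",
        palette="amber and dark walnut",
        fixed_props=("bottle shelf", "pendant lamps"),
    )
}

shot3 = ShotContext("bar", lighting="warm evening light", camera="close-up",
                    expressions={"Mara": "tired"})
shot4 = ShotContext("bar", lighting="cool morning light", camera="wide")

prompt3 = build_prompt(registry, shot3)
prompt4 = build_prompt(registry, shot4)
```

Because lighting lives only in `ShotContext`, shot 4 cannot inherit shot 3's evening light, while both prompts share the same layout, palette, and fixed props.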

How to implement this today

You don't need a graph database or a formal memory system. The minimum viable version:

For each scene in your story, generate the establishing wide shot first. This becomes your scene anchor. Save it.

For every subsequent shot in that scene, include the scene anchor as a style reference or image prompt alongside your character references and shot-specific prompt. The specifics depend on your generation tool — most support some form of image-to-image conditioning, style reference, or ControlNet-style guidance.

Keep a simple lookup: scene ID → anchor image path. When your pipeline generates shot 7 and shot 7 is set in the bar, it looks up the bar's anchor image and includes it in the generation call.

If the scene changes locations (the characters walk from the bar to the street), generate a new establishing shot for the street and add it to the registry. The bar's anchor persists in case the story returns there later.
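The steps above can be sketched as a small lookup plus a request builder. Treat this as a sketch under stated assumptions: `AnchorRegistry`, `build_request`, the file paths, and the `reference_images` key are hypothetical, and you would map the resulting dictionary onto whatever reference or conditioning parameter your generation tool actually exposes.

```python
from typing import Dict, List, Optional


class AnchorRegistry:
    """Hypothetical lookup: scene ID -> path of the establishing shot."""

    def __init__(self) -> None:
        self._anchors: Dict[str, str] = {}

    def register(self, scene_id: str, image_path: str) -> None:
        # setdefault keeps the first anchor saved for a scene, so a later
        # shot cannot silently replace the establishing shot.
        self._anchors.setdefault(scene_id, image_path)

    def lookup(self, scene_id: str) -> Optional[str]:
        return self._anchors.get(scene_id)


def build_request(anchors: AnchorRegistry, scene_id: str, shot_prompt: str,
                  character_refs: List[str]) -> dict:
    # Assemble a tool-agnostic request; the key names are assumptions.
    anchor = anchors.lookup(scene_id)
    if anchor is None:
        raise KeyError(f"no establishing shot registered for scene {scene_id!r}")
    return {
        "prompt": shot_prompt,
        # Scene anchor first, then the character references.
        "reference_images": [anchor, *character_refs],
    }


anchors = AnchorRegistry()
anchors.register("bar", "anchors/bar_wide.png")
anchors.register("street", "anchors/street_wide.png")  # new location, bar persists

request7 = build_request(
    anchors,
    scene_id="bar",
    shot_prompt="close-up, character leaning on the counter",
    character_refs=["refs/character_a.png"],
)
```

Raising on a missing anchor is a deliberate choice here: a shot generated without its scene reference is exactly the silent drift this setup is meant to prevent.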

That's it. One extra image reference per generation call. The cost is near-zero — you're already generating and conditioning on character references. Adding a scene reference is the same mechanism applied to place instead of person.

Why this works better than re-describing

The alternative — writing detailed scene descriptions in every prompt — fails for the same reason prose camera language fails (see article 2). "A dimly lit bar with wooden countertop, bottles behind the bartender, warm ambient lighting from overhead pendants" gets reinterpreted every generation. Each run samples a slightly different bar. The bottles move. The wood tone shifts. The pendants change shape.

An image reference removes the ambiguity. The model doesn't interpret "wooden countertop" — it matches the specific wooden countertop in the reference image. The anchoring is visual, not semantic, which is why it's more reliable.

CANVAS's 21.6% background improvement comes precisely from this shift: from semantic specification (describe the scene in words each time) to visual anchoring (show the scene once, reference it always). The technique is obvious once you see it. The fact that the same planning step improves backgrounds more than twice as much as it improves characters (21.6% vs 9.6%) should change how you prioritize your consistency efforts.

Fix the background first. It's easier, cheaper, and higher-impact. Then tackle the harder problem of character identity with the more expensive techniques from article 3.
