The Identity Drift Problem: 7 Architectures for Keeping Characters Consistent

Ask 100 AI filmmakers what's broken and they'll tell you the same thing. A CVPR 2025 workshop survey (arXiv 2504.08296, Zhang et al.) did exactly that. Character movement consistency ranked first. Camera control second. Overall character consistency third. Not generation quality. Not resolution. Not speed. Consistency.

The character in shot 1 has brown hair. In shot 3 it's auburn. By shot 7 she's wearing a different jacket. The background shifts. Props appear and vanish. You know the failure mode — it's why most AI films feel like fever dreams rather than stories.

Seven architectures from recent papers attack this problem at different layers of the stack. They range from "change nothing about your setup, just be more careful" to "retrain the model with a custom loss function." Here's what each does, what it costs, and when it works.

1. Same prompt, same LoRA — the baseline everyone starts with

You already do this. Use the same character description in every prompt. Maybe load the same LoRA for the character's face. Copy-paste the scene description.

It's better than nothing and worse than everything else on this list. The failure mode is uncorrelated variation: each generation samples independently from the model's distribution, so nothing ties shot 10 to shot 1 beyond the shared prompt and LoRA. No single deviation has to be large. Across ten shots, the odds that hair color, a jacket, or a prop disagrees somewhere become high, even though every prompt is identical. The model doesn't remember what it generated before. It just re-rolls.

InfinityStory (arXiv 2603.03646) documented a specific failure here: multi-character injection causes identity changes. When they injected two or more reference images simultaneously, characters started blending. Switching from Qwen Image Edit to OmniGen2 helped — the model architecture matters even at the reference conditioning level.

2. Reference image conditioning — the standard approach

Most commercial tools support this now. Upload a reference image and the model conditions on it. Kling's Elements feature, Runway's Director Mode, and Seedance 2.0's character reference all do some version of this.

It works for the conditioned attributes (face, broad body shape) and fails for everything else (clothing details, posture, spatial relationship to environment, props in hands). The reference image anchors identity but doesn't anchor context. Your character looks like herself but she's in a different outfit standing in a different place holding a different object.

Still, reference conditioning is the minimum viable consistency mechanism. If you're not doing at least this, nothing else on the list will help.

3. Recursive shot conditioning — cheapest agent-free option

Camera Artist's Recursive Shot Generation (arXiv 2604.09195) conditions each shot's planning on the full context of the preceding shot. Not just the character reference — the entire previous shot's description, output, and camera setup become context for shot N+1.

This is the cheapest continuity mechanism because it adds zero extra agents or API calls. You're just extending the context. Shot 5 knows what shot 4 looked like, which knows what shot 3 looked like, and so on. Per-step drift stays small because each shot is anchored to its predecessor rather than sampled independently from the base distribution.

The limitation is obvious: context windows have finite length, and the conditioning becomes diluted as the chain grows. Shot 20's connection to shot 1 passes through 19 intermediaries. Long sequences still drift. But for 5-10 shot scenes — which covers most practical use cases — recursive conditioning buys you meaningful consistency at near-zero cost.
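Here's a minimal sketch of the recursive idea. The `ShotRecord` structure and the `plan_shot` callable are hypothetical stand-ins for whatever planner or model you use; neither name comes from the Camera Artist paper. The only point is the data flow: each shot's context is assembled from the preceding shot's description, output, and camera setup.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ShotRecord:
    """Hypothetical record of one generated shot."""
    description: str     # what the shot was asked to show
    output_summary: str  # notes or a caption describing what was generated
    camera: str          # camera setup used for the shot


def build_context(character_ref: str, previous: Optional[ShotRecord]) -> str:
    """Assemble the conditioning text for the next shot from its predecessor."""
    parts = [f"CHARACTER: {character_ref}"]
    if previous is not None:
        parts.append(f"PREVIOUS SHOT: {previous.description}")
        parts.append(f"PREVIOUS RESULT: {previous.output_summary}")
        parts.append(f"PREVIOUS CAMERA: {previous.camera}")
    return "\n".join(parts)


def generate_sequence(
    character_ref: str,
    beats: List[str],
    plan_shot: Callable[[str, str], ShotRecord],
) -> List[ShotRecord]:
    """Generate shots in order; each one is conditioned on the shot before it."""
    shots: List[ShotRecord] = []
    previous: Optional[ShotRecord] = None
    for beat in beats:
        context = build_context(character_ref, previous)
        previous = plan_shot(context, beat)  # assumed to call your planner/model
        shots.append(previous)
    return shots
```

Note that nothing here adds an agent or a service. The only extra work is carrying one record forward per shot, which is also why the link back to shot 1 gets weaker as the chain grows.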

4. Dynamic Memory Bank — the entity registry

VideoMemory (arXiv 2601.03655, Du et al., January 2026) built what is essentially a database for your story's visual entities. The Dynamic Memory Bank stores explicit visual and semantic descriptors for every character, prop, and background in the narrative. Before generating each shot, the system retrieves the relevant entity descriptors and conditions the generation on them. After generating, it updates the memory to reflect any story-driven changes.

That update step is what separates this from a static reference bank. If a character puts on a hat in shot 5, the memory bank records "character A now wearing hat" and all subsequent shots get that updated descriptor. The memory changes when the story changes, but doesn't drift randomly.

VideoMemory's benchmark — 54 cases covering character-persistent, prop-persistent, and background-persistent scenarios — shows the approach works. The retrieval-update mechanism enables consistent portrayal across distant shots, which is exactly where simpler approaches fail.

Implementation-wise, you're maintaining a structured data store (character ID → visual descriptor, semantic descriptor, last-seen frame) that gets queried and updated per shot. The cost is one retrieval + one update per shot, plus whatever overhead the structured storage adds. Not free, but far cheaper than regenerating failed shots.
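A rough sketch of that data store, under loud assumptions: `EntityRecord`, `MemoryBank`, `retrieve`, and `update` are illustrative names, not VideoMemory's code, and retrieval here is a plain ID lookup where a real system might retrieve by embedding similarity.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class EntityRecord:
    entity_id: str
    visual: str               # visual descriptor, e.g. "auburn hair, green jacket"
    semantic: str             # semantic descriptor, e.g. "lead detective"
    last_seen_shot: int = -1  # -1 means not yet seen in any shot


class MemoryBank:
    """Hypothetical entity registry: one retrieval and one update per shot."""

    def __init__(self) -> None:
        self._records: Dict[str, EntityRecord] = {}

    def register(self, record: EntityRecord) -> None:
        self._records[record.entity_id] = record

    def retrieve(self, entity_ids: Iterable[str]) -> List[EntityRecord]:
        """Return current descriptors for the entities appearing in the next shot."""
        return [self._records[e] for e in entity_ids if e in self._records]

    def update(self, entity_id: str, shot_index: int,
               visual: Optional[str] = None) -> None:
        """Record that the entity appeared, plus any story-driven visual change."""
        record = self._records[entity_id]
        record.last_seen_shot = shot_index
        if visual is not None:
            record.visual = visual


bank = MemoryBank()
bank.register(EntityRecord("char_a", "auburn hair, green jacket", "lead detective"))
# Shot 5: the character puts on a hat, so the descriptor changes on purpose.
bank.update("char_a", shot_index=5, visual="auburn hair, green jacket, grey hat")
```

Every shot after 5 that calls `retrieve(["char_a"])` gets the hat. The descriptor only changes when `update` is called with a new value, which is the "changes with the story, not at random" property.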

5. Continuity memory graph — global vs local separation

StoryBlender (arXiv 2604.03315, April 2026) takes the memory concept further with a formal graph structure. The continuity memory graph separates global assets (things that persist across the entire story — character identity, recurring locations, permanent props) from shot-specific variables (pose, expression, lighting conditions in this particular moment).

This separation matters because it prevents a subtle failure mode: the system treating a character's transient expression as part of their permanent identity. If the character frowns in shot 3, a flat memory bank might encode "frowning" as part of the character's identity and propagate it to shot 4 where she should be smiling. The graph structure explicitly tags what's permanent and what's momentary.
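To make the separation concrete, here's a small sketch. The class names (`GlobalAsset`, `ShotState`, `ContinuityGraph`) are invented for illustration and are not StoryBlender's data model. Permanent identity lives on the asset node, and transient state lives on the edge between a shot and an asset, so it can't leak into the next shot.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class GlobalAsset:
    """Persists across the whole story; frozen so shots cannot mutate it."""
    asset_id: str
    identity: str


@dataclass
class ShotState:
    """Transient, per-shot variables with neutral defaults."""
    pose: str = "neutral"
    expression: str = "neutral"
    lighting: str = "unspecified"


@dataclass
class ContinuityGraph:
    assets: Dict[str, GlobalAsset] = field(default_factory=dict)
    # Edge (shot index, asset id) -> state that applies to that shot only.
    shot_states: Dict[Tuple[int, str], ShotState] = field(default_factory=dict)

    def add_asset(self, asset: GlobalAsset) -> None:
        self.assets[asset.asset_id] = asset

    def set_state(self, shot: int, asset_id: str, state: ShotState) -> None:
        self.shot_states[(shot, asset_id)] = state

    def describe(self, shot: int, asset_id: str) -> str:
        """Combine permanent identity with this shot's state, never another shot's."""
        asset = self.assets[asset_id]
        state = self.shot_states.get((shot, asset_id), ShotState())
        return (f"{asset.identity}; pose={state.pose}; "
                f"expression={state.expression}; lighting={state.lighting}")
```

With a frown stored on shot 3, `describe(4, ...)` falls back to the neutral defaults instead of inheriting it. In a flat memory bank, that fallback is exactly the thing you'd have to remember to do by hand.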

StoryBlender also instantiates entities in a unified coordinate space — every character and prop exists in one consistent spatial world, even though they're rendered separately per shot. This is the 3D equivalent of ShotVerse's camera coordinate alignment: spatial consistency requires a shared reference frame.

The system includes an engine-verified feedback loop that catches spatial hallucinations — like a character standing inside a table — and iteratively self-corrects. It's heavier than VideoMemory's flat registry, but it handles spatially complex scenes where relative positions matter.
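A toy version of that check-and-correct loop is below. It uses axis-aligned boxes in one shared world frame and does not call any real engine, and the nudge-along-x correction is an assumption made for illustration, not how StoryBlender resolves conflicts. It only shows the shape of the loop: verify, find the violation, adjust, verify again.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Box:
    """Axis-aligned box in the shared world frame: min corner plus size."""
    x: float
    y: float
    z: float
    w: float
    h: float
    d: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes intersect on all three axes."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h
            and a.z < b.z + b.d and b.z < a.z + a.d)


def find_collisions(scene: Dict[str, Box]) -> List[Tuple[str, str]]:
    """Return every pair of entity names whose boxes intersect."""
    names = sorted(scene)
    return [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
            if overlaps(scene[p], scene[q])]


def self_correct(scene: Dict[str, Box], movable: Set[str],
                 step: float = 0.25, max_rounds: int = 50) -> bool:
    """Nudge movable entities along +x until nothing overlaps or rounds run out."""
    for _ in range(max_rounds):
        collisions = find_collisions(scene)
        if not collisions:
            return True
        for p, q in collisions:
            target = p if p in movable else q if q in movable else None
            if target is not None:
                scene[target].x += step
    return False


scene = {
    "character": Box(0.0, 0.0, 0.0, 0.5, 1.8, 0.5),  # starts inside the table
    "table": Box(0.2, 0.0, 0.0, 1.0, 0.8, 1.0),
}
resolved = self_correct(scene, movable={"character"})
```

The loop gives up after `max_rounds` and returns `False` rather than spinning forever, which matters when neither colliding entity is allowed to move.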

6. Role-Attention Consistency Loss — training-time constraint

DreamShot (arXiv 2604.17195, April 2026) operates at a different layer entirely. Instead of adding memory or conditioning at inference time, they modify the training objective. Their Role-Attention Consistency Loss explicitly constrains the attention mechanism to align reference character images with their appearances in generated frames.

A multi-reference role conditioning module accepts multiple character reference images and enforces identity alignment via the loss function during fine-tuning. The model learns to produce consistent characters structurally, not just through prompting.
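The paper's exact loss isn't reproduced here. As a hedged illustration of what "constraining attention to align references with appearances" can mean, the sketch below penalizes attention mass that falls outside a character's own region. The function name, the input layout, and the formula are all assumptions, written in plain Python instead of a training framework.

```python
from typing import Sequence


def attention_alignment_loss(
    attn: Sequence[Sequence[float]],
    masks: Sequence[Sequence[int]],
) -> float:
    """Illustrative attention-alignment penalty (not DreamShot's formula).

    attn[r][p]  : attention weight from reference role r to frame position p,
                  with each row assumed to sum to 1.
    masks[r][p] : 1 where role r actually appears in the frame, else 0.
    Returns the mean attention mass that lands outside each role's region.
    """
    per_role = []
    for row, mask in zip(attn, masks):
        inside = sum(a * m for a, m in zip(row, mask))
        per_role.append(1.0 - inside)
    return sum(per_role) / len(per_role)


# Two roles, three frame positions. Role 0 lives at position 0, role 1 at position 2.
attn = [[0.7, 0.2, 0.1],
        [0.1, 0.1, 0.8]]
masks = [[1, 0, 0],
         [0, 0, 1]]
loss = attention_alignment_loss(attn, masks)
```

A term like this would be added to the usual training objective during fine-tuning. It reaches zero only when each reference attends exclusively to its own character, which is also what keeps two characters from blending.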

The tradeoff is clear: this requires LoRA fine-tuning (or full fine-tuning if you're feeling wealthy), which means compute cost upfront and a model that's tuned for your specific characters or style. You can't swap characters at inference time the way you can with VideoMemory's descriptor lookup. But the consistency you get is baked deeper — it's in the weights, not the prompt.

StoryMem (arXiv 2512.19539, December 2025) offers a lighter version of this: a Memory-to-Video design in which keyframe memory is injected via latent concatenation with negative RoPE shifts, requiring only LoRA fine-tuning. Their semantic keyframe selection strategy decides which frames are worth storing, since not every frame contributes equally to identity anchoring. Good keyframe selection means smaller memory with higher signal.
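StoryMem's actual selection criterion isn't spelled out here, so treat the following as one plausible way to get "smaller memory, higher signal" and not as their method. It assumes you already have a per-frame embedding from some encoder, and it greedily keeps a frame only when it differs enough from everything already stored. The threshold and function names are made up for the sketch.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_keyframes(
    embeddings: Sequence[Sequence[float]],
    max_keep: int = 4,
    similarity_cutoff: float = 0.9,
) -> List[int]:
    """Greedy novelty filter: keep a frame only if it is not a near-duplicate
    of a frame already kept. Returns the indices of the kept frames."""
    kept: List[int] = []
    for idx, emb in enumerate(embeddings):
        if len(kept) >= max_keep:
            break
        if all(cosine(emb, embeddings[k]) < similarity_cutoff for k in kept):
            kept.append(idx)
    return kept


frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
keyframes = select_keyframes(frames)
```

Frame 1 is nearly identical to frame 0 and gets dropped, while the other three are distinct enough to store.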

7. Instance tracking + face swap pipeline — the brute force option

CineAGI (arXiv 2604.23579, April 2026) abandons the idea of getting the generation model to maintain consistency on its own. Instead, it decouples the problem: generate the shots, then fix the faces.

The pipeline: Grounded-SAM2 detects and segments character instances in each generated frame. SimSwap performs identity-preserving face integration — swapping in the canonical face for each character. The system tracks instances across shots to know which face goes where.
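The orchestration can be sketched without touching either library. In the code below, `detect` and `swap` are injected callables standing in for a segmentation model and a face-swap model. Their signatures are assumptions for illustration, not the Grounded-SAM2 or SimSwap APIs. Tracking is reduced to nearest-neighbor matching against a canonical embedding per character.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass
class Instance:
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) region of one character
    embedding: Sequence[float]      # appearance embedding used for matching


def nearest_identity(embedding: Sequence[float],
                     canon: Dict[str, Sequence[float]]) -> str:
    """Return the canonical character whose embedding is closest."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(canon, key=lambda name: sq_dist(embedding, canon[name]))


def fix_faces(
    frames: List[Any],
    detect: Callable[[Any], List[Instance]],
    swap: Callable[[Any, Tuple[int, int, int, int], str], Any],
    canon: Dict[str, Sequence[float]],
) -> List[Any]:
    """Post-process every frame: detect instances, match each one to a
    canonical identity, then swap that identity's face into its region."""
    fixed = []
    for frame in frames:
        for inst in detect(frame):
            who = nearest_identity(inst.embedding, canon)
            frame = swap(frame, inst.box, who)
        fixed.append(frame)
    return fixed
```

Because the generator never appears in this loop, it can be swapped out freely. The cost is visible in the structure too: one detect call and one swap per character, per frame.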

This is brute force and it works. The generation model can drift as much as it wants because the face swap corrects identity in post. The downsides: it only fixes faces (not clothing, not body shape, not props), it can produce uncanny-valley artifacts at the face boundaries, and it adds a full post-processing pass per frame.

But if your specific problem is "same character, different face across shots" — which is the most visible consistency failure — CineAGI's detect-track-swap pipeline is the most reliable fix because it doesn't depend on the generation model cooperating.

The numbers that matter

CANVAS (arXiv 2604.13452, Mondal et al., April 2026) provides the cleanest comparison against baselines that don't use explicit continuity planning. Their multi-agent framework improves background continuity by 21.6%, character consistency by 9.6%, and props consistency by 7.6% over the same generation models without continuity mechanisms.

The 21.6% background number is surprising. Most people fixate on character consistency, yet background continuity shows the largest gain of the three, which suggests background drift leaves the most room for improvement. It may also be the easiest to fix, because backgrounds change less within a scene. CANVAS's approach of planning visual continuity as a distinct pipeline stage, rather than leaving it implicit in generation, captures this.

STAGE (arXiv 2512.12372, December 2025) contributes the ConStoryBoard dataset — large-scale movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. If you're building or evaluating any of these consistency mechanisms, ConStoryBoard is the benchmark.

Choosing your architecture

If you're generating 3-5 shots and don't want to add agents or API calls: recursive conditioning (approach 3). Just extend the context.

If you're generating 5-15 shots and can add a data layer: VideoMemory-style entity registry (approach 4). The retrieval-update loop is the sweet spot of implementation cost vs consistency gain.

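If you're building spatially complex scenes with characters moving through environments: StoryBlender's continuity graph (approach 5). The shared coordinate space and engine-verified feedback loop catch spatial hallucinations, and the global/local separation keeps transient states out of permanent identity.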

If you have the compute for fine-tuning and want the deepest consistency: Role-Attention Consistency Loss (approach 6) or StoryMem's latent injection. Consistency in the weights beats consistency in the prompt.

If your specific failure mode is face inconsistency and you need it fixed now: CineAGI's face swap pipeline (approach 7). Ugly but effective.

And if you're not doing reference image conditioning (approach 2) at minimum, start there before trying anything else. The fanciest memory bank won't help if the model doesn't know what your character looks like.
