Memory Banks for Video: How AI Remembers Characters Across Scenes
AI video models have no memory. Each shot starts fresh. The model doesn't know what your character looked like in shot 1 when it generates shot 5. Every consistency mechanism is a hack to inject memory into a memoryless system.
Three papers propose fundamentally different memory architectures. They're not incremental variations — they store different things, update differently, and fail differently. Picking the right one depends on what kind of consistency you need.
Architecture 1: Semantic descriptor registry
VideoMemory (arXiv 2601.03655, Du et al., January 2026) stores explicit visual and semantic descriptors for every entity in your story. Each character gets an entry: face embedding, clothing description, body type, distinguishing features. Each prop gets an entry. Each background location gets an entry.
Before generating shot N, the system queries the registry for entities appearing in that shot and conditions the generation on their descriptors. After generating, it updates the registry if the story changed something — character put on a hat, picked up a bag, moved to a new location.
The update step is what makes this more than a static reference bank. Stories involve change. A character who starts clean-shaven and grows a beard over the story needs their descriptor updated at the right moment. VideoMemory handles this by checking whether the generated output implies a state change and recording it.
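The query-before, update-after loop can be sketched in a few lines. This is a minimal illustration, not VideoMemory's actual schema or code: the class names, field names, and the character "mara" are all hypothetical, and the state-change detection itself (deciding that a beard appeared) is assumed to happen elsewhere and is passed in as a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class EntityEntry:
    """One registry entry. Field names are illustrative assumptions."""
    name: str
    kind: str  # "character", "prop", or "background"
    descriptors: dict[str, str] = field(default_factory=dict)
    # Each history item is (shot number, descriptor key, previous value).
    history: list[tuple[int, str, str]] = field(default_factory=list)


class DescriptorRegistry:
    def __init__(self) -> None:
        self._entries: dict[str, EntityEntry] = {}

    def register(self, entry: EntityEntry) -> None:
        self._entries[entry.name] = entry

    def query(self, names: list[str]) -> dict[str, dict[str, str]]:
        """Return copies of the current descriptors for the entities in a shot."""
        return {n: dict(self._entries[n].descriptors)
                for n in names if n in self._entries}

    def update(self, shot: int, name: str, changes: dict[str, str]) -> None:
        """Record state changes detected after a shot was generated."""
        entry = self._entries[name]
        for key, new_value in changes.items():
            old_value = entry.descriptors.get(key, "")
            if old_value != new_value:
                entry.history.append((shot, key, old_value))
                entry.descriptors[key] = new_value

    def history_of(self, name: str) -> list[tuple[int, str, str]]:
        return list(self._entries[name].history)


registry = DescriptorRegistry()
registry.register(EntityEntry("mara", "character",
                              {"face": "clean-shaven", "jacket": "blue denim"}))

before = registry.query(["mara"])                    # conditioning for shot 4
registry.update(4, "mara", {"face": "short beard"})  # change detected after shot 4
after = registry.query(["mara", "unknown"])          # unregistered names are skipped
```

Keeping a history of previous values is a design choice of this sketch: it makes it possible to ask what a character looked like at an earlier shot, which matters for flashbacks or re-generating a shot out of order.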
Their 54-case benchmark covers three persistence types: character-persistent (same character across many shots), prop-persistent (same object reappearing), and background-persistent (same location revisited). The registry approach handles all three because it stores entities at the semantic level — the "what" is persistent, even as the "how it looks right now" updates.
The weakness: semantic descriptors are lossy. A text description of a character's clothing ("blue denim jacket, white t-shirt") can be re-interpreted differently by the generation model each time. The descriptor captures the concept but not the pixels. For scenes where exact visual reproduction matters — the same pattern on the jacket, the same shade of blue — semantic descriptors aren't precise enough.
Architecture 2: Latent injection
StoryMem (arXiv 2512.19539, Zhang et al., December 2025) stores actual frames, not descriptions. Their Memory-to-Video design maintains a bank of keyframes from previously generated shots and injects them into the generation model via latent concatenation with negative RoPE shifts.
The RoPE shift trick is clever. Rotary Position Embeddings (RoPE) encode each token's position, including which frame it belongs to, inside the generation model's attention. By assigning negative shifts to memory frames, StoryMem tells the model "these frames are from the past, not the immediate future": they should inform style and identity without constraining the current shot's motion or composition.
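To make the idea concrete, here is a small sketch of negative temporal indices combined with the standard rotary rotation. It is not StoryMem's implementation: the function names, the `gap` parameter, and the choice of contiguous negative indices are assumptions for illustration. What the sketch does show is a general property of RoPE: the attention score between two tokens depends only on their relative offset, so a memory frame at position -3 looks "three frames before the shot starts" to a query at position 0.

```python
import math


def temporal_positions(num_memory: int, num_current: int, gap: int = 1) -> list[int]:
    """Memory frames get negative indices; the current shot starts at 0.

    `gap` (an assumed knob) is the distance between the last memory frame
    and the first frame of the current shot.
    """
    memory = [-(gap + num_memory - 1 - i) for i in range(num_memory)]
    current = list(range(num_current))
    return memory + current


def apply_rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard RoPE)."""
    dim = len(vec)  # must be even
    out: list[float] = []
    for i in range(dim // 2):
        angle = position * base ** (-2 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


positions = temporal_positions(num_memory=3, num_current=4)

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Query at frame 0 attending to a memory frame at -3 ...
score_memory = dot(apply_rope(q, 0), apply_rope(k, -3))
# ... scores the same as any other pair that is 3 frames apart.
score_shifted = dot(apply_rope(q, 5), apply_rope(k, 2))
```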
Only LoRA fine-tuning is needed. You don't retrain the base model. You fine-tune a lightweight adapter that teaches the model to attend to memory frames during generation. This means StoryMem can be applied to existing video models without massive compute investment.
Their semantic keyframe selection strategy decides which frames go into the memory bank. Not every frame is equally informative — a clear frontal view of a character's face is better memory material than a blurry background shot. The selection filters for frames that maximize identity signal while minimizing redundancy.
The strength: latent injection preserves pixel-level detail. The model sees the actual visual appearance, not a text approximation. Same shade of blue. Same jacket pattern. Same lighting on the face. For visual consistency specifically, this is more precise than semantic descriptors.
The weakness: the memory bank has finite capacity. As the story grows, you can't store every keyframe from every shot. The selection strategy manages this by keeping only the most informative frames, but for very long sequences (dozens of shots, multiple characters with costume changes), the bank may not have room for everything.
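One way to picture selection under a capacity limit is a greedy filter: take frames in order of identity score and skip any that are too similar to one already kept. This is a sketch under stated assumptions, not StoryMem's selection algorithm. The `identity_score` and `embedding` fields are hypothetical inputs that some upstream model would have to provide, and the 0.9 similarity threshold is arbitrary.

```python
import math
from dataclasses import dataclass


@dataclass
class Keyframe:
    shot: int
    identity_score: float          # assumed: higher means a clearer view of the entity
    embedding: tuple[float, ...]   # assumed: visual embedding of the frame


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def select_keyframes(candidates: list[Keyframe], capacity: int,
                     max_similarity: float = 0.9) -> list[Keyframe]:
    """Greedily keep the highest-scoring frames that are not near-duplicates."""
    selected: list[Keyframe] = []
    for frame in sorted(candidates, key=lambda f: f.identity_score, reverse=True):
        if len(selected) >= capacity:
            break
        if all(cosine(frame.embedding, kept.embedding) < max_similarity
               for kept in selected):
            selected.append(frame)
    return selected


candidates = [
    Keyframe(1, 0.95, (1.0, 0.0)),    # clear frontal view
    Keyframe(2, 0.90, (0.99, 0.05)),  # near-duplicate of shot 1
    Keyframe(3, 0.60, (0.0, 1.0)),    # different view
    Keyframe(4, 0.20, (0.7, 0.7)),    # low score; dropped once the bank is full
]
bank = select_keyframes(candidates, capacity=2)
kept_shots = [f.shot for f in bank]
```

With a capacity of two, the near-duplicate from shot 2 is skipped even though it scores higher than shot 3, which is the redundancy trade-off the paragraph above describes.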
Architecture 3: Continuity graph
StoryBlender (arXiv 2604.03315, April 2026) doesn't just store entities — it models the relationships between them and their properties over time. The continuity memory graph is a structured representation with two types of nodes: global assets and shot-specific variables.
Global assets are properties that persist unless the story explicitly changes them. The character's identity. The bar's physical layout. The car's color. These don't drift — they're locked.
Shot-specific variables are properties that change per shot and shouldn't persist. The character's expression in this moment. The lighting conditions at this time of day. The camera angle for this particular beat. These are transient.
The graph structure prevents the most insidious consistency failure: the system confusing transient properties with permanent ones. If the character is crying in shot 7, that's shot-specific — it shouldn't propagate to shot 8 where she's composed. A flat memory bank that stores "crying" as part of the character's state will propagate it. The graph tags it as transient and drops it.
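The tagging discipline can be shown without a full graph. The sketch below is an assumption-laden simplification, not StoryBlender's data model: it keeps one dictionary of global assets and one dictionary of variables per shot, and builds the conditioning for a shot from the global assets plus only that shot's variables. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EntityState:
    """Separates persistent properties from per-shot ones (illustrative only)."""
    global_assets: dict[str, str] = field(default_factory=dict)
    shot_variables: dict[int, dict[str, str]] = field(default_factory=dict)

    def set_global(self, key: str, value: str) -> None:
        self.global_assets[key] = value

    def set_for_shot(self, shot: int, key: str, value: str) -> None:
        self.shot_variables.setdefault(shot, {})[key] = value

    def conditioning(self, shot: int) -> dict[str, str]:
        """Global assets plus this shot's variables; other shots never leak in."""
        return {**self.global_assets, **self.shot_variables.get(shot, {})}


heroine = EntityState()
heroine.set_global("identity", "heroine_v1")
heroine.set_global("hair", "short black")
heroine.set_for_shot(7, "expression", "crying")

shot7 = heroine.conditioning(7)
shot8 = heroine.conditioning(8)  # "crying" is not carried over
```

A flat store would have written "crying" into the same dictionary as "hair" and needed someone to remember to clear it. Here the default is the safe one: a transient property disappears unless it is set again.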
StoryBlender also instantiates entities in a unified coordinate space. Characters and props exist in one consistent 3D world, even though they're rendered separately per shot. This addresses spatial consistency: the character who was standing left of the table in shot 3 is still left of the table in shot 5, not because the model remembers, but because the coordinate system enforces it.
The engine-verified feedback loop catches spatial hallucinations — characters intersecting with furniture, props at impossible positions — by simulating the scene layout and checking for physics violations. Same principle as Mind-of-Director's camera validation, applied to the entire scene graph.
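The simplest version of such a check is an axis-aligned bounding box overlap test over every pair of entities in the shared coordinate space. This is a stand-in for illustration, not StoryBlender's engine verification, which is described as simulating the scene layout. The `Box` type and the example coordinates are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in a shared world coordinate space."""
    name: str
    min_corner: tuple[float, float, float]
    max_corner: tuple[float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """True when the boxes overlap on all three axes (touching faces do not count)."""
    return all(a.min_corner[i] < b.max_corner[i] and b.min_corner[i] < a.max_corner[i]
               for i in range(3))


def find_intersections(boxes: list[Box]) -> list[tuple[str, str]]:
    """Return every pair of entities whose boxes interpenetrate."""
    return [(a.name, b.name) for a, b in combinations(boxes, 2) if overlaps(a, b)]


scene = [
    Box("table", (0.0, 0.0, 0.0), (2.0, 1.0, 1.0)),
    Box("character", (1.5, 0.0, 0.5), (2.0, 1.8, 1.0)),  # placed inside the table
    Box("lamp", (3.0, 0.0, 0.0), (3.5, 2.0, 0.5)),
]
violations = find_intersections(scene)
```

In a feedback loop, a non-empty result would send the layout back for correction before any frames are rendered.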
The strength: the graph handles complex, long-form stories with multiple characters, location changes, and temporal progression better than flat registries or keyframe banks. The global/local separation is formally clean and prevents the subtle failures that plague simpler approaches.
The weakness: implementation complexity. You're maintaining a graph database with typed relationships, temporal tags, and coordinate transformations. For a 5-shot scene, it's overkill. For a 50-shot story with 8 characters across 4 locations with costume changes and time progression — that's where the graph earns its keep.
STAGE's memory pack
Worth mentioning separately: STAGE (arXiv 2512.12372, December 2025) introduces the multi-shot memory pack — a compressed representation of prior shots that gets passed to the generation model for each new shot. Combined with their dual-encoding strategy (separate encoders for intra-shot coherence and inter-shot consistency), the memory pack provides a lighter-weight alternative to full graph-based approaches.
Their ConStoryBoard dataset, with cinematic annotations and human preference data, is also useful for training and evaluating any of these memory approaches.
Choosing your architecture
Short sequences (3-8 shots), single location, few characters: VideoMemory's semantic registry. Simple to implement, handles the common case. One database table with entity descriptors. Query before generation, update after.
Medium sequences (5-15 shots), visual precision matters: StoryMem's latent injection. Use it when you need pixel-level consistency and can afford LoRA fine-tuning. The keyframe selection strategy keeps the memory bank manageable. It works best when your generation model supports conditioning on reference frames (many recent models do).
Long sequences (15+ shots), multiple locations, complex narratives: StoryBlender's continuity graph. When characters change costumes, locations recur, time passes, and you need the system to know what's permanent and what's momentary. The implementation cost is high but the alternative — manually managing consistency across dozens of shots — is higher.
Hybrid: Nothing stops you from combining approaches. Use a semantic registry for entity tracking (who exists and what they look like generally), latent injection for visual precision (the actual face from the last clear frame), and a simple global/local tag to prevent transient properties from persisting. You don't need StoryBlender's full graph — just the tagging discipline.
The common mistake is starting with the most complex architecture. Start with semantic descriptors. When they fail (and they will, for the reasons described above), add latent injection for the specific entities that need pixel precision. When the story complexity exceeds what flat storage can track, add the global/local separation. Each layer adds consistency at increasing implementation cost. Stop when the consistency is good enough for your audience.
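The layering advice can be written down as a rule of thumb. The function below is only a reading of this article's guidance, not a tested decision procedure: its name and parameters are hypothetical, and the 15-shot threshold is borrowed from the "long sequences" category above.

```python
def suggest_memory_layers(num_shots: int,
                          needs_pixel_precision: bool,
                          has_transient_state_bugs: bool) -> list[str]:
    """Start with the simplest layer and add others only when there is a reason."""
    layers = ["semantic registry"]  # always the starting point
    if needs_pixel_precision:
        layers.append("latent injection")
    # Assumed trigger: long stories, or transient properties leaking between shots.
    if num_shots >= 15 or has_transient_state_bugs:
        layers.append("global/local tagging")
    return layers


short_scene = suggest_memory_layers(5, False, False)
medium_scene = suggest_memory_layers(10, True, False)
long_story = suggest_memory_layers(40, True, False)
```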