Sequences · AI video storytelling
Generative Expansion: Starting from Footage, Not a Blank Prompt
The entire AI filmmaking conversation assumes you start with nothing. Type a prompt, get a video. Blank canvas to finished film.
Vidmento (arXiv 2601.22013, Yeh et al., Adobe Research + Harvard, CHI 2026) starts from the opposite assumption: you already have footage, just not enough of it. There are gaps, missing shots, transitions you didn't capture. AI fills those gaps while matching the style and narrative of what you already have.
This is a small but significant reframe. Most creators aren't starting from zero. They shot interviews, captured B-roll, filmed on location. The problem isn't "generate everything" — it's "generate the parts I'm missing."
How Vidmento works
The tool provides three linked views — canvas (story level), editor (scene level), timeline (shot level). You import your existing footage. The system auto-organizes shots into scenes with semantic titles, each color-coded.
Then it does something I haven't seen in other tools: it visualizes narrative gaps. Between your existing scenes, Vidmento proposes connecting scenes (transitions, establishing shots, atmospheric beats) that would link the existing material into a coherent story. For example, it might suggest "Anticipation for CHI Japan" between "Arriving at the airport" and "Conference sessions." The system identifies what's missing narratively and suggests what to generate.
Generated clips get orange outlines. Captured footage gets purple. You always know what's real and what's synthesized. This provenance tracking matters more than it sounds — when you're blending real footage with AI generation, losing track of which is which creates editing and ethical problems downstream.
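To make the provenance idea concrete, here is a minimal sketch of how a timeline could carry a captured-versus-generated flag on every clip. The Clip and Source names, the field layout, and the generated_share helper are my own illustrative assumptions, not Vidmento's data model; only the purple and orange outline colors come from the paper's description.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    CAPTURED = "captured"
    GENERATED = "generated"


# Outline colors as described for Vidmento's UI; everything else is hypothetical.
OUTLINE = {Source.CAPTURED: "purple", Source.GENERATED: "orange"}


@dataclass(frozen=True)
class Clip:
    clip_id: str
    source: Source
    duration_s: float

    @property
    def outline(self) -> str:
        # The color is derived from provenance, so it can never drift out of sync.
        return OUTLINE[self.source]


def generated_share(timeline: list[Clip]) -> float:
    """Fraction of total runtime that is synthesized footage."""
    total = sum(c.duration_s for c in timeline)
    if total == 0:
        return 0.0
    generated = sum(c.duration_s for c in timeline if c.source is Source.GENERATED)
    return generated / total


timeline = [
    Clip("airport", Source.CAPTURED, 6.0),
    Clip("anticipation", Source.GENERATED, 2.0),
    Clip("sessions", Source.CAPTURED, 8.0),
]
print(timeline[1].outline)        # orange
print(generated_share(timeline))  # 0.125
```

Storing provenance on the clip itself, rather than in a separate list, means the flag survives reordering and trimming, which is where the editing and disclosure problems tend to start.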
The generation itself is context-aware. When Vidmento generates a connecting shot, it conditions on the surrounding captured footage — matching color grade, framing style, movement pace, and visual tone. The generated clip should feel like something you could have shot with the same camera on the same day.
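One small piece of this is easy to sketch: deciding which clips count as "surrounding captured footage" for a given gap. The version below is an assumption about how you might do it, not the paper's implementation. It skips over already-generated clips so that new generations are conditioned on real footage rather than on other generations. The Clip type and captured_neighbors function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clip:
    name: str
    captured: bool  # True for real footage, False for generated


def captured_neighbors(
    timeline: list[Clip], pos: int
) -> tuple[Optional[Clip], Optional[Clip]]:
    """Nearest captured clips on each side of insert position `pos`.

    `pos` is the index the new clip would occupy, so timeline[:pos] comes
    before it and timeline[pos:] comes after it.
    """
    before = next((c for c in reversed(timeline[:pos]) if c.captured), None)
    after = next((c for c in timeline[pos:] if c.captured), None)
    return before, after


timeline = [
    Clip("airport", True),
    Clip("generated-taxi-ride", False),
    Clip("sessions", True),
]
before, after = captured_neighbors(timeline, 2)
print(before.name, after.name)  # airport sessions
```

Whatever model does the actual generation would then take style references (color grade, framing, pace) from those two clips; that part depends entirely on the generation tool you use and is left out here.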
The study finding that matters
Twelve creators used Vidmento in an exploratory study. The headline finding: creators with strongly formed visions found AI suggestions less useful. Creators with partial visions ("I know roughly what I want but don't know exactly how to fill the gaps") got the most value.
This isn't a failure. It's a design insight. AI generation isn't competing with directorial vision — it's serving creators who have vision but incomplete material. The travel vlogger who captured great location footage but missed the transition shots. The documentary maker who has interviews but needs B-roll she couldn't get. The content creator who shot the product demo but needs an opening hook.
AnimAgents (arXiv 2511.17906, November 2025) found something parallel in their formative study with 12 professional animators: existing multi-agent systems are optimized for end-to-end automation, often neglecting human involvement at intermediate stages. The professionals didn't want full automation — they wanted help at specific points in their existing workflow.
Both papers point at the same truth: the "AI replaces the filmmaker" narrative is wrong, and the tools built around that narrative serve the wrong use case. The right framing is "AI augments footage the filmmaker already has." Start from what exists, fill what's missing, keep the human in the loop at every stage.
Generative expansion as a pipeline concept
Strip Vidmento's specific implementation and you get a general pipeline pattern: existing assets → gap analysis → contextual generation → assembly.
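The four-stage pattern can be written down as a skeleton with the gap finder and the generator passed in as plain functions. This is a sketch under my own assumptions: scenes are represented as title strings, a gap is a position plus a description, and expand, demo_gaps, and demo_generate are hypothetical names with stand-in logic.

```python
from typing import Callable, Optional, Sequence

# (insert position, description of what is missing at that position)
Gap = tuple[int, str]


def expand(
    assets: Sequence[str],
    find_gaps: Callable[[Sequence[str]], list[Gap]],
    generate: Callable[[str, Optional[str], Optional[str]], str],
) -> list[str]:
    """existing assets -> gap analysis -> contextual generation -> assembly."""
    gaps = find_gaps(assets)
    assembled = list(assets)
    # Insert from the back so earlier positions stay valid.
    for pos, description in sorted(gaps, reverse=True):
        before = assets[pos - 1] if pos > 0 else None
        after = assets[pos] if pos < len(assets) else None
        assembled.insert(pos, generate(description, before, after))
    return assembled


def demo_gaps(assets: Sequence[str]) -> list[Gap]:
    # Stand-in for real gap analysis: one hard-coded gap.
    return [(1, "Anticipation for CHI Japan")]


def demo_generate(description: str, before: Optional[str], after: Optional[str]) -> str:
    # Stand-in for a real generator: returns a label instead of a video.
    return f"[generated] {description} (between {before!r} and {after!r})"


story = expand(
    ["Arriving at the airport", "Conference sessions"], demo_gaps, demo_generate
)
for scene in story:
    print(scene)
```

The generator always receives its neighbors from the original assets, never from other generated items, which keeps the existing material as the only source of context.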
This pattern applies beyond video authoring. Anchorless-style autonomous media production follows the same shape: discover existing news (the "footage"), identify missing context or angles (the "gaps"), generate supplementary content (interviews, summaries, analysis), assemble into a coherent output. The assets are different but the structure is identical.
The key technical requirement is context-aware generation. The AI must condition on what already exists (stylistically, narratively, temporally) rather than generating in isolation. Vidmento achieves this by using the surrounding footage as style and narrative context. The academic multi-shot papers achieve this through memory banks (VideoMemory, StoryMem) and recursive conditioning (Camera Artist RSG).
The gap analysis step is the least developed component. Vidmento does it through narrative structure templates — a story should have an opening, rising action, climax, resolution, and gaps are places where those structural elements are missing. A more sophisticated version would analyze not just narrative structure but visual coverage — "you have wide shots and close-ups but no medium shots," or "you have daytime scenes but the transition to night is abrupt."
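Both kinds of gap check reduce to the same operation: compare what you have against a checklist and report what's absent. Here is a toy sketch, assuming each shot has already been labeled with a narrative beat and a shot size. The labels, the Shot type, and both helper functions are hypothetical; the paper does not publish its gap-detection logic at this level, and the hard part in practice is producing the labels.

```python
from dataclasses import dataclass

# Checklists taken from the examples in the text above.
NARRATIVE_BEATS = ("opening", "rising action", "climax", "resolution")
SHOT_SIZES = ("wide", "medium", "close-up")


@dataclass(frozen=True)
class Shot:
    beat: str  # which structural element this shot serves
    size: str  # framing: wide, medium, or close-up


def missing_beats(shots: list[Shot]) -> list[str]:
    """Narrative-structure gaps: beats in the template with no shot."""
    present = {s.beat for s in shots}
    return [b for b in NARRATIVE_BEATS if b not in present]


def missing_sizes(shots: list[Shot]) -> list[str]:
    """Visual-coverage gaps: shot sizes that never appear."""
    present = {s.size for s in shots}
    return [z for z in SHOT_SIZES if z not in present]


shots = [
    Shot("opening", "wide"),
    Shot("rising action", "close-up"),
    Shot("resolution", "wide"),
]
print(missing_beats(shots))  # ['climax']
print(missing_sizes(shots))  # ['medium']
```

The same shape extends to other coverage axes, such as time of day, by adding another checklist and another field on the shot.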
What this means for content production
If you're producing content at scale — daily videos, weekly episodes, content pipelines — the generative expansion model changes your production calculus.
Instead of generating everything (expensive, inconsistent, slow), you shoot the anchors (key interviews, hero shots, critical moments) and generate the connective tissue (transitions, B-roll, establishing shots, atmospheric beats). The anchor footage provides the visual ground truth that keeps the generated content consistent.
Instead of starting each piece from a prompt, you start from last week's footage and expand. The visual style carries forward because it's grounded in real footage, not re-described in text each time.
Instead of choosing between "all human production" and "all AI generation," you find the efficient frontier: human effort on the shots that need creative judgment, AI generation on the shots that need coverage.
Vidmento is a research prototype, not a product. But the pattern it demonstrates — start from what you have, fill what's missing, keep track of what's real — is implementable today with existing generation tools and a good content management layer. The tool doesn't have to be Vidmento specifically. The paradigm is what matters.