
What 100 AI Filmmakers Actually Want: The CVPR Artist Survey Decoded

A hundred AI filmmakers walked into a survey and the researchers actually listened. The results (arXiv 2504.08296, Zhang et al., CVPR 2025 Workshop) are buried in an academic paper, which means the people who most need to read them — tool builders — probably haven't.

Here's what the artists said, and where the research papers are already building what they asked for.

The priority ranking nobody expected

Consistent character movement ranked first. Not character appearance. Movement. The distinction matters. Your character can look the same across shots but if she moves differently — walks with a different gait, gestures at a different tempo, occupies space with different physicality — the film breaks.

Camera control ranked second. Artists want to specify dolly shots, pans, rack focus, framing — the vocabulary of cinematography — and have the model execute it precisely. "Slightly low angle" isn't precise enough when you're building a visual sequence.
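To make the contrast concrete, here is a minimal sketch of what a structured camera specification could look like next to a free-text phrase. The `CameraSpec` class, its field names, and its units are illustrative assumptions, not a format taken from the survey or from any existing tool.

```python
from dataclasses import dataclass


# Hypothetical structured camera spec. Field names and units are
# illustrative assumptions, not any real tool's schema.
@dataclass(frozen=True)
class CameraSpec:
    move: str               # e.g. "dolly_in", "pan_left", "static"
    tilt_deg: float         # negative = camera below eye level, looking up
    focal_length_mm: float  # lens focal length
    duration_s: float       # length of the move in seconds

    def to_prompt(self) -> str:
        """Render the spec as explicit text that a prompt could carry."""
        return (
            f"camera: {self.move}, tilt {self.tilt_deg:+.0f} deg, "
            f"{self.focal_length_mm:.0f}mm lens, {self.duration_s:.1f}s"
        )


vague = "slightly low angle"  # what a free-text prompt typically says
precise = CameraSpec(move="dolly_in", tilt_deg=-15, focal_length_mm=35, duration_s=4)
print(precise.to_prompt())
```

The point of the sketch is only that every field is a number or a named move, so two shots can be compared or constrained field by field, which a phrase like "slightly low angle" does not allow.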

Overall character consistency ranked third. The appearance problem — same face, same clothes across shots — is the one everyone talks about, but the survey ranked it below movement and camera. I think this is because practitioners have workarounds for appearance (reference images, LoRAs) but nothing for movement or camera.

This ranking should reorder how the field invests. Most papers focus on appearance consistency (VideoMemory, CANVAS, StoryBlender, DreamShot — see article 3). Far fewer address movement consistency or precise camera control. The artists are telling us the research emphasis is pointed at the third-most-important problem.

MIT AI Film Hack adoption data

The survey tracked tool adoption across the MIT AI Film Hack from 2023 to 2025. Some patterns:

Nearly all participants used AI image or video generation. By 2025 it's table stakes, not a choice.

3D generation tool usage jumped from 0% in 2023 to 23.7% in 2025. The organizers added a dedicated 3D track in 2024, which probably catalyzed this. But the growth rate — zero to a quarter in two years — signals a real shift. 3D generation solves spatial consistency problems that 2D generation can't, because 3D models exist in a consistent coordinate space. This aligns with ShotVerse's camera calibration work (arXiv 2603.11421) and StoryBlender's unified coordinate space (arXiv 2604.03315).

Most films retained a cartoonish style despite improving realism in the tools. The survey explanation: inconsistencies are less noticeable in stylized content than in photorealistic content. If your character's face drifts slightly in a cartoon, viewers accept it. In photorealistic video, the same drift triggers the uncanny valley. Artists are self-selecting into styles that hide the technology's weaknesses.

Artists use multiple generation tools per film, and the number is increasing: one model for landscapes, another for characters, a third for action, because different models have different strengths. This multi-model workflow fits the agent-orchestrated paradigm (article 1, article 5), where a planning layer dispatches each shot to the model best suited to it.
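A planning layer of this kind can be sketched as a simple routing table. Everything below is an assumption for illustration: the `Shot` class, the shot kinds, and the model names are placeholders, not the design of any system cited in this series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    shot_id: int
    kind: str         # e.g. "landscape", "character", "action"
    description: str


# Hypothetical routing table: shot kind -> model name (placeholders only).
ROUTES = {
    "landscape": "landscape_model",
    "character": "character_model",
    "action": "action_model",
}
FALLBACK_MODEL = "generalist_model"


def plan(shots):
    """Return (shot_id, model_name) pairs; unknown kinds use the fallback."""
    return [(shot.shot_id, ROUTES.get(shot.kind, FALLBACK_MODEL)) for shot in shots]


storyboard = [
    Shot(1, "landscape", "fog rolling over a harbor at dawn"),
    Shot(2, "character", "close-up of the captain reading a letter"),
    Shot(3, "action", "the crew scrambles to raise the sails"),
    Shot(4, "title_card", "opening title over black"),
]
for shot_id, model in plan(storyboard):
    print(shot_id, model)
```

A real orchestrator would presumably choose models from measured quality rather than a fixed table, but the shape is the same: the artist describes shots, and the routing decision moves out of their hands.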

Prompt crafting is critical and underserved

The survey found that artists spend significant effort crafting detailed prompts, often using prompt rewriting tools to get better results. Multiple generation iterations per shot are standard — artists don't accept the first output, they iterate.

This directly supports Camera Artist's Cinematic Language Injection (arXiv 2604.09195) — a fine-tuned model that transforms generic descriptions into film-specific shot specs. Artists are already doing this manually, spending time translating their creative intent into model-friendly language. An automated translator would save that time.

It is also consistent with Mind-of-Director's template approach. If artists are writing the same shot descriptions repeatedly with minor variations, parameterized templates can capture the invariant structure and let them specify only what changes.
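As a rough illustration of the idea, the sketch below uses Python's standard `string.Template` to hold the invariant part of a shot description and fill in only the parameters that change. The template wording, the parameter names, and the defaults are assumptions made for this example; they are not Mind-of-Director's actual template format.

```python
from string import Template

# Invariant structure of a shot description (hypothetical wording).
SHOT_TEMPLATE = Template("$shot_size shot of $subject, $camera_move, $lighting lighting")

# Defaults the artist sets once per film; illustrative values only.
DEFAULTS = {
    "shot_size": "medium",
    "camera_move": "static camera",
    "lighting": "soft",
}


def render(subject, **overrides):
    """Fill the template, letting the caller override only what changes."""
    params = {**DEFAULTS, "subject": subject, **overrides}
    return SHOT_TEMPLATE.substitute(params)


print(render("a woman crossing a bridge"))
print(render("a woman crossing a bridge", camera_move="slow dolly in"))
```

The second call changes one parameter and inherits everything else, which is the time saving the template approach is after.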

What artists need that doesn't exist yet

Reading between the survey responses, three needs stand out:

Movement consistency tools. The #1 ranked priority has no dedicated solution. VideoMemory stores visual descriptors but not motion patterns. Camera Artist's recursive conditioning preserves camera continuity but not character motion style. AniMaker's MCTS (arXiv 2506.10540) evaluates clip quality but not motion consistency specifically. This is an open problem.
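To show what a movement consistency check might even measure, here is a minimal sketch. It assumes the hard part is already solved: that each shot comes with numeric motion descriptors such as stride frequency or gesture rate. The descriptor names, the values, and the 10% threshold are all hypothetical, and nothing here is drawn from the cited papers.

```python
def relative_drift(reference, candidate):
    """Largest relative difference over descriptors present in both shots.

    Assumes every reference value is non-zero.
    """
    shared = reference.keys() & candidate.keys()
    if not shared:
        raise ValueError("no shared motion descriptors to compare")
    return max(abs(candidate[k] - reference[k]) / abs(reference[k]) for k in shared)


DRIFT_THRESHOLD = 0.10  # hypothetical tolerance: 10% relative change

# Hypothetical per-shot descriptors for the same character.
shot_1 = {"stride_hz": 1.8, "gestures_per_min": 12.0}
shot_2 = {"stride_hz": 2.25, "gestures_per_min": 12.0}

drift = relative_drift(shot_1, shot_2)
print("inconsistent" if drift > DRIFT_THRESHOLD else "consistent")
```

The comparison itself is trivial. What is missing is a reliable way to extract descriptors like these from generated video, and a way to condition generation on them.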

Real-time creative iteration. Artists iterate heavily: multiple generations, prompt adjustments, model swaps. But current workflows are batch: generate, download, evaluate, adjust the prompt, regenerate. ShotStream's streaming holistic generation (arXiv 2603.25746) moves toward real-time, but it's research, not a product. Vidmento (arXiv 2601.22013, CHI 2026) designed an interactive canvas for iterative video authoring, but it focuses on hybrid workflows that mix captured and generated footage rather than on pure generation.

Multi-model orchestration. Artists already use multiple models per film, but they manage the handoffs manually: generate in Runway, download, generate in Kling, download, edit in Premiere. The agent-orchestrated paradigm (FilmAgent, Mind-of-Director, MovieAgent) automates this orchestration, but all three are research prototypes, not products.

The gap between research and tools

The survey was published in April 2025 and reflects the state of practice up to that point. In the fourteen months since, the research has advanced significantly — 33 papers addressing the problems artists described. But almost none of that research has reached production tools.

Character consistency (#3): five paper-level solutions (VideoMemory, CANVAS, StoryBlender, DreamShot, StoryMem). Commercial implementation: basic reference image conditioning in Kling and Runway.

Camera control (#2): four approaches (Mind-of-Director templates, Camera Artist's Cinematic Language Injection, ShotVerse calibration, FilMaster RAG). Commercial implementation: Runway's Director Mode, which lets you specify camera language but doesn't enforce cross-shot consistency.

Movement consistency (#1): essentially nothing in either research or commercial tools.

The artists know what they need. The researchers are building parts of it. The tool companies haven't shipped most of it. That gap between what practitioners ask for and what products deliver is where the next wave of AI filmmaking tools will compete. Whoever ships the first product that handles consistent movement, precise camera control, and multi-model orchestration in one pipeline will be well placed to lead that market.

The $3.24B→$23.54B market projection (Grand View Research) arguably depends on closing this gap. Right now, the tools are good enough for short-form clips and cartoonish styles. The artists want cinema. The research points the way. The products don't exist yet.

