
Building an AI Director Skill: From Paper to Pipeline

Thirty-three papers say multi-agent beats single-agent for film. Zero of them ship a product. This article bridges that gap with implementation patterns for building a multi-agent AI director pipeline from the components the research describes.

Six components. Each has a paper source, a concrete function, and an implementation sketch. You don't need all six. Start with the one that addresses your biggest quality gap.

Component 1: Shot template vocabulary

Paper source: Mind-of-Director (arXiv 2603.14790), Section III-D, Camera Template Library.

What it does: Replaces prose shot descriptions with parameterized templates. Instead of "medium shot, low angle" (ambiguous), the system outputs SINGLE_STATIC_MED(subject=alice, distance=2.0, angle=low_15, lens=50) (unambiguous).

Implementation: Define your templates as a YAML schema. Start with Mind-of-Director's 21 categories: 12 single-person (chosen from the static/dynamic × wide/medium/close-up/extreme-close-up × eye-level/low/high grid, which has 24 cells in total), 8 two-person (static/dynamic × OTS/two-shot/split-screen variants), and 1 group. Each template has typed parameter slots.

templates:
  SINGLE_STATIC_MED:
    params:
      subject: {type: character_id, required: true}
      distance: {type: float, range: [1.5, 3.0], default: 2.0}
      angle: {type: enum, values: [eye, low_15, low_30, high_15, high_30]}
      lens: {type: int, values: [24, 35, 50, 85, 135], default: 50}
      movement: {type: enum, values: [static, slow_push, slow_pull]}
    usage: "Single character, waist-up framing, suitable for dialogue delivery"

The LLM's prompt includes the full template vocabulary. It selects a template name and fills params. The template is then expanded into the actual generation prompt with exact specifications.
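The select-fill-expand step can be sketched in a few lines of Python. This is a minimal illustration, not Mind-of-Director's implementation: the `TEMPLATES` dict is a hand-copied mirror of the YAML above, and `resolve_params` and `expand_to_prompt` are hypothetical helper names. The output string format is also an assumption; your generation tool will want its own wording.

```python
# Hypothetical in-memory mirror of the YAML schema above; a real pipeline
# would load this from the template file instead of hard-coding it.
TEMPLATES = {
    "SINGLE_STATIC_MED": {
        "params": {
            "subject": {"type": "character_id", "required": True},
            "distance": {"type": "float", "range": [1.5, 3.0], "default": 2.0},
            "angle": {"type": "enum",
                      "values": ["eye", "low_15", "low_30", "high_15", "high_30"]},
            "lens": {"type": "int", "values": [24, 35, 50, 85, 135], "default": 50},
            "movement": {"type": "enum",
                         "values": ["static", "slow_push", "slow_pull"]},
        },
        "usage": "Single character, waist-up framing, suitable for dialogue delivery",
    }
}


def resolve_params(template_name, filled):
    """Validate LLM-filled params against the schema and apply defaults."""
    schema = TEMPLATES[template_name]["params"]  # KeyError on unknown template
    unknown = set(filled) - set(schema)
    if unknown:
        raise ValueError(f"unknown params for {template_name}: {sorted(unknown)}")
    resolved = {}
    for name, spec in schema.items():
        if name in filled:
            value = filled[name]
        elif "default" in spec:
            value = spec["default"]
        elif spec.get("required"):
            raise ValueError(f"missing required param: {name}")
        else:
            continue  # optional with no default: leave unspecified
        if "values" in spec and value not in spec["values"]:
            raise ValueError(f"{name}={value!r} not in {spec['values']}")
        if "range" in spec:
            low, high = spec["range"]
            if not low <= value <= high:
                raise ValueError(f"{name}={value!r} outside [{low}, {high}]")
        resolved[name] = value
    return resolved


def expand_to_prompt(template_name, filled):
    """Turn a validated template selection into a generation prompt fragment."""
    params = resolve_params(template_name, filled)
    spec = ", ".join(f"{key}={value}" for key, value in params.items())
    return f"{TEMPLATES[template_name]['usage']}. Camera spec: {spec}."


prompt = expand_to_prompt("SINGLE_STATIC_MED", {"subject": "alice", "angle": "low_15"})
```

Rejecting out-of-vocabulary values at this point means a malformed LLM selection fails loudly before any generation budget is spent.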

Why it works: Mind-of-Director's camera accuracy went from 64.4% to 79.2%. Templates eliminate the re-interpretation ambiguity that causes framing drift between shots.

Component 2: Four critic role prompts

Paper source: Mind-of-Director (arXiv 2603.14790), Algorithms 1 and 2; GenMAC (arXiv 2412.04440), REDESIGN stage.

What it does: Splits the quality review from one agent's mental pass into four role-specific critics, each evaluating a different concern. The roles: Continuity (do visual elements match across shots?), DP/Cinematographer (does the camera language serve the scene?), Performance (do character actions read correctly?), Comprehension (does the audience understand what's happening?).

Implementation: Four system prompts, each focused on one concern. After the director agent produces the shot plan, each critic evaluates it:

The Continuity critic checks: same wardrobe as last shot? Same props present? Background consistent with establishing shot? Character positions plausible given blocking?

The DP critic checks: does the shot template serve the emotional beat? Is the framing varied enough from the previous shot to justify a cut? Would a real cinematographer choose this setup for this moment?

The Performance critic checks: does the character's action match the dialogue? Is the body language readable in this framing? Would the actor's motivation be clear?

The Comprehension critic checks: does the audience know where they are? Is the spatial geography of the scene clear? Is the narrative beat landing?

Each critic returns a pass/fail per concern with specific notes on failures. The director agent revises based on the combined feedback. After revision, the director judges whether to approve. This is Mind-of-Director's Discuss-Revise-Judge loop split across four perspectives.
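The review-revise-judge flow described above might be wired up roughly like this. It is a sketch under stated assumptions: each critic, the `revise` step, and the `judge` step would be LLM calls in practice, and here they are stand-in functions with simple rules so the control flow can run. The `Verdict` shape and the plan fields (`wardrobe`, `previous_wardrobe`, `location`) are hypothetical, and only two of the four critics are stubbed.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    role: str
    passed: bool
    notes: list = field(default_factory=list)


def review_shot(plan, critics, revise, judge):
    """Run every critic once; on any failure, revise and let the judge decide."""
    verdicts = [critic(plan) for critic in critics]
    failures = [v for v in verdicts if not v.passed]
    if not failures:
        return plan, True, verdicts
    revised = revise(plan, failures)
    return revised, judge(revised, failures), verdicts


# Stand-ins for LLM calls with role-specific system prompts.
def continuity_critic(plan):
    ok = plan["wardrobe"] == plan["previous_wardrobe"]
    notes = [] if ok else ["wardrobe differs from previous shot"]
    return Verdict("continuity", ok, notes)


def comprehension_critic(plan):
    ok = bool(plan.get("location"))
    notes = [] if ok else ["audience cannot tell where the scene is"]
    return Verdict("comprehension", ok, notes)


critics = [continuity_critic, comprehension_critic]


def revise(plan, failures):
    """Stand-in for the director's revision; returns a new plan."""
    revised = dict(plan)
    if any(v.role == "continuity" for v in failures):
        revised["wardrobe"] = plan["previous_wardrobe"]
    return revised


def judge(plan, failures):
    """Stand-in for the director's approval call: re-check the revised plan."""
    return all(critic(plan).passed for critic in critics)


draft = {"wardrobe": "red coat", "previous_wardrobe": "blue denim jacket",
         "location": "bar"}
final_plan, approved, first_pass = review_shot(draft, critics, revise, judge)
```

Keeping each critic as a separate callable makes it easy to add the DP and Performance roles later, or to drop a role whose failures you never see.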

Why it works: The ablation shows a gain of roughly 5 percentage points in motion accuracy from structured multi-role critique. The cost is 4 to 5 additional API calls per shot. The gain is catching failures that a single agent's "mental sweep" misses because it's checking too many things at once.

Component 3: Debate loop for camera

Paper source: Mind-of-Director (arXiv 2603.14790), Algorithm 2, Debate-Judge-Validation.

What it does: For camera decisions specifically, generates two independent proposals and has them cross-critique before the director chooses.

Implementation: Two DP agent instances (same system prompt, different seeds or temperature). Each receives the scene context, blocking info, and template vocabulary. Each independently proposes a shot setup. Then each critiques the other's proposal — "your OTS from the left loses the window light that motivates the scene" or "your wide is too distant for the emotional intimacy of this beat."

The director agent receives both proposals and both critiques, then selects the winner (or synthesizes a hybrid). The selected shot continues through critic review (Component 2) and, once a still is rendered, the validation gate (Component 4).
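As a rough sketch, the propose, cross-critique, and select steps reduce to one small function once the three agent calls are injected. Everything here is illustrative: `run_camera_debate` is a hypothetical name, the template names `TWO_STATIC_OTS` and `SINGLE_STATIC_WIDE` are invented for the demo, and the stub functions stand in for LLM calls.

```python
def run_camera_debate(context, propose, critique, judge, seeds=(1, 2)):
    """Two independent proposals, one critique in each direction, one selection."""
    proposal_a, proposal_b = (propose(context, seed) for seed in seeds)
    critiques = {
        "a_on_b": critique(proposal_a, proposal_b),  # A's objections to B
        "b_on_a": critique(proposal_b, proposal_a),  # B's objections to A
    }
    return judge([proposal_a, proposal_b], critiques), critiques


# Stand-ins for the two DP agent instances and the director.
def stub_propose(context, seed):
    template = "TWO_STATIC_OTS" if seed % 2 else "SINGLE_STATIC_WIDE"
    return {"template": template, "seed": seed}


def stub_critique(own, other):
    if other["template"] == "SINGLE_STATIC_WIDE":
        return "too distant for the emotional intimacy of this beat"
    return ""


def stub_judge(proposals, critiques):
    proposal_a, proposal_b = proposals
    # Pick B only if A drew an objection and B did not; otherwise keep A.
    if critiques["b_on_a"] and not critiques["a_on_b"]:
        return proposal_b
    return proposal_a


winner, critiques = run_camera_debate(
    {"beat": "confession"}, stub_propose, stub_critique, stub_judge
)
```

The proposals are generated before either critique runs, which is what keeps them independent; letting one agent see the other's draft first would collapse the debate into a single revised proposal.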

Why it works: Camera collision dropped from 9.6% to 2.1% under Debate-Judge-Validation. Two proposals generate genuine optionality. The cross-critique surfaces problems that self-critique misses because each agent defends a different position.

When to skip it: For routine shots where the template choice is obvious (establishing wide, standard OTS for dialogue). Use the debate loop for shots where the creative choice is genuinely uncertain — key emotional beats, transitions, climactic moments.

Component 4: VLM still-audit gate

Paper source: Agentic Video Generation (arXiv 2604.10383), engine validation; Mind-of-Director engine simulation; MUSE (arXiv 2602.03028), verify step.

What it does: After generating a still frame but before promoting to motion rendering, a VLM checks the still against the prompt and flags mismatches.

Implementation: One VLM call (GPT-4o, Claude, Gemini Pro) per still. Input: the rendered still, the generation prompt, the shot template params, character reference images, scene anchor image. Prompt: "Compare this image against the specification. Check: (1) character identity matches references, (2) wardrobe matches prompt, (3) framing matches template params, (4) background matches scene anchor, (5) props present and correct, (6) spatial arrangement plausible. Return pass/fail per check with failure descriptions."

If all pass → promote to motion rendering. If any fail → log the failure reason, adjust the prompt to address the specific failure, regenerate the still, re-audit. After 3 failed retries → flag for human review.
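The promote, retry, and escalate logic is mostly bookkeeping, so it can be sketched independently of any particular VLM or image tool. In this hedged example, `generate_still`, `audit_still`, and `adjust_prompt` are injected callables you would back with your own APIs; the report shape `{check: (passed, note)}` is an assumption, and the `fake_*` functions exist only so the loop can run.

```python
def audit_gate(prompt, generate_still, audit_still, adjust_prompt, max_retries=3):
    """Return (still, status, log); status is 'promote' or 'human_review'."""
    log = []
    still = generate_still(prompt)
    for attempt in range(max_retries + 1):  # initial audit plus max_retries re-audits
        report = audit_still(still, prompt)  # assumed shape: {check: (passed, note)}
        failures = {check: note for check, (ok, note) in report.items() if not ok}
        if not failures:
            return still, "promote", log
        log.append(failures)  # record why this still was rejected
        if attempt == max_retries:
            break
        prompt = adjust_prompt(prompt, failures)
        still = generate_still(prompt)
    return still, "human_review", log


# Stand-ins for the image generator, the VLM call, and the prompt fixer.
def fake_generate(prompt):
    return {"rendered_from": prompt}


def fake_audit(still, prompt):
    ok = "blue denim jacket" in still["rendered_from"]
    return {"wardrobe": (ok, "" if ok else "jacket missing")}


def fake_adjust(prompt, failures):
    return prompt + " Alice wears a blue denim jacket."


still, status, log = audit_gate(
    "Alice at the bar.", fake_generate, fake_audit, fake_adjust
)
```

Returning the failure log alongside the status gives the human reviewer the full rejection history when a shot does get escalated.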

Why it works: Agentic Video Generation showed 58% physical validity with validation vs 25% without. Each VLM call costs fractions of a cent. Each avoided failed motion render saves $0.10-$1.00+ depending on your generation tool. The ROI is immediate.
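The cost argument can be checked with back-of-envelope arithmetic. The sketch below makes simplifying assumptions that are not from the papers: the audit catches every bad still, the cost of regenerating a still is ignored, and the half-cent audit price and 25% failure rate are illustrative placeholders for your own numbers.

```python
def break_even_failure_rate(audit_cost, render_cost):
    """Share of stills that must be bad before auditing pays for itself."""
    if render_cost <= 0:
        raise ValueError("render_cost must be positive")
    return audit_cost / render_cost


def expected_saving_per_still(failure_rate, audit_cost, render_cost):
    """Expected money saved per still by auditing before motion rendering."""
    return failure_rate * render_cost - audit_cost


AUDIT_COST = 0.005   # assumed: half a cent per VLM call
RENDER_COST = 0.10   # low end of the range quoted above

threshold = break_even_failure_rate(AUDIT_COST, RENDER_COST)
saving = expected_saving_per_still(0.25, AUDIT_COST, RENDER_COST)
```

Under these inputs the gate breaks even once about 5% of stills would have failed, and the threshold drops further as motion renders get more expensive.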

Component 5: Entity memory registry

Paper source: VideoMemory (arXiv 2601.03655), Dynamic Memory Bank; StoryBlender (arXiv 2604.03315), continuity memory graph.

What it does: Maintains a per-episode database of characters, props, and locations with their current visual state. Updated after each shot. Queried before each generation.

Implementation: A simple data store (JSON file, SQLite, Redis — doesn't matter at this scale):

{
  "characters": {
    "alice": {
      "face_reference": "path/to/alice_face.png",
      "current_wardrobe": "blue denim jacket, white t-shirt, dark jeans",
      "current_state": "standing, composed",
      "last_seen_shot": 5
    }
  },
  "locations": {
    "bar": {
      "anchor_image": "path/to/bar_establishing.png",
      "description": "dim interior, wooden counter, bottles on shelves",
      "time_of_day": "evening"
    }
  },
  "props": {
    "whiskey_glass": {
      "description": "short rocks glass, amber liquid",
      "current_holder": "alice",
      "location": "bar_counter"
    }
  }
}

Before generating shot N: query all entities appearing in the shot, include their current state in the prompt, pass reference images and scene anchor.

After generating shot N: update any entities whose state changed (character picked up a prop, changed expression, moved to a new position). Use the VLM audit output to verify the update — did the generation actually show what the prompt specified?

Tag state as permanent (identity, physical appearance) or transient (expression, pose, lighting). Only permanent state propagates automatically. Transient state gets the current shot's specification, not the previous shot's state.
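The query-before, update-after cycle with permanent and transient tagging could look something like the following. This is a sketch, not VideoMemory's or StoryBlender's design: the function names are hypothetical, and the decision to treat `current_state` as transient while carrying wardrobe forward is this example's assumption about how the fields in the JSON above map onto the two tags.

```python
import copy

# Assumption: `current_state` (pose, expression) is transient; face reference
# and wardrobe are carried forward until the story explicitly changes them.
TRANSIENT_FIELDS = {"current_state"}
BOOKKEEPING_FIELDS = {"last_seen_shot"}

registry = {
    "characters": {
        "alice": {
            "face_reference": "path/to/alice_face.png",
            "current_wardrobe": "blue denim jacket, white t-shirt, dark jeans",
            "current_state": "standing, composed",
            "last_seen_shot": 5,
        }
    }
}


def build_shot_context(registry, character_ids, shot_spec):
    """Carry forward persistent fields; take transient fields from this shot."""
    skipped = TRANSIENT_FIELDS | BOOKKEEPING_FIELDS
    context = {}
    for cid in character_ids:
        entity = registry["characters"][cid]
        carried = {k: v for k, v in entity.items() if k not in skipped}
        carried.update(shot_spec.get(cid, {}))  # transient state for shot N
        context[cid] = carried
    return context


def update_after_shot(registry, shot_number, verified_changes):
    """Return a new registry with changes the audit confirmed on screen."""
    updated = copy.deepcopy(registry)
    for cid, changes in verified_changes.items():
        entity = updated["characters"][cid]
        entity.update(changes)
        entity["last_seen_shot"] = shot_number
    return updated


context = build_shot_context(
    registry, ["alice"], {"alice": {"current_state": "seated, laughing"}}
)
registry_after = update_after_shot(
    registry, 6, {"alice": {"current_state": "seated, laughing"}}
)
```

Returning a fresh registry from `update_after_shot` keeps the pre-shot state around, which helps when a shot is later rejected and has to be regenerated from the same starting point.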

Why it works: VideoMemory's benchmark shows consistent entity portrayal across distant shots. The retrieval-update loop prevents both drift (gradual change without cause) and staleness (failing to update when the story changes something).

Component 6: Audience evaluation pass

Paper source: FilMaster (arXiv 2506.18899), Audience-Centric Cinematic Rhythm Control.

What it does: After assembling the full sequence, an LLM evaluates it as a simulated viewer and provides feedback on pacing, engagement, and narrative flow.

Implementation: One API call on the assembled cut. Input: the sequence of shot descriptions (with templates, blocking, dialogue), the overall scene/episode context. Prompt: "You are a first-time viewer watching this sequence. Evaluate: (1) pacing — too fast, too slow, or right for the genre? (2) attention — where would your focus drift? (3) cuts — any jarring transitions? (4) engagement — rate 1-10 and explain. (5) dead zones — any stretches where nothing pulls you forward?"

The feedback informs post-assembly refinement: trim shots the viewer found slow, add beats where attention drops, flag jarring cuts for regeneration or reordering.
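To act on the feedback programmatically, it helps to ask the evaluator for structured output. The sketch below assumes you instruct the LLM to reply in JSON with the keys `slow_shots`, `attention_drops`, and `jarring_cuts`; that reply format, the shot dict fields, and both function names are illustrative choices, not something FilMaster specifies.

```python
import json


def build_audience_prompt(shots, scene_context):
    """Assemble the simulated-viewer prompt from the ordered shot list."""
    lines = [
        "You are a first-time viewer watching this sequence.",
        f"Context: {scene_context}",
        "Shots:",
    ]
    for number, shot in enumerate(shots, start=1):
        lines.append(f"{number}. [{shot['template']}] {shot['description']}")
    lines.append(
        "Evaluate pacing, attention, cuts, engagement (1-10), and dead zones. "
        "Reply as JSON with keys slow_shots, attention_drops, jarring_cuts, "
        "engagement."
    )
    return "\n".join(lines)


def plan_refinements(reply_text):
    """Map the evaluator's JSON reply onto post-assembly actions."""
    feedback = json.loads(reply_text)
    actions = [("trim", shot) for shot in feedback.get("slow_shots", [])]
    actions += [("add_beat", shot) for shot in feedback.get("attention_drops", [])]
    actions += [("review_cut", tuple(pair)) for pair in feedback.get("jarring_cuts", [])]
    return actions


shots = [
    {"template": "SINGLE_STATIC_MED", "description": "Alice waits at the bar."},
    {"template": "SINGLE_STATIC_MED", "description": "Alice checks her phone."},
]
audience_prompt = build_audience_prompt(shots, "Evening, dim bar interior")

# Example reply, written by hand to show the assumed format.
sample_reply = '{"slow_shots": [2], "attention_drops": [], "jarring_cuts": [[1, 2]], "engagement": 6}'
actions = plan_refinements(sample_reply)
```

In practice the reply should be validated before use, since an LLM can return malformed JSON or shot numbers that do not exist in the cut.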

Why it works: FilMaster's Rough Cut → Fine Cut pipeline mirrors real post-production. Technical quality checks (Components 1-5) ensure the shots are correct. The audience check ensures the sequence is watchable. These are different things.

Putting it together

The full pipeline: Script → blocking → shot template selection (Component 1) → DP debate for key shots (Component 3) → four-critic review (Component 2) → still generation → VLM audit (Component 4) → motion rendering → assembly → audience evaluation (Component 6). Entity registry (Component 5) runs throughout, queried before generation and updated after.
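The stage ordering above can be expressed as a thin orchestrator that takes every stage as an injected callable. This is a structural sketch only: the stage names in the `stages` dict are hypothetical, the lambdas are placeholders that record call order, and a failed audit simply skips the shot where a fuller version would retry and then escalate.

```python
def run_pipeline(script, stages):
    """Run each shot through the stages in order, then evaluate the cut."""
    registry = {}
    clips = []
    for shot in stages["blocking"](script):
        state = stages["query_registry"](registry, shot)             # Component 5
        plan = stages["select_template"](shot, state)                # Component 1
        if stages["is_key_shot"](shot):
            plan = stages["debate"](plan)                            # Component 3
        plan = stages["critic_review"](plan)                         # Component 2
        still = stages["generate_still"](plan)
        if not stages["audit"](still, plan):                         # Component 4
            continue  # a fuller version would retry, then flag for review
        clip = stages["render_motion"](still)
        registry = stages["update_registry"](registry, shot, clip)   # Component 5
        clips.append(clip)
    cut = stages["assemble"](clips)
    return cut, stages["audience_eval"](cut)                         # Component 6


trace = []


def traced(name, fn):
    """Wrap a placeholder stage so the demo records the order of calls."""
    def stage(*args):
        trace.append(name)
        return fn(*args)
    return stage


stages = {
    "blocking": traced("blocking", lambda script: [{"id": 1, "key": True},
                                                   {"id": 2, "key": False}]),
    "query_registry": traced("query", lambda reg, shot: dict(reg)),
    "select_template": traced("template", lambda shot, state: {"shot": shot["id"]}),
    "is_key_shot": lambda shot: shot["key"],
    "debate": traced("debate", lambda plan: plan),
    "critic_review": traced("critics", lambda plan: plan),
    "generate_still": traced("still", lambda plan: f"still-{plan['shot']}"),
    "audit": traced("audit", lambda still, plan: True),
    "render_motion": traced("motion", lambda still: still.replace("still", "clip")),
    "update_registry": traced("update", lambda reg, shot, clip: {**reg, "last": shot["id"]}),
    "assemble": traced("assemble", lambda clips: list(clips)),
    "audience_eval": traced("audience", lambda cut: {"engagement": 7}),
}

cut, feedback = run_pipeline("INT. BAR - EVENING", stages)
```

Because every stage is a plain callable, you can start with only the audit gate wired to a real model and leave the rest as pass-through functions, then replace them one at a time.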

You don't need all six on day one. Priority order by impact-per-effort:

1. VLM still-audit gate (Component 4): one function, immediate ROI, catches the most expensive failures
2. Shot template vocabulary (Component 1): one YAML file, eliminates framing drift
3. Entity registry (Component 5): one data store, prevents consistency failures across shots
4. Four critic roles (Component 2): four prompts, catches quality issues the single pass misses
5. DP debate (Component 3): for key shots only, highest-quality camera decisions
6. Audience evaluation (Component 6): post-assembly, catches engagement problems

Add them in that order. Each layer compounds with the previous ones. As a rule of thumb, the first three give you roughly 80% of the quality gain for 20% of the implementation effort.
