Discuss-Revise-Judge vs Debate-Judge-Validation: Picking Your Collaboration Pattern

Mind-of-Director uses both patterns, and it uses them for different stages. That's the tell: if one pattern were universally better, the authors would use it everywhere, and they don't. The choice of collaboration pattern is an engineering decision with measurable tradeoffs, and the paper's own architecture is the clearest evidence for when each one fits.

The two core patterns

Discuss-Revise-Judge works like a writing workshop. One agent drafts. Other agents critique. The drafter revises based on the critique. A judge (the director) approves or sends it back for another round. In Mind-of-Director (arXiv 2603.14790), this pattern handles scripts and character blocking — tasks where you're iteratively improving a single artifact.

The loop (Algorithm 1 in the paper): Screenwriter drafts act → Actors discuss and give feedback → Screenwriter revises → Director judges → approve or loop back. This costs roughly 3x the tokens of a single pass, because you're generating a draft, a critique, a revision, and a judgment.
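The control flow of that loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1: the function name `discuss_revise_judge`, the four callables, and the `max_rounds` cap are all hypothetical names introduced for illustration, and in a real pipeline each callable would wrap an LLM call.

```python
from typing import Callable, List, Tuple


def discuss_revise_judge(
    draft: Callable[[], str],
    critique: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    judge: Callable[[str], bool],
    max_rounds: int = 3,  # assumed safety cap so the loop always terminates
) -> Tuple[str, int]:
    """Return the final artifact and the number of revision rounds used."""
    artifact = draft()  # screenwriter drafts
    for round_no in range(1, max_rounds + 1):
        notes = critique(artifact)          # actors discuss and give feedback
        artifact = revise(artifact, notes)  # same drafter revises its own work
        if judge(artifact):                 # director approves or loops back
            return artifact, round_no
    # Budget exhausted: hand back the latest revision unapproved.
    return artifact, max_rounds


if __name__ == "__main__":
    # Toy stand-ins for the agents; the judge approves after two revisions.
    final, rounds = discuss_revise_judge(
        draft=lambda: "ACT 1",
        critique=lambda text: ["tighten the dialogue"],
        revise=lambda text, notes: text + " +rev",
        judge=lambda text: text.count("+rev") >= 2,
    )
    print(final, rounds)
```

Note that a single artifact flows through the whole loop; nothing here ever generates a competing alternative.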

Debate-Judge-Validation works like a competitive pitch. Two agents independently propose solutions. They cross-critique each other. A judge picks the better option. Then a validator (the game engine, a VLM, a physics sim) checks whether the chosen option is physically possible. If not, the system loops back with validation feedback.

The loop (Algorithm 2 in the paper): Cinematographer A proposes shot → Cinematographer B proposes shot → they critique each other → Director selects winner → Engine validates for collisions/occlusions → if pass, commit; if fail, Director adjusts params and re-validates. This costs roughly 4x baseline tokens, because you're generating two proposals, two critiques, a selection, and a validation.
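The same kind of sketch works for this loop. Again, this is an illustration rather than the paper's Algorithm 2: `debate_judge_validate`, its callables, the dictionary shape of a shot, and the `max_retries` cap are assumptions made for the example. The validator stands in for the engine check and returns a pass flag plus feedback.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Shot = Dict[str, Any]  # hypothetical shot representation


def debate_judge_validate(
    propose_a: Callable[[], Shot],
    propose_b: Callable[[], Shot],
    cross_critique: Callable[[Shot, Shot], Dict[str, str]],
    select: Callable[[Shot, Shot, Dict[str, str]], Shot],
    validate: Callable[[Shot], Tuple[bool, str]],
    adjust: Callable[[Shot, str], Shot],
    max_retries: int = 3,  # assumed cap on adjust-and-revalidate cycles
) -> Tuple[Optional[Shot], int]:
    """Return the committed shot (or None) and the number of adjustments made."""
    a, b = propose_a(), propose_b()      # two independent proposals
    critiques = cross_critique(a, b)     # each side critiques the other
    shot = select(a, b, critiques)       # director picks a winner
    for attempt in range(max_retries + 1):
        ok, feedback = validate(shot)    # engine check: collisions, occlusions
        if ok:
            return shot, attempt         # commit
        if attempt < max_retries:
            shot = adjust(shot, feedback)  # director tweaks params, then re-validates
    return None, max_retries             # nothing passed validation


if __name__ == "__main__":
    shot, adjustments = debate_judge_validate(
        propose_a=lambda: {"name": "medium", "distance": 2.0},
        propose_b=lambda: {"name": "close-up", "distance": 0.5},
        cross_critique=lambda a, b: {"a_on_b": "too tight", "b_on_a": "too flat"},
        select=lambda a, b, critiques: b,
        validate=lambda s: (s["distance"] >= 1.0, "camera clips the wall"),
        adjust=lambda s, feedback: {**s, "distance": s["distance"] + 0.5},
    )
    print(shot, adjustments)
```

The structural difference from the revision loop is visible in the first three lines of the function body: diversity is generated before anything is judged, and the final gate is a check the judge cannot talk its way past.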

Why Mind-of-Director uses DRJ for scripts and DJV for camera

Scripts benefit from iterative refinement. The first draft isn't wrong in a "pick the other option" way — it's wrong in a "this line is clunky, fix it" way. You want the same drafter to improve their own work based on specific feedback. Having two screenwriters independently write the same scene and picking the better one would waste half the generation. The draft is the valuable artifact; the critique makes it better.

Camera shots benefit from choosing between alternatives. A medium shot and a close-up are both valid choices for the same line of dialogue — the question is which serves the scene better. Having one cinematographer propose and then critique their own proposal doesn't generate alternatives. You need two independent proposals to get genuine optionality. And camera choices have a binary physical validation: the trajectory either clips through geometry or it doesn't.

The collision rate data makes this concrete. Camera collision dropped from 9.6% to 2.1% under Debate-Judge-Validation. The two-proposal step generates diversity. The engine validation catches impossibilities. Neither mechanism exists in Discuss-Revise-Judge, and for camera planning, both matter.

The cheaper alternatives

Not every stage justifies 3-4x token overhead. The spectrum from cheapest to most expensive:

Recursive conditioning (Camera Artist, arXiv 2604.09195) extends the context window so each shot conditions on the previous one. About 1.2x baseline cost. Gets you shot-to-shot continuity but doesn't catch structural problems. Best for: maintaining consistency in straightforward sequences where each shot logically follows the last and creative decisions are low-stakes.
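As a rough illustration of the idea, recursive conditioning amounts to threading the previous shot into the next planning call. The sketch below is an assumption-laden toy, not Camera Artist's implementation: `plan_shots_recursively` and the `plan_shot` callable are hypothetical names, and a real planner would be a model call whose prompt includes the prior shot.

```python
from typing import Callable, List, Optional


def plan_shots_recursively(
    beats: List[str],
    plan_shot: Callable[[str, Optional[str]], str],
) -> List[str]:
    """Plan one shot per story beat, conditioning each on the previous shot."""
    shots: List[str] = []
    previous: Optional[str] = None  # the first shot has nothing to condition on
    for beat in beats:
        shot = plan_shot(beat, previous)  # one call per shot, slightly longer context
        shots.append(shot)
        previous = shot
    return shots


if __name__ == "__main__":
    # Toy planner that records which beat the shot follows.
    def toy_planner(beat: str, previous: Optional[str]) -> str:
        prior = "none" if previous is None else previous.split("|")[0]
        return f"{beat}|prev={prior}"

    print(plan_shots_recursively(["arrival", "argument", "exit"], toy_planner))
```

There is still only one call per shot, which is why the overhead stays small: the extra cost is the added context, not extra agents.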

Hierarchical CoT (MovieAgent, arXiv 2503.07314) structures a single agent's reasoning through layers: theme → scene composition → emotional tone → per-shot parameters. About 1.5x baseline. Forces the model to commit to high-level decisions before getting lost in details. Best for: when you trust the model's judgment but want to prevent it from skipping narrative-level reasoning. Cheaper than DRJ because there's no separate critic — the hierarchy itself constrains the reasoning.
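One way to picture the layering is a loop that settles each level before prompting for the next. This is a minimal sketch, assuming one model call per layer; MovieAgent may well structure this differently (for example, inside a single prompt). The `hierarchical_cot` function, the `ask` callable, and the prompt wording are all hypothetical.

```python
from typing import Callable, Dict, List

# Layer order taken from the description above: broad decisions first.
LAYERS: List[str] = [
    "theme",
    "scene composition",
    "emotional tone",
    "per-shot parameters",
]


def hierarchical_cot(brief: str, ask: Callable[[str], str]) -> Dict[str, str]:
    """Commit to each layer in order, feeding earlier decisions into later prompts."""
    decisions: Dict[str, str] = {}
    for layer in LAYERS:
        settled = "\n".join(f"{name}: {value}" for name, value in decisions.items())
        prompt = (
            f"Brief: {brief}\n"
            f"Already decided:\n{settled}\n"
            f"Now decide: {layer}"
        )
        decisions[layer] = ask(prompt)  # `ask` stands in for a single-agent model call
    return decisions


if __name__ == "__main__":
    # Toy model that just labels its answers.
    for layer, decision in hierarchical_cot("coffee ad", lambda p: "(answer)").items():
        print(layer, "->", decision)
```

No critic appears anywhere in the loop; the ordering of the prompts does the constraining.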

Self-routing correction (GenMAC, arXiv 2412.04440, AAAI 2026) adds marginal cost on top of whatever correction loop you're using. After verification diagnoses a failure, self-routing dispatches it to a specialist correction agent instead of a generalist. The token overhead is one routing decision per failure. Best for: when your failure modes are diverse — attribute binding errors need different fixes than spatial errors, which need different fixes than motion errors. A single "fix everything" agent hallucinates solutions for problems it doesn't understand.
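The routing step itself is small, which is why its overhead is marginal. The following is a hedged sketch of the dispatch idea only: the failure-type labels, the `make_router` helper, and the fixer functions are hypothetical and are not taken from GenMAC's code.

```python
from typing import Callable, Dict

Fixer = Callable[[str], str]  # takes a clip id, returns a description of the fix


def make_router(specialists: Dict[str, Fixer], generalist: Fixer) -> Callable[[str, str], str]:
    """Build a router that sends each diagnosed failure to a matching specialist."""

    def route(failure_type: str, clip_id: str) -> str:
        # One lookup per failure; unknown failure types fall back to the generalist.
        fixer = specialists.get(failure_type, generalist)
        return fixer(clip_id)

    return route


if __name__ == "__main__":
    route = make_router(
        specialists={
            "attribute_binding": lambda clip: f"{clip}: re-bind attributes to subjects",
            "spatial": lambda clip: f"{clip}: re-plan object layout",
            "motion": lambda clip: f"{clip}: re-plan trajectories",
        },
        generalist=lambda clip: f"{clip}: generic regeneration",
    )
    print(route("spatial", "shot-07"))
    print(route("lighting", "shot-08"))
```

In a real system the `failure_type` would come from the verification step's diagnosis rather than being passed in by hand.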

The expensive alternatives

Multi-Armed Bandit (Co-Director, arXiv 2604.24842, Google) explores multiple creative directions globally. You're not refining one draft or choosing between two proposals — you're exploring N narrative strategies and exploiting the effective ones. The token cost scales with N. Best for: when the creative direction itself is uncertain. If you don't know whether to open on the product or the problem or a testimonial, MAB explores the space systematically. Overkill for scenes where the creative direction is clear and you just need quality execution.
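To make the explore/exploit tradeoff concrete, here is a sketch using UCB1, a standard bandit algorithm. It is not necessarily the strategy Co-Director uses; the arm names, the fixed reward table, and the `ucb1_explore` function are assumptions chosen for illustration. In practice the reward would come from some quality score on the generated output.

```python
import math
from typing import Callable, Dict, List


def ucb1_explore(
    arms: List[str],
    reward_fn: Callable[[str], float],
    rounds: int,
) -> Dict[str, int]:
    """Run UCB1 for `rounds` pulls and return how often each arm was chosen."""
    counts = {arm: 0 for arm in arms}
    totals = {arm: 0.0 for arm in arms}
    for t in range(1, rounds + 1):
        untried = [arm for arm in arms if counts[arm] == 0]
        if untried:
            arm = untried[0]  # try every direction once before comparing
        else:
            # Mean reward plus an exploration bonus that shrinks with more pulls.
            arm = max(
                arms,
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        totals[arm] += reward_fn(arm)  # each pull is one generation plus one evaluation
        counts[arm] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical, deterministic quality scores for three opening strategies.
    quality = {"product": 0.2, "problem": 0.9, "testimonial": 0.4}
    pulls = ucb1_explore(list(quality), lambda arm: quality[arm], rounds=200)
    print(pulls)
```

Every pull costs a generation, so the budget is spent on exploration before any final output exists. That is the sense in which token cost scales with N.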

MCTS (AniMaker, arXiv 2506.10540) runs Monte Carlo Tree Search over candidate clips. For each shot, it generates multiple candidates; for each sequence of shots, it evaluates the global quality of the combination. Tree search finds the sequence that works best as a whole, not just shot-by-shot. Best for: long sequences where one bad clip in position 3 cascades through positions 4-10. The compute cost grows with both the branching factor and the sequence length, which gets expensive fast.
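A compact, generic UCT-style search shows what "evaluate the sequence as a whole" looks like in code. This is a textbook-flavored sketch, not AniMaker's implementation: `uct_search`, the toy `continuity` scorer, the exploration constant, and the clip labels are all assumptions. Rewards are assumed to lie in [0, 1].

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Prefix = Tuple[str, ...]  # a partial sequence of chosen clip ids


def continuity(seq: Prefix) -> float:
    """Toy global score: fraction of adjacent clips that share a look."""
    pairs = list(zip(seq, seq[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)


def uct_search(
    candidates: Sequence[Sequence[str]],
    score: Callable[[Prefix], float],
    iterations: int = 200,
    exploration: float = 1.4,
    seed: int = 0,
) -> Tuple[Prefix, float]:
    """Search over one-clip-per-position sequences; return the best one seen."""
    rng = random.Random(seed)
    depth = len(candidates)
    stats: Dict[Prefix, List[float]] = {}  # prefix -> [visits, total reward]
    best_seq: Prefix = ()
    best_score = float("-inf")

    def uct(child: Prefix, parent_visits: float) -> float:
        visits, total = stats[child]
        return total / visits + exploration * math.sqrt(math.log(parent_visits) / visits)

    for _ in range(iterations):
        node: Prefix = ()
        path = [node]
        # Selection: descend by UCT until a node with an untried child, then expand it.
        while len(node) < depth:
            children = [node + (clip,) for clip in candidates[len(node)]]
            untried = [child for child in children if child not in stats]
            if untried:
                node = rng.choice(untried)
                path.append(node)
                break
            parent_visits = stats[node][0]
            node = max(children, key=lambda child: uct(child, parent_visits))
            path.append(node)
        # Rollout: fill the remaining positions at random and score the whole sequence.
        rollout = node + tuple(rng.choice(candidates[i]) for i in range(len(node), depth))
        reward = score(rollout)
        if reward > best_score:
            best_seq, best_score = rollout, reward
        # Backpropagation: credit every prefix on the path.
        for visited in path:
            entry = stats.setdefault(visited, [0, 0.0])
            entry[0] += 1
            entry[1] += reward
    return best_seq, best_score


if __name__ == "__main__":
    options = [["warm", "cool"]] * 3  # two candidate clips per shot position
    print(uct_search(options, continuity))
```

Each iteration scores one full sequence, so in a real pipeline `iterations` is the knob that trades compute for search quality.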

The decision framework

Is the task about improving a single artifact? → DRJ. Scripts, dialogue, character descriptions, scene descriptions. The artifact gets better through critique.

Is the task about choosing between fundamentally different options? → DJV. Camera setups, creative directions, transition styles. You need diversity before selection.

Is the task straightforward with low creative stakes? → Recursive conditioning or hierarchical CoT. Scene-to-scene continuity, simple dialogue sequences. Don't pay for debate when the answer is obvious.

Are your failure modes diverse and hard to diagnose? → Add self-routing. Let a diagnostic step classify the failure before dispatching a fix.

Is the creative direction itself uncertain? → MAB. When you don't know what you're making, explore before you commit.

Is the sequence long enough that local quality doesn't guarantee global quality? → MCTS. When shot interactions matter more than individual shot quality.
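The questions above can be collapsed into a small lookup. This is only a sketch: the `Stage` fields and `pick_pattern` are hypothetical names, and the precedence order when several answers are "yes" is an assumption of this example, not something the papers specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """Answers to the framework questions for one pipeline stage."""
    refines_single_artifact: bool = False
    chooses_between_options: bool = False
    direction_uncertain: bool = False
    long_sequence: bool = False
    diverse_failures: bool = False


def pick_pattern(stage: Stage) -> List[str]:
    """Return a base pattern, plus self-routing when failure modes are diverse."""
    # Assumed precedence: most expensive justification is checked first.
    if stage.direction_uncertain:
        base = "MAB"
    elif stage.long_sequence:
        base = "MCTS"
    elif stage.chooses_between_options:
        base = "DJV"
    elif stage.refines_single_artifact:
        base = "DRJ"
    else:
        base = "recursive conditioning or hierarchical CoT"  # low-stakes default
    return [base] + (["self-routing"] if stage.diverse_failures else [])


if __name__ == "__main__":
    print(pick_pattern(Stage(refines_single_artifact=True)))   # e.g. a script stage
    print(pick_pattern(Stage(chooses_between_options=True)))   # e.g. camera planning
    print(pick_pattern(Stage(diverse_failures=True)))          # cheap base plus routing
```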

In practice, most pipelines should use hierarchical CoT as the default (cheap, structured, works for most stages), escalate to DRJ for the script and blocking stages (the artifacts worth refining), and use DJV only for camera planning (where the diversity and validation both pay off). GenMAC's self-routing can layer on top of any of these. MAB and MCTS are for high-budget pipelines where the cost of exploration is justified by the cost of the final output.

The token math is the deciding factor. If your pipeline generates a 12-shot video and each shot takes 4 API calls under DJV, that's 48 calls. Under recursive conditioning, at one call per shot, it's 12. The quality difference is real, as Mind-of-Director's ablation shows, but so is the cost difference. Know your budget, pick your pattern, and don't over-engineer stages that don't need it.
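The arithmetic is simple enough to keep in a helper next to your pipeline config. In this sketch the multipliers are the rough figures quoted earlier in this article, while `api_calls`, `estimated_tokens`, and the 1,000-token baseline per shot are hypothetical values chosen for illustration.

```python
# Approximate token cost relative to a single-pass baseline (figures from this article).
RELATIVE_TOKEN_COST = {
    "recursive conditioning": 1.2,
    "hierarchical CoT": 1.5,
    "DRJ": 3.0,
    "DJV": 4.0,
}


def api_calls(shots: int, calls_per_shot: int) -> int:
    """Total API calls for a video with the given number of shots."""
    return shots * calls_per_shot


def estimated_tokens(pattern: str, shots: int, baseline_tokens_per_shot: int) -> float:
    """Rough token budget: baseline per shot, scaled by the pattern's multiplier."""
    return RELATIVE_TOKEN_COST[pattern] * shots * baseline_tokens_per_shot


if __name__ == "__main__":
    print("DJV calls:", api_calls(shots=12, calls_per_shot=4))
    print("Recursive conditioning calls:", api_calls(shots=12, calls_per_shot=1))
    for pattern in RELATIVE_TOKEN_COST:
        # 1,000 tokens per shot is an assumed baseline, not a measured figure.
        print(pattern, round(estimated_tokens(pattern, 12, 1000)))
```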
