Creative

AI Content Pipeline

A production multimodal pipeline that takes a long-form manuscript and produces finished narrated video. Scene segmentation, character extraction with reference-anchored visual identity, voice synthesis with consistency across hours of audio, automated audio-video synchronisation, and quality gates at every stage.

Sector: Creative production / publishing
Year: 2026
Stack: Python · OpenAI Image · ElevenLabs · FFmpeg · LangGraph
Type: Production pipeline

Turning a manuscript into a finished video should cost a manuscript, not a film crew.

The unit economics of video content used to be set by the cost of human labour at every stage: an illustrator, a voice actor, a director, an editor, a sound engineer. AI tools have collapsed the cost of each stage individually. The hard part is making them work together at length without the seams showing.

This pipeline takes a manuscript of arbitrary length and produces narrated video with visually consistent characters and a single coherent narrator voice. Hours of finished content, not a single clip. The brief was end-to-end automation with quality high enough that the output is the product, not a draft.

Every stage is a different model, and the consistency happens between them.

The text-to-shot decomposition is an LLM task. Character description and identity anchoring is a multimodal task. Image generation is a diffusion task. Voice synthesis is a TTS task. Audio-video alignment is a deterministic task on top of all of them.

No single model does this. The engineering is in stitching them together in a way that preserves identity across modalities. A character described in chapter one has to look the same in chapter twelve and sound the same throughout. Drift is the enemy.

How it was built.

[Figure: pipeline schematic. Manuscript flows through scene decomposition, character anchoring, parallel image and voice generation, then audio-video synchronisation.]
  1. Scene segmentation. An LLM decomposes the manuscript into scenes, then into shots. Each shot has a setting, a cast list (which characters are present), a narrative function, and a target duration. The decomposition is the spine of the pipeline; every downstream stage references shot IDs.
  2. Character extraction and identity anchoring. A first pass enumerates every character in the manuscript and writes a canonical visual description. A second pass generates the canonical reference image for each. From that point forward, every image-generation call for a shot receives the character's reference image as conditioning. Identity drift across hundreds of generations falls to near zero.
  3. Image generation per shot. Each shot prompt is composed from setting plus character references plus narrative beat. Generations are evaluated against the shot's intent by a vision-language model before being accepted. Failed shots are retried with adjusted prompts, not silently shipped.
  4. Voice synthesis with prosody control. Narration is generated through a TTS provider with a stable voice clone. Prosody markers are inserted by an LLM pre-pass based on narrative beat (calm, tense, climactic). The narrator sounds like one human across hours of audio.
  5. Audio-video synchronisation. Audio durations come back from TTS, and FFmpeg stretches or compresses each shot's video to match them, never the other way around. Audio is sacred; video bends to it. Time-stretching the narration to fit the picture produces robotic pacing.
  6. Quality gates at every stage. Each stage emits structured artefacts that the next stage consumes. A failure at any stage halts the run and surfaces the offending artefact for human review. No silent failures, no "looks fine" automation.
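
The synchronisation and gating steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: `ShotTiming`, `GateFailure`, `setpts_factor`, `ffmpeg_retime_args` and the `max_stretch` limit are all hypothetical names and values. The only real interface assumed is FFmpeg's `setpts` video filter, which rescales presentation timestamps so that a factor above 1 lengthens the clip and a factor below 1 shortens it.

```python
from dataclasses import dataclass


class GateFailure(Exception):
    """Halts the run; the message names the offending shot."""


@dataclass(frozen=True)
class ShotTiming:
    shot_id: str
    video_seconds: float  # duration of the rendered clip
    audio_seconds: float  # duration reported for the narration audio


def setpts_factor(timing: ShotTiming, max_stretch: float = 2.0) -> float:
    """Timestamp scale that makes the clip last exactly as long as the audio."""
    if timing.video_seconds <= 0 or timing.audio_seconds <= 0:
        raise GateFailure(f"{timing.shot_id}: non-positive duration")
    factor = timing.audio_seconds / timing.video_seconds
    # Assumed gate: an extreme retime is treated as a failure for human
    # review instead of being shipped. The limit of 2.0 is illustrative.
    if factor > max_stretch or factor < 1 / max_stretch:
        raise GateFailure(f"{timing.shot_id}: retime factor {factor:.2f} out of range")
    return factor


def ffmpeg_retime_args(timing: ShotTiming, src: str, dst: str) -> list[str]:
    """Build an FFmpeg command that bends the video to the audio length."""
    factor = setpts_factor(timing)
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-filter:v", f"setpts={factor:.6f}*PTS",  # video only; audio is never touched
        "-an",  # drop any audio track; narration is muxed in a later step
        dst,
    ]
```

For a 4-second clip with 6 seconds of narration, `setpts_factor` returns 1.5 and the command slows the video accordingly. The argument list could be passed to `subprocess.run(args, check=True)` so that a non-zero FFmpeg exit also halts the run.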

Where the seams threaten to show.

Character drift across diffusion models. A character generated in twenty different shots will tend to drift — different hair, different age, different ethnicity. Reference-anchored conditioning is the only reliable fix. Text prompting alone fails at scale.
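
One way to make reference anchoring hard to bypass is to refuse to build an image request for any shot whose cast lacks a canonical reference. The sketch below assumes a hypothetical `CharacterRef` record and a hypothetical `build_image_request` helper that returns a plain dictionary; it deliberately does not call any image API, and the request shape is an assumption, not a provider's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterRef:
    name: str
    description: str      # canonical visual description from the first pass
    reference_image: str  # path to the canonical reference image


def build_image_request(
    setting: str,
    beat: str,
    cast: list[str],
    registry: dict[str, CharacterRef],
) -> dict:
    """Compose a shot request; every cast member must carry a reference image."""
    missing = [name for name in cast if name not in registry]
    if missing:
        # Falling back to text-only prompting is exactly what invites drift,
        # so a missing reference is an error, not a warning.
        raise ValueError(f"no reference image for: {', '.join(missing)}")
    refs = [registry[name] for name in cast]
    prompt_parts = [setting, beat] + [f"{r.name}: {r.description}" for r in refs]
    return {
        "prompt": ". ".join(prompt_parts),
        "reference_images": [r.reference_image for r in refs],
    }
```

The returned dictionary would then be handed to whatever conditioning mechanism the chosen image model exposes.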

Voice consistency under variable text. A TTS voice clone holds well across paragraphs of similar register and breaks at register shifts (dialogue, internal monologue, technical asides). The pipeline tags every passage with register before TTS and switches voice configurations accordingly.
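
A minimal sketch of register-driven voice selection follows. It assumes passages arrive already tagged; how the tagging is done is not shown. The `VoiceConfig` fields, the register names, the numeric values and the `narrator-clone` identifier are placeholders for illustration and do not correspond to a specific provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceConfig:
    voice_id: str     # same cloned voice everywhere; only the settings change
    stability: float  # illustrative knob: higher means flatter, steadier delivery
    style: float      # illustrative knob: higher means more expressive delivery


NARRATOR = "narrator-clone"  # hypothetical voice identifier

VOICE_BY_REGISTER = {
    "narration": VoiceConfig(NARRATOR, stability=0.8, style=0.2),
    "dialogue": VoiceConfig(NARRATOR, stability=0.5, style=0.6),
    "internal_monologue": VoiceConfig(NARRATOR, stability=0.7, style=0.4),
    "technical_aside": VoiceConfig(NARRATOR, stability=0.9, style=0.1),
}


def plan_tts(passages: list[tuple[str, str]]) -> list[tuple[VoiceConfig, str]]:
    """Turn (register, text) passages into TTS jobs, one per run of equal register."""
    jobs: list[tuple[VoiceConfig, str]] = []
    previous_register = None
    for register, text in passages:
        if register not in VOICE_BY_REGISTER:
            # An untagged or unknown register is surfaced, not guessed.
            raise ValueError(f"unknown register: {register!r}")
        if register == previous_register:
            config, merged = jobs[-1]
            jobs[-1] = (config, f"{merged} {text}")
        else:
            jobs.append((VOICE_BY_REGISTER[register], text))
        previous_register = register
    return jobs
```

Merging consecutive passages of the same register is a design choice made for this sketch: it keeps each synthesis call inside a single register, which is where the source says the clone holds up best.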

Cost control. A manuscript can produce hundreds of shots, each with an image generation and several minutes of TTS. Naive pipelines burn money. Aggressive caching of accepted artefacts, plus tiered model routing (cheap model for the first attempt, frontier model only on retry), keeps unit cost predictable.
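
Caching and tiered routing can be combined in one small function. The sketch below is an assumption-laden illustration: `cache_key` and `generate_routed` are hypothetical names, the cache is an in-memory dictionary standing in for persistent storage, and the generators and the acceptance check are plain callables, not real model clients. For brevity it reuses the same prompt on each tier and omits prompt adjustment between attempts.

```python
import hashlib
import json
from typing import Callable

Generator = Callable[[str], bytes]  # prompt -> artefact bytes
Accept = Callable[[bytes], bool]    # quality check on an artefact


def cache_key(stage: str, payload: dict) -> str:
    """Content-addressed key: identical inputs always map to the same artefact."""
    blob = json.dumps({"stage": stage, "payload": payload}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def generate_routed(
    prompt: str,
    tiers: list[tuple[str, Generator]],  # ordered cheapest first
    accept: Accept,
    cache: dict[str, bytes],
) -> tuple[str, bytes]:
    """Return (source, artefact), trying the cache, then each tier in order."""
    key = cache_key("image", {"prompt": prompt})
    if key in cache:
        return "cache", cache[key]
    for tier_name, generate in tiers:
        artefact = generate(prompt)
        if accept(artefact):
            cache[key] = artefact  # only accepted artefacts are ever cached
            return tier_name, artefact
    raise RuntimeError("all tiers rejected the shot; surface it for human review")
```

Because only accepted artefacts are cached, a rerun after a halted job pays nothing for shots that already passed and pays again only for the shots that failed.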

Continuity tracking. Time of day, location, what each character is wearing, what props are visible — all of this has to stay consistent across shots that are generated in parallel. A shared continuity state machine threads through every shot prompt.
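
One possible way to reconcile shared continuity with parallel generation is to resolve the state sequentially first, which is cheap and deterministic, and hand each shot an immutable snapshot. The sketch below assumes that approach; `Continuity`, `advance`, `resolve`, `prompt_fragment` and the change keys (`add_props`, `remove_props`, and so on) are hypothetical, as is the character name used in the usage note.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Continuity:
    time_of_day: str
    location: str
    wardrobe: dict = field(default_factory=dict)  # character name -> outfit
    props: frozenset = frozenset()                # props currently visible


def advance(state: Continuity, change: dict) -> Continuity:
    """Apply one shot's changes and return a new state; the old one is untouched."""
    wardrobe = {**state.wardrobe, **change.get("wardrobe", {})}
    props = (state.props | set(change.get("add_props", ()))) - set(change.get("remove_props", ()))
    return replace(
        state,
        time_of_day=change.get("time_of_day", state.time_of_day),
        location=change.get("location", state.location),
        wardrobe=wardrobe,
        props=frozenset(props),
    )


def resolve(initial: Continuity, changes: list[dict]) -> list[Continuity]:
    """Walk the shots in story order and return one snapshot per shot."""
    snapshots = []
    state = initial
    for change in changes:
        state = advance(state, change)
        snapshots.append(state)
    return snapshots


def prompt_fragment(state: Continuity) -> str:
    """Render a snapshot as text to append to a shot prompt (sorted for stability)."""
    outfits = "; ".join(f"{name} wears {outfit}" for name, outfit in sorted(state.wardrobe.items()))
    props = ", ".join(sorted(state.props))
    return f"{state.time_of_day}, {state.location}. {outfits}. Visible props: {props}."
```

Starting from `Continuity("dawn", "harbour", {"Mara": "grey coat"}, frozenset({"lantern"}))`, a change that sets `time_of_day` to `"noon"` carries the location and wardrobe forward unchanged. Each snapshot can then be given to a parallel worker without any shared mutable state.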

What it produced.

Hours: of finished, consistent narrated video per manuscript
End-to-end: manuscript-to-video with no human in the per-shot loop
Anchored: character identity preserved across hundreds of generations

The pipeline is a working example of what AI-native content production looks like when the engineering is taken seriously. The same architecture transfers to corporate use cases — training video, internal communications, product explainers — where the unit economics of human-produced video have always been the blocker.
