Inspiration

We have all seen AI-generated short-form content on social media, but is it possible to make coherent long-form content with a single prompt?

This is what we wanted to explore for our project: can you go from a one-sentence idea to a fully rendered short film with consistent characters, coherent scene transitions, and real cinematic motion, all automated?

Existing text-to-video tools generate isolated clips with no narrative structure, inconsistent characters, and no user control over the visual direction. We set out to build a full film pipeline that solves all three.

What it does

FilmStitch is an end-to-end AI film pipeline. You type a film idea, and it produces an ~80-second short film with no manual editing or prompt engineering. The pipeline runs in three phases, each pausing for creative user input before continuing.

Phase 1 — Writing and reference generation:

  1. Writes a 10-segment script via Claude Sonnet, structured with narration, action, dialogue, and per-segment character tracking
  2. Generates detailed visual character descriptions via Claude — specific clothing, hair, facial features, and build that an image model can reproduce consistently
  3. Generates a unified environment description via Claude — lighting, materials, color palette, and atmosphere
  4. Produces 11 boundary keyframe descriptions, with character appearance details embedded directly into each FLUX prompt for consistency
  5. Renders character and environment reference images (3 variants each) on Modal using FLUX.1-dev
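The exact script schema isn't shown above; as a rough sketch, the per-segment structure the pipeline parses might look like this (field names are illustrative, not the actual FilmStitch schema):

```python
import json

# Hypothetical sketch of the structured script Claude might return.
# Field names are illustrative, not the real FilmStitch schema.
script_json = """
{
  "title": "The Last Lighthouse",
  "segments": [
    {
      "id": 1,
      "narration": "The keeper climbs the spiral stairs one last time.",
      "action": "An old man ascends a stone staircase, lantern in hand.",
      "dialogue": "",
      "characters": ["keeper"]
    }
  ]
}
"""

script = json.loads(script_json)

# Validate the structure before spending GPU time on it.
for seg in script["segments"]:
    assert {"narration", "action", "dialogue", "characters"} <= seg.keys()
```

Validating the parsed JSON up front is cheap insurance: a malformed script fails in milliseconds instead of after minutes of image generation.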

User checkpoint: choose preferred reference variants for each character and environment

Phase 2 — Keyframe image generation:

  1. Renders 33 keyframe images (11 × 3 variants) on Modal using FLUX.1-dev, with multi-reference CLIP conditioning — the character and environment refs relevant to each keyframe boundary are averaged into a single embedding and blended into FLUX's pooled text embedding
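A minimal numeric sketch of that averaging-and-blending step, using plain Python lists in place of real CLIP tensors — the 30% blend weight comes from the writeup; the dimensions and values are toy:

```python
def blend_embeddings(reference_embs, pooled_text_emb, alpha=0.30):
    """Average the reference embeddings, then blend the result into the
    pooled text embedding: (1 - alpha) * text + alpha * avg(refs)."""
    n = len(reference_embs)
    avg = [sum(vals) / n for vals in zip(*reference_embs)]
    return [(1 - alpha) * t + alpha * a for t, a in zip(pooled_text_emb, avg)]

# Toy 4-dim "embeddings": two character refs plus the environment ref.
char_a = [1.0, 0.0, 0.0, 0.0]
char_b = [0.0, 1.0, 0.0, 0.0]
env    = [0.0, 0.0, 1.0, 0.0]
text   = [0.0, 0.0, 0.0, 1.0]

blended = blend_embeddings([char_a, char_b, env], text)
```

The blended vector stays dominated by the text embedding (70%) while nudging it toward the shared reference identity, which is why the keyframes keep the prompt's composition but inherit the characters' look.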

User checkpoint: choose preferred keyframe variant for each of the 11 boundary frames

Phase 3 — Video and assembly:

  1. Interpolates 10 video segments in parallel using LTX-Video-0.9.5 on Modal, with first-frame and last-frame conditioning pinning the chosen keyframes at each end of the clip
  2. Assembles the final film with ffmpeg concat
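The assembly step can be sketched as follows; the list-file name, output name, and encoder flags are illustrative stand-ins, not the pipeline's actual settings:

```python
def build_concat_inputs(segment_paths, output="film.mp4"):
    """Return (list_file_contents, ffmpeg_command) for the concat demuxer.
    Re-encoding (rather than `-c copy`) keeps codecs and timestamps
    consistent across independently generated clips."""
    listing = "".join(f"file '{p}'\n" for p in segment_paths)
    cmd = ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "segments.txt",
           "-c:v", "libx264", "-pix_fmt", "yuv420p", output]
    return listing, cmd

listing, cmd = build_concat_inputs([f"segment_{i:02d}.mp4" for i in range(10)])
```

The contents of `listing` would be written to `segments.txt` before invoking the command.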

How we built it

  • Orchestration: Python async pipeline (pipeline.py) coordinating four separate Claude API calls and three rounds of Modal GPU jobs across three distinct phases
  • Script generation: Claude Sonnet produces a structured JSON script with per-segment character tracking and environment tagging
  • Character + environment descriptions: Two additional Claude calls extract detailed, camera-ready visual descriptions before any image generation begins — this is what makes FLUX prompts specific enough to produce consistent character appearances without fine-tuning
  • Keyframe descriptions: A fourth Claude call uses those descriptions directly, embedding specific clothing and appearance details into every keyframe prompt rather than relying on the model to infer them
  • Image generation: FLUX.1-dev on Modal (A100-40GB). Multi-reference CLIP conditioning: for each keyframe, we encode all character refs present at that boundary plus the environment ref, average their embeddings, and blend the result (30%) into FLUX's pooled text embedding. The full cinematic prompt routes through FLUX's T5 encoder, which has no token length limit
  • Video generation: LTX-Video-0.9.5 on Modal (A100-40GB) via LTXConditionPipeline. Dual-frame conditioning pins first and last frames using LTXVideoCondition objects; intermediate motion is guided by the segment's narration and action text. 10 segments dispatch concurrently with concurrency_limit=10, running on separate containers simultaneously
  • Server: FastAPI with WebSocket for real-time progress streaming across all three phases. Three new endpoints handle the phase boundaries: /pipeline/start, /pipeline/{id}/gen-keyframes, and /pipeline/{id}/continue
  • Frontend: React SPA with a step-by-step wizard — idea input → live pipeline log → reference picker → keyframe timeline with per-boundary variant selection → parallel clip progress grid → final download
  • Assembly: ffmpeg concat demuxer for gapless segment joining, with re-encoding for codec consistency across clips
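The parallel-dispatch pattern above can be sketched with asyncio; `generate_segment` is a stand-in for the remote Modal LTX-Video call, not the real pipeline.py API:

```python
import asyncio

async def generate_segment(index: int) -> str:
    """Stand-in for a remote Modal LTX-Video job; the real version would
    await GPU inference keyed by the segment's keyframes and text."""
    await asyncio.sleep(0.01)  # simulate remote inference latency
    return f"segment_{index:02d}.mp4"

async def run_phase_three(num_segments: int = 10) -> list:
    # Dispatch every segment at once; Modal's concurrency_limit then caps
    # how many containers actually run in parallel on the server side.
    return await asyncio.gather(
        *(generate_segment(i) for i in range(num_segments))
    )

clips = asyncio.run(run_phase_three())
```

Because `gather` preserves argument order, the returned clip paths are already in narrative order and can be fed straight to the concat step.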

Challenges we faced

CLIP's 77-token hard limit in the IP-Adapter path. FLUX has two text encoders: T5-XXL for the main prompt (handles ~512 tokens, drives image quality) and CLIP for pooled_prompt_embeds (hard 77-token ceiling). Our character conditioning code called encode_prompt() with the full keyframe prompt to get the CLIP pooled embedding for blending. Prompts longer than 77 tokens were silently truncated, producing a conditioning signal derived from a mangled version of the prompt. The fix passes a 50-word summary to encode_prompt() while the full prompt still travels through T5 via the main pipeline call — two separate paths, independently controlled.
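The two-path fix reduces to keeping the prompts separate and guarding the short one; the word-count check below is a crude proxy for CLIP's actual tokenizer, used here only for illustration:

```python
def split_prompt_paths(full_prompt: str, clip_summary: str, max_words: int = 50):
    """Return (t5_prompt, clip_prompt). The full cinematic prompt feeds T5
    (no practical length limit); a short summary feeds the CLIP path so the
    pooled embedding is never derived from a silently truncated prompt.
    Word count stands in for real CLIP tokenization (77-token cap)."""
    if len(clip_summary.split()) > max_words:
        raise ValueError("summary too long for the CLIP pooled-embedding path")
    return full_prompt, clip_summary

full = "A wide cinematic shot of an astronaut crossing a moonlit plain, ..."
short = "astronaut crossing a moonlit plain"
t5_prompt, clip_prompt = split_prompt_paths(full, short)
```

The key property is that the two return values are independently controlled: lengthening the cinematic prompt can never degrade the conditioning signal on the CLIP path.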

Cold start latency. First LTX-Video container spin-up downloads ~10 GB of weights and takes 3–5 minutes. We mitigated this with min_containers=1 on both Modal classes.

Character consistency across keyframes. Pure text-to-image generation produces wildly different character appearances across 11 separate generations. We tried three approaches: text-only (poor), single character reference with CLIP blend (significantly better), and averaged multi-reference conditioning including environment (current, best). True per-subject IP-Adapter attention injection would require deeper pipeline surgery and is on the roadmap.

Why Modal?

The pipeline needs burst GPU compute — 33 FLUX image generations followed by 10 LTX-Video generations, then nothing until the next film. Traditional GPU provisioning (reserved instances) would leave expensive hardware idle almost all of the time. Modal provides:

  • Pay-per-second billing — only charged during active inference
  • Auto-scaling — spins up containers on demand, scales to zero when idle
  • Persistent volumes — model weights cached across invocations (flux-weights, ltx-weights volumes)
  • Horizontal scaling — max_containers=10 for LTX-Video means up to 10 parallel A100s during video generation

What we learned

  • The N+1 keyframe model (11 boundaries for 10 segments) is a clean abstraction for narrative video — shared transition frames naturally enforce visual continuity without any explicit stitching logic
  • Generating character descriptions as a dedicated Claude step before image generation — rather than relying on the image model to infer character details from a brief name — dramatically improves consistency across long generations
  • Separating the pipeline into three explicit phases with user checkpoints between them is the right UX model for generative workflows: it catches bad reference or keyframe choices before the expensive video generation step, saving significant time and cost
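The N+1 boundary model reduces to a simple index mapping — segment i is bounded by keyframes i and i+1, so adjacent segments share a frame by construction:

```python
def segment_boundaries(num_segments: int):
    """Segment i starts at keyframe i and ends at keyframe i + 1, so
    num_segments segments need num_segments + 1 keyframes, and every
    interior keyframe is shared by two adjacent segments."""
    return [(i, i + 1) for i in range(num_segments)]

pairs = segment_boundaries(10)

# The last frame of each segment is the first frame of the next,
# which is what enforces continuity without explicit stitching logic.
assert all(pairs[i][1] == pairs[i + 1][0] for i in range(len(pairs) - 1))
```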

Example

Check out this film, generated from the single prompt "a stranded astronaut on the moon walks around and discovers nature": https://youtu.be/FUdKMRBkudo

What's next

  • Better character consistency: replace averaged CLIP blending with proper per-subject IP-Adapter attention injection
  • Environment conditioning in video: pass the environment reference image to LTX-Video as an additional style conditioning signal to maintain visual consistency in background elements across clips
  • Audio pipeline: narration from segment dialogue + ambient sound generation synced to clip timing
  • Inpainting checkpoint: allow users to edit individual keyframe regions before video generation begins, fixing issues like wrong character position or background elements
  • Longer films: hierarchical scene planning (acts → scenes → segments) to support 5–10 minute narratives with consistent world-building
  • Higher resolution output: decoding optimizations to push LTX-Video to 1024×576 without OOM on A100-40GB
