About the Project — OpenScenes

Inspiration

The inspiration for OpenScenes came from a recurring frustration we observed while building and presenting complex technical ideas. Turning a raw concept, research document, or dataset into a clear, visually compelling video presentation is still a highly manual process. It requires switching between multiple tools, learning design conventions, and spending more time on formatting than on thinking.

We asked a simple question:

Can we treat presentations as a programmable output of intent?

If large language models can reason, plan, and structure information, why can’t they also direct a presentation—deciding narrative flow, slide composition, and visual hierarchy—while still allowing humans to intervene at any level?

OpenScenes was born from this idea: to build an AI-native presentation engine where users express what they want, not how to design it.


What We Learned

Building OpenScenes taught us several important lessons:

  1. AI needs structure, not just prompts
    High-quality outputs came from treating AI as part of a system, not a single call. Decomposing the problem into intent extraction, narrative planning, and slide generation dramatically improved reliability.

  2. Natural language is a powerful editing interface
    Users think in goals (“make this clearer”, “compare these two ideas”), not in design primitives. Supporting edits at global, slide, and element levels aligned the system with how people naturally communicate.

  3. Video rendering is an infrastructure problem
    Rendering videos is compute-heavy and slow. Treating rendering as an asynchronous, distributed workload was essential to keep the user experience responsive.

  4. Determinism matters in creative systems
To make AI-generated content editable, we needed deterministic, structured representations (JSON-based slide trees) instead of free-form text or images (see the sketch below).
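
To make this concrete, here is a heavily simplified TypeScript sketch of the kind of slide tree we mean; the real schema has more node types and layout metadata, and the names here are illustrative:

```ts
// A heavily simplified slide tree: every slide is a tree of typed elements,
// so edits can be expressed as deterministic transformations on nodes.
type ElementNode =
  | { kind: "text"; id: string; content: string; emphasis?: "normal" | "strong" }
  | { kind: "image"; id: string; src: string; alt: string }
  | { kind: "group"; id: string; layout: "row" | "column"; children: ElementNode[] };

interface Slide {
  id: string;
  layout: "title" | "two-column" | "full-bleed";
  elements: ElementNode[];
}

interface SlideDeck {
  theme: { palette: string; typography: string };
  slides: Slide[];
}
```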


How We Built It

OpenScenes is built as a full-stack, event-driven system designed for scalability and iteration.

Architecture Overview

At a high level, the system separates interaction, intelligence, and execution:

User Intent
→ AI Planning
→ Structured Slides
→ Video Rendering
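
Each arrow is a typed hand-off rather than free-form text. A rough sketch of those contracts, with illustrative names rather than our exact interfaces:

```ts
// Illustrative hand-off types between the pipeline stages.
interface UserIntent {
  prompt: string;      // the user's natural-language request
  fileKeys: string[];  // uploaded documents in object storage
}

interface NarrativePlan {
  title: string;
  beats: { heading: string; goal: string }[]; // the planned narrative arc
}

interface GeneratedDeck {
  deckId: string;
  slides: unknown[];   // the structured slide tree described above
}

interface RenderRequest {
  deckId: string;
  outputKey: string;   // where the rendered video will be written
}
```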

AI Pipeline

We implemented a multi-agent pipeline using Google Gemini models:

  • Summarizer Agent extracts intent and key points from prompts and uploaded files.
  • Director Agent plans the narrative arc and visual style of the presentation.
  • Slide Generator Agent produces a structured JSON representation of slides, layouts, and elements.

Each agent is isolated, stateless, and replaceable, allowing us to iterate quickly on behavior and quality.
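
As a condensed sketch of how those agents chain together, assuming the official @google/generative-ai Node SDK (the prompts and model settings shown here are placeholders for the real, much longer ones):

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY ?? "");

// Shared helper: every agent is just instructions + input -> text.
// Real agents have per-agent prompts, model settings, and output validation.
async function callGemini(instructions: string, input: string): Promise<string> {
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
  const result = await model.generateContent(`${instructions}\n\n${input}`);
  return result.response.text();
}

// Summarizer Agent: raw prompt + extracted file text -> intent and key points.
const summarize = (raw: string) =>
  callGemini("Extract the user's intent and the key points to present.", raw);

// Director Agent: summary -> narrative arc and visual style, as JSON.
const direct = (summary: string) =>
  callGemini("Plan a slide-by-slide narrative arc and a visual style. Answer in JSON.", summary);

// Slide Generator Agent: plan -> structured slide tree, as JSON.
const generateSlides = (plan: string) =>
  callGemini("Produce slides as JSON conforming to the slide-tree schema.", plan);

// Each step depends only on its input, so agents stay stateless and swappable.
export async function runPipeline(raw: string): Promise<string> {
  const summary = await summarize(raw);
  const plan = await direct(summary);
  return generateSlides(plan);
}
```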

Editing System

We designed a tiered editing model that allows natural language edits at three granularities:

  • Global (theme, typography, color)
  • Slide-level (layout, structure)
  • Element-level (text, emphasis, sizing)

These edits are compiled into deterministic transformations on the slide tree, ensuring predictability and reversibility.
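
To illustrate what "compiled into deterministic transformations" means in practice, here is a minimal sketch; the operation names and deck shape are simplified stand-ins for our actual types:

```ts
// The AI translates a natural-language request into a typed edit operation;
// applying it is deterministic, so the same edit always yields the same tree.
type EditOp =
  | { scope: "global"; op: "setTheme"; palette: string }
  | { scope: "slide"; op: "setLayout"; slideId: string; layout: string }
  | { scope: "element"; op: "setText"; slideId: string; elementId: string; content: string };

interface Deck {
  theme: { palette: string };
  slides: { id: string; layout: string; elements: { id: string; content?: string }[] }[];
}

function applyEdit(deck: Deck, edit: EditOp): Deck {
  switch (edit.scope) {
    case "global":
      return { ...deck, theme: { ...deck.theme, palette: edit.palette } };
    case "slide":
      return {
        ...deck,
        slides: deck.slides.map((s) =>
          s.id === edit.slideId ? { ...s, layout: edit.layout } : s
        ),
      };
    case "element":
      return {
        ...deck,
        slides: deck.slides.map((s) =>
          s.id !== edit.slideId
            ? s
            : {
                ...s,
                elements: s.elements.map((e) =>
                  e.id === edit.elementId ? { ...e, content: edit.content } : e
                ),
              }
        ),
      };
  }
}
```

Because applyEdit is a pure function, the same edit on the same tree always produces the same result, and undo is simply a matter of keeping the previous version of the tree.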

Rendering Engine

For video export, we built a distributed rendering system using Remotion:

  • Render jobs are queued via RabbitMQ.
  • Stateless Node.js workers consume jobs and render videos.
  • Assets and outputs are stored in S3-compatible object storage.
  • Redis is used for real-time job status tracking and cancellation.

This design allows horizontal scaling and non-blocking user interactions.
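
A trimmed-down sketch of one render worker, assuming amqplib for RabbitMQ and ioredis for status tracking; renderDeckToVideo and uploadToStorage stand in for the Remotion rendering and storage code:

```ts
import amqp from "amqplib";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Stand-ins for the Remotion render call and the object-storage upload.
declare function renderDeckToVideo(deck: unknown): Promise<string>; // returns a local file path
declare function uploadToStorage(localPath: string, key: string): Promise<void>;

async function startWorker(): Promise<void> {
  const conn = await amqp.connect(process.env.RABBITMQ_URL ?? "amqp://localhost");
  const channel = await conn.createChannel();
  await channel.assertQueue("render-jobs", { durable: true });
  channel.prefetch(1); // one render per worker at a time

  await channel.consume("render-jobs", async (msg) => {
    if (!msg) return;
    const job = JSON.parse(msg.content.toString());

    // Honour cancellations recorded in Redis before doing any heavy work.
    if ((await redis.get(`job:${job.id}:status`)) === "cancelled") {
      channel.ack(msg);
      return;
    }

    try {
      await redis.set(`job:${job.id}:status`, "rendering");
      const file = await renderDeckToVideo(job.deck);
      await uploadToStorage(file, job.outputKey);
      await redis.set(`job:${job.id}:status`, "done");
    } catch {
      await redis.set(`job:${job.id}:status`, "failed");
    } finally {
      channel.ack(msg);
    }
  });
}

startWorker().catch(console.error);
```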


Challenges We Faced

Balancing Creativity and Control

One of the hardest problems was allowing AI to be creative without making the output uneditable. We solved this by enforcing strict schemas and layout constraints while giving the AI freedom within those boundaries.
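
Concretely, every model response is parsed against a schema before it reaches the editor. A minimal sketch of that idea, using zod here purely for illustration:

```ts
import { z } from "zod";

// Pared-down version of the slide schema; the real one also encodes layout
// constraints such as allowed nesting, element counts, and size bounds.
const ElementSchema = z.object({
  kind: z.enum(["text", "image"]),
  id: z.string(),
  content: z.string().max(500), // text body or asset reference, kept within layout limits
});

const SlideSchema = z.object({
  id: z.string(),
  layout: z.enum(["title", "two-column", "full-bleed"]),
  elements: z.array(ElementSchema).max(8),
});

const DeckSchema = z.object({
  theme: z.object({ palette: z.string(), typography: z.string() }),
  slides: z.array(SlideSchema).min(1),
});

// Any model output that escapes these boundaries is rejected (and retried)
// before it ever reaches the editor.
export function parseDeck(modelOutput: string) {
  return DeckSchema.parse(JSON.parse(modelOutput));
}
```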

Latency and User Experience

AI generation and video rendering both introduce latency. Keeping the editor interactive required careful decoupling of synchronous UI actions from asynchronous background jobs.
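
In practice, the API never renders inline: it validates, enqueues, and immediately returns a job id, and the editor follows progress via Redis. A rough sketch assuming an Express-style HTTP API; route names and payload shapes are illustrative:

```ts
import { randomUUID } from "node:crypto";
import express from "express";
import amqp from "amqplib";
import Redis from "ioredis";

const app = express();
app.use(express.json());
const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function main(): Promise<void> {
  const conn = await amqp.connect(process.env.RABBITMQ_URL ?? "amqp://localhost");
  const channel = await conn.createChannel();
  await channel.assertQueue("render-jobs", { durable: true });

  // Kick off a render: respond immediately and let workers do the slow part.
  app.post("/api/decks/:deckId/render", async (req, res) => {
    const jobId = randomUUID();
    await redis.set(`job:${jobId}:status`, "queued");
    channel.sendToQueue(
      "render-jobs",
      Buffer.from(JSON.stringify({ id: jobId, deckId: req.params.deckId })),
      { persistent: true }
    );
    res.status(202).json({ jobId });
  });

  // The editor polls this (or listens on a socket) while the job runs elsewhere.
  app.get("/api/jobs/:jobId", async (req, res) => {
    const status = await redis.get(`job:${req.params.jobId}:status`);
    res.json({ status: status ?? "unknown" });
  });

  app.listen(3000);
}

main().catch(console.error);
```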

Consistency Across Edits

Natural language edits can be ambiguous. Ensuring that repeated edits behaved consistently required building an internal representation that the AI could reason about and modify deterministically.

Debugging Distributed Systems

Running AI workers, render workers, queues, and storage locally introduced operational complexity. Docker Compose was critical in making the system reproducible and hackathon-ready.


Outcome

By the end of the project, OpenScenes successfully demonstrated:

  • End-to-end AI-driven presentation generation
  • Natural language editing at multiple levels
  • Scalable, distributed video rendering
  • A real-time, interactive editor

More importantly, it proved that presentations can be treated as programmable artifacts, not static files.

OpenScenes represents our exploration into what creative tools look like when they are built AI-first, with systems thinking at their core.

Built With

  • Google Gemini
  • Node.js
  • Remotion
  • RabbitMQ
  • Redis
  • S3-compatible object storage
  • Docker Compose