About the Project — OpenScenes
Inspiration
The inspiration for OpenScenes came from a recurring frustration we observed while building and presenting complex technical ideas. Turning a raw concept, research document, or dataset into a clear, visually compelling video presentation is still a highly manual process. It requires switching between multiple tools, learning design conventions, and spending more time on formatting than on thinking.
We asked a simple question:
Can we treat presentations as a programmable output of intent?
If large language models can reason, plan, and structure information, why can’t they also direct a presentation—deciding narrative flow, slide composition, and visual hierarchy—while still allowing humans to intervene at any level?
OpenScenes was born from this idea: to build an AI-native presentation engine where users express what they want, not how to design it.
What We Learned
Building OpenScenes taught us several important lessons:
AI needs structure, not just prompts
High-quality outputs came from treating AI as part of a system, not a single call. Decomposing the problem into intent extraction, narrative planning, and slide generation dramatically improved reliability.
Natural language is a powerful editing interface
Users think in goals (“make this clearer”, “compare these two ideas”), not in design primitives. Supporting edits at global, slide, and element levels aligned the system with how people naturally communicate.
Video rendering is an infrastructure problem
Rendering videos is compute-heavy and slow. Treating rendering as an asynchronous, distributed workload was essential to keep the user experience responsive.
Determinism matters in creative systems
To make AI-generated content editable, we needed deterministic, structured representations (JSON-based slide trees) instead of free-form text or images.
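To make that concrete, here is a minimal TypeScript sketch of the kind of slide tree we mean; the type and field names are simplified for this write-up rather than our exact production schema.

```typescript
// Minimal sketch of a JSON-serializable slide tree.
// Type and field names are illustrative, not the production schema.
type ElementKind = "heading" | "body" | "image" | "code";

interface SlideElement {
  id: string;                      // stable id so edits can target elements deterministically
  kind: ElementKind;
  content: string;                 // text or an asset reference
  emphasis?: "normal" | "strong";
}

interface Slide {
  id: string;
  layout: "title" | "two-column" | "full-bleed";
  elements: SlideElement[];
}

interface SlideDeck {
  theme: { palette: string; typography: string };
  slides: Slide[];
}
```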
How We Built It
OpenScenes is built as a full-stack, event-driven system designed for scalability and iteration.
Architecture Overview
At a high level, the system separates interaction, intelligence, and execution:
User Intent → AI Planning → Structured Slides → Video Rendering
AI Pipeline
We implemented a multi-agent pipeline using Google Gemini models:
- Summarizer Agent extracts intent and key points from prompts and uploaded files.
- Director Agent plans the narrative arc and visual style of the presentation.
- Slide Generator Agent produces a structured JSON representation of slides, layouts, and elements.
Each agent is isolated, stateless, and replaceable, allowing us to iterate quickly on behavior and quality.
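As a rough sketch of how these stateless agents chain together, the TypeScript below shows the shape of the pipeline; the `Agent` interface, the intermediate types, and the `generate` helper are illustrative assumptions, with the actual Gemini call abstracted away.

```typescript
// Illustrative sketch of the agent pipeline; interface names, intermediate
// types, and the generate() helper are assumptions, not the production code.
interface Agent<I, O> {
  name: string;
  run(input: I): Promise<O>;
}

interface Intent { goal: string; keyPoints: string[] }
interface Plan { narrative: string[]; style: string }
interface SlideDeck { slides: unknown[] }

// Stands in for a round trip to the Gemini API; not wired up in this sketch.
async function generate(task: string, input: string): Promise<string> {
  throw new Error(`model call not implemented in this sketch: ${task} (${input.length} chars)`);
}

const summarizer: Agent<string, Intent> = {
  name: "summarizer",
  run: async (prompt) => JSON.parse(await generate("extract intent and key points", prompt)),
};

const director: Agent<Intent, Plan> = {
  name: "director",
  run: async (intent) => JSON.parse(await generate("plan narrative and style", JSON.stringify(intent))),
};

const slideGenerator: Agent<Plan, SlideDeck> = {
  name: "slide-generator",
  run: async (plan) => JSON.parse(await generate("generate structured slides", JSON.stringify(plan))),
};

// Each stage is stateless, so any agent can be swapped out independently.
async function buildDeck(userPrompt: string): Promise<SlideDeck> {
  const intent = await summarizer.run(userPrompt);
  const plan = await director.run(intent);
  return slideGenerator.run(plan);
}
```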
Editing System
We designed a tiered editing model that allows natural language edits at three granularities:
- Global (theme, typography, color)
- Slide-level (layout, structure)
- Element-level (text, emphasis, sizing)
These edits are compiled into deterministic transformations on the slide tree, ensuring predictability and reversibility.
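A simplified sketch of what those compiled transformations look like in TypeScript (the operation shapes and the `applyEdit` helper are illustrative, not our exact implementation):

```typescript
// Illustrative sketch: a natural-language edit is compiled into a typed
// operation, then applied as a pure function over the slide tree.
interface Element { id: string; text: string; emphasis: "normal" | "strong" }
interface Slide { id: string; layout: string; elements: Element[] }
interface Deck { theme: string; typography: string; slides: Slide[] }

type EditOp =
  | { scope: "global"; set: Partial<Pick<Deck, "theme" | "typography">> }
  | { scope: "slide"; slideId: string; set: Partial<Pick<Slide, "layout">> }
  | { scope: "element"; slideId: string; elementId: string; set: Partial<Pick<Element, "text" | "emphasis">> };

// Same deck + same op always yields the same result, which is what makes
// edits predictable (and reversible, by recording the inverse operation).
function applyEdit(deck: Deck, op: EditOp): Deck {
  switch (op.scope) {
    case "global":
      return { ...deck, ...op.set };
    case "slide":
      return {
        ...deck,
        slides: deck.slides.map((s) => (s.id === op.slideId ? { ...s, ...op.set } : s)),
      };
    case "element":
      return {
        ...deck,
        slides: deck.slides.map((s) =>
          s.id !== op.slideId
            ? s
            : {
                ...s,
                elements: s.elements.map((e) => (e.id === op.elementId ? { ...e, ...op.set } : e)),
              }
        ),
      };
  }
}
```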
Rendering Engine
For video export, we built a distributed rendering system using Remotion:
- Render jobs are queued via RabbitMQ.
- Stateless Node.js workers consume jobs and render videos.
- Assets and outputs are stored in S3-compatible object storage.
- Redis is used for real-time job status tracking and cancellation.
This design allows horizontal scaling and non-blocking user interactions.
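A stripped-down sketch of a render worker in this style, assuming amqplib for RabbitMQ and ioredis for Redis; the actual Remotion render and object-storage upload are stubbed out here rather than shown:

```typescript
// Sketch of a stateless render worker: assumes amqplib (RabbitMQ) and
// ioredis (Redis); the Remotion render and S3 upload are stubbed out.
import amqp from "amqplib";
import Redis from "ioredis";

interface RenderJob { jobId: string; deckId: string }

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function renderWithRemotion(job: RenderJob): Promise<string> {
  // In the real system this drives Remotion's server-side renderer; stubbed here.
  return `/tmp/${job.jobId}.mp4`;
}

async function uploadToObjectStorage(localPath: string): Promise<string> {
  // In the real system this uploads to S3-compatible object storage; stubbed here.
  return `https://assets.example.com/${localPath.split("/").pop()}`;
}

async function main() {
  const conn = await amqp.connect(process.env.RABBITMQ_URL ?? "amqp://localhost");
  const channel = await conn.createChannel();
  await channel.assertQueue("render-jobs", { durable: true });
  await channel.prefetch(1); // one render per worker at a time; scale out by adding workers

  await channel.consume("render-jobs", async (msg) => {
    if (!msg) return;
    const job: RenderJob = JSON.parse(msg.content.toString());

    // Honour cancellations requested while the job was still queued.
    if ((await redis.get(`job:${job.jobId}:status`)) === "cancelled") {
      channel.ack(msg);
      return;
    }

    await redis.set(`job:${job.jobId}:status`, "rendering");
    try {
      const file = await renderWithRemotion(job);
      const url = await uploadToObjectStorage(file);
      await redis.set(`job:${job.jobId}:status`, `done:${url}`);
    } catch {
      await redis.set(`job:${job.jobId}:status`, "failed");
    }
    channel.ack(msg);
  });
}

main().catch(console.error);
```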
Challenges We Faced
Balancing Creativity and Control
One of the hardest problems was allowing AI to be creative without making the output uneditable. We solved this by enforcing strict schemas and layout constraints while giving the AI freedom within those boundaries.
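As an illustration of what that can look like, the sketch below uses zod-style schema validation (zod itself is an assumption here, not a confirmed part of our stack): the model is free to choose content and ordering, but layouts, element kinds, and element counts must pass validation before a slide is accepted.

```typescript
// Sketch of schema-enforced boundaries on AI output, using zod as an
// example validator; the specific limits and enums are illustrative.
import { z } from "zod";

const ElementSchema = z.object({
  kind: z.enum(["heading", "body", "image", "code"]),
  content: z.string().min(1).max(500),                   // bound text length per element
  emphasis: z.enum(["normal", "strong"]).optional(),
});

const SlideSchema = z.object({
  layout: z.enum(["title", "two-column", "full-bleed"]), // only layouts the editor knows
  elements: z.array(ElementSchema).min(1).max(6),        // keep slides readable and editable
});

const DeckSchema = z.object({ slides: z.array(SlideSchema).min(1) });

// Reject (or retry generation) whenever the model drifts outside the boundaries.
export function parseGeneratedDeck(raw: string) {
  const result = DeckSchema.safeParse(JSON.parse(raw));
  if (!result.success) {
    throw new Error(`Generated deck failed validation: ${result.error.message}`);
  }
  return result.data;
}
```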
Latency and User Experience
AI generation and video rendering both introduce latency. Keeping the editor interactive required careful decoupling of synchronous UI actions from asynchronous background jobs.
Consistency Across Edits
Natural language edits can be ambiguous. Ensuring that repeated edits behaved consistently required building an internal representation that the AI could reason about and modify deterministically.
Debugging Distributed Systems
Running AI workers, render workers, queues, and storage locally introduced operational complexity. Docker Compose was critical in making the system reproducible and hackathon-ready.
Outcome
By the end of the project, OpenScenes successfully demonstrated:
- End-to-end AI-driven presentation generation
- Natural language editing at multiple levels
- Scalable, distributed video rendering
- A real-time, interactive editor
More importantly, it proved that presentations can be treated as programmable artifacts, not static files.
OpenScenes represents our exploration into what creative tools look like when they are built AI-first, with systems thinking at their core.
Built With
- gemini
- remotion
- typescript
