4D2³ - Video to Cubed: 3D Gaussian Splatting with Scene Editing

Inspiration

As undergraduate researchers in computer vision, we were inspired by the challenge of transforming ordinary video footage into immersive, interactive 3D experiences that not only capture the geometry but also understand the semantic content of scenes. The goal was to bridge the gap between 2D video content and truly intelligent 3D representations that can be edited, manipulated, and understood at a semantic level.

Our inspiration came from the practical need for 3D scene understanding in applications ranging from virtual reality content creation to touring a campus or finding an apartment. We wanted to create a system that could take any video (4D) and transform it into a 3D (2³) experience with intelligent object recognition and interactive editing capabilities. That way, end users can interact with a digital twin in place of the real world, allowing experimentation and exploration.

What it does

4D2³ (Video to Cubed) is a 3D Gaussian Splatting system that transforms 2D video sequences into interactive, semantically-aware 3D scenes. The system builds on top of a state-of-the-art computer vision method, EgoLifter, to create immersive 3D experiences with editing capabilities.

Key Features:

  • 3D Reconstruction: Converts video footage into 3D Gaussian splats using COLMAP-based structure-from-motion
  • Semantic Scene Understanding: Integrates Segment Anything Model (SAM) for automatic object detection and segmentation
  • Interactive 3D Editing: Provides GUI tools for object selection, deletion, and manipulation in 3D space
  • Multi-Scene Visualization: Supports simultaneous viewing of multiple 3D scenes with independent viewers
  • Real-time Rendering: Delivers an interactive 3D experience with modern WebGL-based visualization using Viser (a minimal viewer sketch follows this list)
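
To make the Viser piece concrete, here is a minimal viewer sketch (not our production code) that serves the centers of a trained splat as a point cloud. The checkpoint path, tensor keys, and port are placeholders, and exact viser API names may differ slightly across releases.

```python
# Minimal sketch: show the centers of a trained splat as a point cloud in Viser.
# The checkpoint layout below is hypothetical, not the project's actual format.
import time

import numpy as np
import torch
import viser

ckpt = torch.load("outputs/chair/gaussians.pt", map_location="cpu")  # placeholder path
means = ckpt["means"].numpy()                                         # (N, 3) Gaussian centers
colors = (ckpt["colors"].clamp(0, 1).numpy() * 255).astype(np.uint8)  # (N, 3) RGB

server = viser.ViserServer(port=8080)        # WebGL viewer at http://localhost:8080
server.scene.add_point_cloud(
    "/gaussian_centers",
    points=means.astype(np.float32),
    colors=colors,
    point_size=0.01,
)

while True:                                  # keep the server alive until interrupted
    time.sleep(1.0)
```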

How we built it

The 4D2³ system builds on the EgoLifter pipeline and combines multiple technologies:

Core Architecture:

  • 3D Gaussian Splatting: Implemented using the gsplat library for efficient 3D point cloud representation (a rendering sketch follows this list)
  • COLMAP Integration: Automated camera pose estimation and sparse 3D reconstruction
  • SAM Integration: Segment Anything Model for semantic object detection and segmentation
  • PyTorch Lightning: Scalable training framework with distributed computing support
  • Viser Viewer: Modern WebGL-based 3D visualization with interactive controls
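
As a rough illustration of the gsplat piece, the sketch below rasterizes a batch of random Gaussians into a single image. It assumes gsplat >= 1.0 and a CUDA device; every value is synthetic and only meant to show the expected input shapes, not our training configuration.

```python
# Minimal sketch: rasterize random 3D Gaussians into one image with gsplat.
# Assumes gsplat >= 1.0 and a CUDA device; all values are synthetic.
import torch
import torch.nn.functional as F
from gsplat import rasterization

device = "cuda"
N = 10_000

means = torch.randn(N, 3, device=device)                       # Gaussian centers
quats = F.normalize(torch.randn(N, 4, device=device), dim=-1)  # unit quaternions (orientations)
scales = torch.rand(N, 3, device=device) * 0.02                # per-axis extents
opacities = torch.rand(N, device=device)                       # alpha per Gaussian
colors = torch.rand(N, 3, device=device)                       # RGB per Gaussian

viewmat = torch.eye(4, device=device)[None]                    # world-to-camera, shape (1, 4, 4)
viewmat[0, 2, 3] = 4.0                                         # place the scene in front of the camera
K = torch.tensor([[[300.0, 0.0, 256.0],
                   [0.0, 300.0, 256.0],
                   [0.0, 0.0, 1.0]]], device=device)           # pinhole intrinsics, shape (1, 3, 3)

renders, alphas, meta = rasterization(
    means, quats, scales, opacities, colors,
    viewmats=viewmat, Ks=K, width=512, height=512,
)
print(renders.shape)  # (1, 512, 512, 3) rendered RGB image
```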

Technical Stack:

  • Backend: Python, PyTorch, CUDA for GPU acceleration
  • 3D Processing: COLMAP, gsplat, custom Gaussian manipulation algorithms
  • Computer Vision: SAM, GroundingDINO, CLIP for semantic understanding
  • Frontend: HTML5, CSS3, JavaScript with WebGL rendering
  • Infrastructure: Multi-process architecture supporting concurrent viewers

Development Process:

  1. Data Pipeline: Automated COLMAP processing for camera pose estimation (sketched after this list)
  2. SAM Integration: Real-time object detection and segmentation across video frames
  3. 3D Training: Custom Gaussian splatting with contrastive learning objectives
  4. Interactive Interface: Web-based 3D viewer with intuitive editing controls
  5. Multi-Viewer System: Concurrent visualization of multiple scenes
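
Step 1 can be approximated with a thin Python wrapper around the COLMAP command-line tools, as in the sketch below; the paths are placeholders and the flags shown are only the minimal set, not necessarily the ones our pipeline passes.

```python
# Minimal sketch of the automated COLMAP stage: feature extraction, matching,
# and sparse mapping for camera poses. Paths are placeholders.
import subprocess
from pathlib import Path

scene = Path("data/chair")   # hypothetical scene folder containing an images/ directory
db = scene / "database.db"
sparse = scene / "sparse"
sparse.mkdir(parents=True, exist_ok=True)


def run(*args: str) -> None:
    """Run one COLMAP subcommand and fail loudly if it errors."""
    subprocess.run(["colmap", *args], check=True)


run("feature_extractor", "--database_path", str(db), "--image_path", str(scene / "images"))
run("exhaustive_matcher", "--database_path", str(db))
run("mapper",
    "--database_path", str(db),
    "--image_path", str(scene / "images"),
    "--output_path", str(sparse))
```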

Key Technical Innovations:

  • Custom buffer management for PyTorch registered tensors (sketched after this list)
  • A smaller SAM model swapped into the pipeline, allowing it to run on a laptop
  • Multi-port viewer architecture for independent scene visualization
  • Dynamic HTML generation for flexible user interfaces
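
The buffer-management point is easiest to see in simplified form: resize each registered buffer to the checkpoint's shape before the usual state-dict load. The class and field names below are illustrative, not EgoLifter's actual model code.

```python
# Simplified illustration: resize registered buffers to the checkpoint's shapes
# so a model with a different number of Gaussians loads without IndexError.
# Class and field names are illustrative.
import torch
import torch.nn as nn


class GaussianModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Buffers start empty; their true size is only known once a checkpoint arrives.
        self.register_buffer("means", torch.empty(0, 3))
        self.register_buffer("opacities", torch.empty(0))

    def load_state_dict(self, state_dict, strict: bool = True):
        # Re-register each mismatched buffer with the checkpoint's shape, then
        # fall back to the normal loading logic.
        for name, buf in list(self.named_buffers()):
            if name in state_dict and state_dict[name].shape != buf.shape:
                self._buffers[name] = torch.empty_like(state_dict[name])
        return super().load_state_dict(state_dict, strict=strict)


model = GaussianModel()
ckpt = {"means": torch.randn(1000, 3), "opacities": torch.rand(1000)}
model.load_state_dict(ckpt)
print(model.means.shape)  # torch.Size([1000, 3])
```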

Challenges we ran into

Technical Challenges:

  1. CUDA Compilation Issues: Encountered complex CUDA environment conflicts requiring custom workarounds and environment variable management for proper gsplat compilation.

  2. PyTorch Buffer Management: Faced critical IndexError issues with registered PyTorch buffers during Gaussian model initialization.

  3. WebGL Context Conflicts: Struggled with browser limitations when embedding multiple 3D viewers in iframes, leading to WebGL context conflicts and performance degradation.

  4. SAM Integration Complexity: Required extensive environment setup with multiple dependencies (GroundingDINO, SAM, Tag2Text) and careful path management for detection file generation.
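
Leaving the full GroundingDINO + SAM + Tag2Text stack aside, the SAM portion alone reduces to automatic mask generation with a small backbone, roughly as sketched below. The frame and checkpoint paths are placeholders; ViT-B is the kind of lighter model swap mentioned under Key Technical Innovations.

```python
# Pared-down sketch: automatic mask generation on one video frame with the
# smaller ViT-B SAM backbone. Paths are placeholders.
import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load a single frame as an RGB uint8 array (path is illustrative).
frame = np.array(Image.open("data/chair/images/frame_0001.jpg").convert("RGB"))

# ViT-B is the smallest official SAM backbone, keeping memory low enough for a laptop GPU.
sam = sam_model_registry["vit_b"](checkpoint="checkpoints/sam_vit_b_01ec64.pth")
sam.to("cuda")

mask_generator = SamAutomaticMaskGenerator(sam)
masks = mask_generator.generate(frame)   # one dict per detected region

for m in masks[:3]:
    print(m["area"], m["bbox"])          # pixel area and (x, y, w, h) box per segment
```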

Accomplishments that we're proud of

Technical Achievements:

  1. Successful Multi-Scene Training: Trained models on diverse scenes including an indoor object (chair), outdoor campus environments (Penn), and complex urban settings (Drexel), with a combined 1500+ camera viewpoints.

  2. Buffer Management: Solved PyTorch buffer initialization issues with drop-in fixes for proper tensor resizing.

  3. Multi-Viewer Architecture: Created a multi-process system supporting three independent 3D viewers with automatic process management and clean shutdown procedures (a simplified sketch follows this list).

  4. UI/UX: Designed a landing page with a descriptive GIF layout and custom additions to the Viser GUI.
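
The multi-viewer architecture from point 3 boils down to one process per scene, each owning its own Viser port, with every child process cleaned up on exit. The sketch below is a simplification of what we actually run; scene names and ports are illustrative.

```python
# Rough sketch: one process per scene, each serving its own Viser viewer on a
# dedicated port, with cleanup on shutdown. Scene names and ports are illustrative.
import atexit
import multiprocessing as mp
import time

import viser


def serve_scene(name: str, port: int) -> None:
    """Start an independent viewer for one scene (scene loading omitted)."""
    server = viser.ViserServer(port=port)
    print(f"Serving '{name}' at http://localhost:{port}")
    # ... load this scene's Gaussians and add them to server.scene here ...
    while True:
        time.sleep(1.0)


if __name__ == "__main__":
    scenes = {"chair": 8080, "penn": 8081, "drexel": 8082}
    procs = [mp.Process(target=serve_scene, args=(n, p), daemon=True) for n, p in scenes.items()]

    def shutdown() -> None:
        for p in procs:        # clean shutdown: terminate and join every viewer process
            p.terminate()
            p.join()

    atexit.register(shutdown)
    for p in procs:
        p.start()
    for p in procs:
        p.join()               # block until the viewers exit (e.g. on Ctrl+C)
```

Keeping each viewer in its own process and on its own port also sidesteps the iframe-related WebGL context conflicts described under Challenges.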

What we learned

Technical Insights:

  1. CUDA Environment Complexity: Learned intricate details of CUDA compilation, environment variables, and cross-platform compatibility challenges.

  2. Multi-Process Architecture: Developed expertise in Python process management, inter-process communication, and resource coordination.

What's next for 4D2³

  1. Enhanced Editing Tools: Develop more sophisticated 3D editing capabilities including per-object transformation, scaling, and advanced selection tools.

  2. Mobile Support: Optimize the system for mobile devices with WebGL compatibility and touch-based interaction.
