The Robot Data Project
"The internet is full of LLM data, but not robotics data" — false. We're extracting robot joint data from natural human motion: a human-to-robot embodiment gap-filling layer.
Inspiration
The consensus in AI is that we are "running out of data" for robotics because robots don't live on the internet the way text does. We disagree. The internet is overflowing with high-quality movement data; it's just trapped in human bodies. We were inspired by the idea that a human waving hello or picking up a glass obeys the same fundamental physics as a robot doing the same, provided you have the right "translator" to bridge the gap.
What it does
The project acts as a translation layer between human motion and machine action. It observes natural human movement in ordinary video and extracts the underlying skeletal pose and joint-level intent. By filling the "embodiment gap," it lets us take a video of a person performing a task and map those physics onto a robot's specific hardware configuration in near real time, turning every YouTube tutorial into a potential training set for a mechanical arm.
How we built it
We focused on creating a sophisticated mapping system that understands spatial relationships. We looked at how humans move through 3D space and developed a framework that "retargets" those motions. Instead of trying to make a robot mimic a human exactly (which is hard because our proportions are different), we built a system that prioritizes the intent and the endpoint of the motion, ensuring the robot achieves the same result in a way that makes sense for its own frame.
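To make the "endpoint first" idea concrete, here is a minimal sketch of the kind of retargeting step we mean: express the human wrist position relative to the shoulder, rescale it from human arm length to the robot's reach, and let the robot's own inverse kinematics choose the joint angles. The function and field names are illustrative, not the project's actual API.

```js
// Conceptual sketch: endpoint-first retargeting. Instead of copying human
// joint angles, map the wrist position into the robot's workspace and let an
// IK solver resolve the rest. All names here are illustrative.
function retargetEndpoint(humanWrist, humanShoulder, humanArmLength, robotReach) {
  const scale = robotReach / humanArmLength; // normalize human reach to robot reach
  return {
    x: (humanWrist.x - humanShoulder.x) * scale,
    y: (humanWrist.y - humanShoulder.y) * scale,
    z: (humanWrist.z - humanShoulder.z) * scale,
  };
}
```

The returned target (in the robot's shoulder/base frame) is what the robot is asked to reach; its joint angles fall out of that target rather than being copied from the human skeleton.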
Tech Stack & Architecture
The system is a three-tier pipeline that runs entirely in the browser, backed only by a lightweight Node.js signaling server:
| Layer | Technology | Purpose |
|---|---|---|
| Tier 0 — Streaming | WebRTC + WebSocket | Low-latency video transport from phone camera to laptop |
| Tier 1 — Perception | Google MediaPipe (Vision Tasks API) | Real-time pose & hand landmark extraction |
| Tier 2 — Simulation | Three.js | 3D robot arm visualization driven by extracted joint data |
Signaling Server — A minimal Node.js + Express HTTPS server with WebSocket (ws) handles peer discovery and WebRTC offer/answer/ICE candidate exchange. Self-signed TLS certificates are generated at startup using the selfsigned library.
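A minimal sketch of such a signaling server is shown below, assuming a simple "relay every message to the other peer" policy and a public/ directory for the static pages; the actual routing and file layout in the repo may differ.

```js
// server.js — sketch of the signaling layer: Express over HTTPS with a
// self-signed certificate, plus a WebSocket relay for offer/answer/ICE.
const express = require("express");
const https = require("https");
const { WebSocketServer } = require("ws");
const selfsigned = require("selfsigned");

// Generate a throwaway TLS certificate at startup (required because
// getUserMedia on the phone only works in secure contexts).
const pems = selfsigned.generate([{ name: "commonName", value: "localhost" }], { days: 365 });

const app = express();
app.use(express.static("public")); // serves phone.html and the laptop page (assumed path)

const server = https.createServer({ key: pems.private, cert: pems.cert }, app);
const wss = new WebSocketServer({ server });

// Dumb relay: forward every signaling message to every other connected peer.
wss.on("connection", (socket) => {
  socket.on("message", (data) => {
    for (const client of wss.clients) {
      if (client !== socket && client.readyState === 1 /* OPEN */) {
        client.send(data.toString());
      }
    }
  });
});

server.listen(8443, () => console.log("Signaling on https://localhost:8443"));
```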
Video Capture & Transport (Tier 0) — The phone opens phone.html, captures the camera feed via the MediaDevices API (getUserMedia), and streams it peer-to-peer to the laptop over WebRTC using Google's public STUN server (stun:stun.l.google.com:19302) for NAT traversal. No video ever touches our server — it's direct device-to-device.
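The phone side of that handshake looks roughly like the sketch below. The JSON message shapes are illustrative assumptions; the WebRTC and MediaDevices calls are the standard browser APIs.

```js
// phone.html — sketch of the sender: capture the camera, trickle ICE through
// the signaling WebSocket, and stream peer-to-peer to the laptop.
const signaling = new WebSocket(`wss://${location.host}`);
const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
});

// Trickle ICE candidates to the laptop via the signaling server.
pc.onicecandidate = ({ candidate }) => {
  if (candidate) signaling.send(JSON.stringify({ type: "ice", candidate }));
};

// Apply the laptop's answer and its ICE candidates as they arrive.
signaling.onmessage = async ({ data }) => {
  const msg = JSON.parse(data);
  if (msg.type === "answer") {
    await pc.setRemoteDescription({ type: "answer", sdp: msg.sdp });
  } else if (msg.type === "ice") {
    await pc.addIceCandidate(msg.candidate);
  }
};

async function start() {
  // Rear camera, video only; the stream goes straight into the peer connection.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: "environment" },
    audio: false,
  });
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send(JSON.stringify({ type: "offer", sdp: offer.sdp }));
}

signaling.onopen = start;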
AI Perception (Tier 1) — On the laptop, the incoming WebRTC video stream is fed directly into two MediaPipe models running client-side via WebAssembly + GPU delegation:
- PoseLandmarker (pose_landmarker_lite) — Extracts 33 full-body skeletal keypoints at video frame rate
- HandLandmarker — Extracts 21 keypoints per hand with handedness classification (Left/Right)
Both models run in VIDEO mode, processing each frame through detectForVideo() on a requestAnimationFrame loop. The raw landmarks are drawn onto an overlay <canvas> using MediaPipe's DrawingUtils.
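In code, the Tier 1 setup looks roughly like this. The model file paths and the #overlay / #remote-video selectors are placeholders (remoteVideo being the <video> element playing the incoming WebRTC stream); the library calls themselves are the standard Tasks Vision API.

```js
// Tier 1 sketch: load both landmarkers with GPU delegation, then run them on
// each video frame and draw the raw landmarks on the overlay canvas.
import {
  FilesetResolver,
  PoseLandmarker,
  HandLandmarker,
  DrawingUtils,
} from "@mediapipe/tasks-vision";

const vision = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.14/wasm"
);
const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: "/models/pose_landmarker_lite.task", delegate: "GPU" },
  runningMode: "VIDEO",
  numPoses: 1,
});
const hands = await HandLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: "/models/hand_landmarker.task", delegate: "GPU" },
  runningMode: "VIDEO",
  numHands: 2,
});

// remoteVideo is the <video> element already playing the incoming WebRTC stream.
const remoteVideo = document.querySelector("#remote-video");
const ctx = document.querySelector("#overlay").getContext("2d");
const drawer = new DrawingUtils(ctx);

function tick() {
  const now = performance.now();
  const poseResult = pose.detectForVideo(remoteVideo, now);
  const handResult = hands.detectForVideo(remoteVideo, now);

  ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
  for (const lm of poseResult.landmarks) {
    drawer.drawConnectors(lm, PoseLandmarker.POSE_CONNECTIONS);
    drawer.drawLandmarks(lm);
  }
  for (const lm of handResult.landmarks) drawer.drawLandmarks(lm);

  requestAnimationFrame(tick);
}
requestAnimationFrame(tick);
```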
Teleop Data Pipeline (Tier 2) — Each frame's pose + hand landmarks are packaged into a structured teleop_frame object and broadcast at ~30Hz. This data drives a real-time Three.js robot arm simulation: joint angles are computed from the human skeleton's spatial relationships, mapping shoulder/elbow/wrist positions onto the robot's kinematic chain. A T-pose calibration step establishes the neutral reference frame.
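A simplified version of the per-frame mapping is sketched below, using MediaPipe's pose landmark indices for the left shoulder, elbow, and wrist (11, 13, 15). The teleop_frame fields and the calibration shape are illustrative; the real mapping covers the full kinematic chain.

```js
// Tier 2 sketch: compute joint-level quantities from the human skeleton and
// package them as one teleop frame. Landmark indices follow the MediaPipe
// Pose convention: 11 = left shoulder, 13 = left elbow, 15 = left wrist.
function angleAt(a, b, c) {
  // Angle at joint b formed by the segments b->a and b->c, in radians.
  const v1 = { x: a.x - b.x, y: a.y - b.y, z: a.z - b.z };
  const v2 = { x: c.x - b.x, y: c.y - b.y, z: c.z - b.z };
  const dot = v1.x * v2.x + v1.y * v2.y + v1.z * v2.z;
  const mags = Math.hypot(v1.x, v1.y, v1.z) * Math.hypot(v2.x, v2.y, v2.z);
  return Math.acos(Math.min(1, Math.max(-1, dot / mags)));
}

// `calibration` holds the landmark positions captured during the T-pose step.
function buildTeleopFrame(poseLandmarks, handLandmarks, calibration) {
  const [shoulder, elbow, wrist] = [poseLandmarks[11], poseLandmarks[13], poseLandmarks[15]];
  return {
    t: performance.now(),
    elbowAngle: angleAt(shoulder, elbow, wrist),
    shoulderOffset: {
      x: shoulder.x - calibration.shoulder.x,
      y: shoulder.y - calibration.shoulder.y,
      z: shoulder.z - calibration.shoulder.z,
    },
    hands: handLandmarks,
  };
}
```

On the Three.js side, a value like elbowAngle maps almost directly onto one joint of the arm's kinematic chain, e.g. robotElbow.rotation.z = Math.PI - frame.elbowAngle (again illustrative), with the T-pose offsets driving the shoulder joints.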
APIs & Libraries
- MediaPipe Tasks Vision (@mediapipe/tasks-vision@0.10.14) — Pose and hand landmark detection
- WebRTC (browser-native) — Peer-to-peer video streaming
- WebSocket (ws@8.16.0) — Real-time signaling
- Three.js (three@0.160.0) — 3D robot simulation rendering
- Express (express@4.19.2) — Static file serving & HTTPS server
- selfsigned (selfsigned@2.4.1) — TLS certificate generation
Challenges we ran into
The "Embodiment Gap" is deeper than it looks. Humans have a specific number of degrees of freedom, and robots often have fewer—or they are arranged in ways that would make a human snap a bone. Dealing with "self-collision" (the robot accidentally hitting itself while trying to follow a human's lead) and the difference in center of gravity were significant hurdles that required a lot of creative spatial reasoning.
Accomplishments that we're proud of
We successfully proved that you don't need a trillion-dollar lab to generate robotics data. We're proud of creating a pipeline where a simple video of a person performing a task can be translated into a usable data format for a robot in near real-time. We've effectively turned the world's library of human video into a library of robot potential.
