Inspiration

Software AI agents are powerful, but they have a fatal flaw: they live inside the operating system. If Windows crashes, the network fails, or the computer is stuck in a BIOS boot loop, a software agent is helpless. We wanted to build an AI that exists outside the box - literally.

Inspired by the resilience of KVM-over-IP technology used in data centers, we asked: “What if we gave Gemini 3 eyes and hands?” We wanted to create an "Agentic Hardware" device that could walk up to any computer - regardless of OS, state, or software stack - and control it just like a human would: by looking at the screen and typing on the keyboard.

What it does

KaiVM is a revolutionary hardware AI agent. It connects to a target computer via HDMI (video capture) and USB (keyboard/mouse emulation).

  • Universal Control: It can interact with any machine - Windows, Mac, Linux, or even a server stuck in a BIOS menu.
  • Visual Reasoning: Using the Gemini 3 API, it captures the screen, analyzes the UI (using a custom coordinate grid), and plans actions to achieve complex goals.
  • Autonomous Repair: It can detect system crashes, navigate recovery menus, and even reinstall an operating system from scratch.
  • Smart Peripheral: It emulates a custom USB "Absolute Mouse" (digitizer), allowing Gemini to click specific UI elements with pixel-perfect accuracy, solving the "aiming" problem common in AI agents.
  • Event Watchdog: Users can set visual triggers (e.g., "If an error popup appears..."), and KaiVM will watch the screen 24/7 and react automatically - see the sketch below.
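
To make the watchdog concrete, here is a minimal sketch of that loop in Python, assuming the google-generativeai client; the model name, the file-based frame source, and the poll interval are placeholders rather than our exact implementation:

    import time
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-3")  # placeholder model name

    TRIGGER = "If an error popup is visible, reply exactly TRIGGERED; otherwise reply OK."

    def capture_frame() -> Image.Image:
        """Hypothetical helper: grab the latest JPEG from the HDMI capture feed."""
        return Image.open("/tmp/latest_frame.jpg")

    while True:
        reply = model.generate_content([TRIGGER, capture_frame()]).text
        if "TRIGGERED" in reply:
            print("Trigger fired - handing control to the agent loop")
            # e.g., dismiss the popup through the HID stack
        time.sleep(5)  # poll interval; a real deployment would also rate-limit API calls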

How we built it

We built KaiVM as a tightly integrated hardware-software stack:

  1. Hardware Core: A Raspberry Pi 4 acts as the brain. It captures the target's HDMI output using a USB capture card and emulates a USB HID device using the Linux USB Gadget API (ConfigFS); a condensed bring-up sketch follows this list.
  2. The "Eyes": We used ffmpeg to capture the raw video feed and stream it via MJPEG to our web interface and the Gemini API. We built a pre-processor that overlays a 1000x1000 coordinate grid to give the model precise spatial awareness (the overlay is sketched below).
  3. The "Brain": The core logic is powered by Gemini 3, chosen for its incredible multimodal speed. We send screenshots and receive structured JSON plans containing keystrokes and mouse coordinates (see the planner sketch below).
  4. The "Hands": We wrote a custom Python HID stack that translates Gemini's high-level intent (e.g., "Click the Start Button") into raw USB reports. We implemented an Absolute Mouse descriptor so the AI can map screen coordinates directly to HID inputs, bypassing mouse acceleration issues (the descriptor and report format appear in the sketches below).
  5. Interface: A modern, responsive FastAPI & WebSockets backend serves a dashboard where users can view the live stream, chat with the agent, and schedule tasks (a minimal endpoint sketch closes this section).
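
A condensed sketch of the gadget bring-up from step 1, together with the absolute-pointer report descriptor from step 4. This is illustrative rather than our exact production script: it must run as root, the gadget name and VID/PID are placeholders, and string/config descriptors are trimmed for brevity:

    from pathlib import Path

    # Absolute-pointer ("digitizer-style") HID report descriptor:
    # 3 buttons, then 16-bit X and Y with a 0..32767 logical range.
    REPORT_DESC = bytes([
        0x05, 0x01,              # Usage Page (Generic Desktop)
        0x09, 0x02,              # Usage (Mouse)
        0xA1, 0x01,              # Collection (Application)
        0x09, 0x01,              #   Usage (Pointer)
        0xA1, 0x00,              #   Collection (Physical)
        0x05, 0x09,              #     Usage Page (Button)
        0x19, 0x01, 0x29, 0x03,  #     Buttons 1-3
        0x15, 0x00, 0x25, 0x01,  #     Logical 0..1
        0x95, 0x03, 0x75, 0x01,  #     3 fields of 1 bit
        0x81, 0x02,              #     Input (Data, Var, Abs)
        0x95, 0x01, 0x75, 0x05,  #     5 bits of padding
        0x81, 0x03,              #     Input (Const)
        0x05, 0x01,              #     Usage Page (Generic Desktop)
        0x09, 0x30, 0x09, 0x31,  #     Usage X, Usage Y
        0x16, 0x00, 0x00,        #     Logical Minimum (0)
        0x26, 0xFF, 0x7F,        #     Logical Maximum (32767)
        0x75, 0x10, 0x95, 0x02,  #     2 fields of 16 bits
        0x81, 0x02,              #     Input (Data, Var, Abs)
        0xC0, 0xC0,              #   End both collections
    ])

    g = Path("/sys/kernel/config/usb_gadget/kaivm")  # gadget name is arbitrary
    hid = g / "functions/hid.usb0"
    hid.mkdir(parents=True)
    (g / "configs/c.1").mkdir(parents=True)

    (g / "idVendor").write_text("0x1d6b")   # placeholder vendor/product IDs
    (g / "idProduct").write_text("0x0104")

    (hid / "protocol").write_text("0")
    (hid / "subclass").write_text("0")
    (hid / "report_length").write_text("5")  # 1 button byte + two 16-bit coords
    (hid / "report_desc").write_bytes(REPORT_DESC)

    (g / "configs/c.1/hid.usb0").symlink_to(hid)

    # Bind to the Pi 4's USB device controller; the target now enumerates a mouse.
    udc = next(Path("/sys/class/udc").iterdir()).name
    (g / "UDC").write_text(udc)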
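
The grid pre-processor from step 2, reduced to its essence; the color, label placement, and helper names are illustrative choices:

    from PIL import Image, ImageDraw

    def overlay_grid(frame: Image.Image, steps: int = 10) -> Image.Image:
        """Draw a labelled grid so the model can answer in 0-1000 coordinates."""
        img = frame.convert("RGB")
        draw = ImageDraw.Draw(img)
        w, h = img.size
        for i in range(steps + 1):
            gx, gy = i * (w - 1) // steps, i * (h - 1) // steps
            draw.line([(gx, 0), (gx, h)], fill=(0, 255, 0), width=1)
            draw.line([(0, gy), (w, gy)], fill=(0, 255, 0), width=1)
            label = str(i * 1000 // steps)  # 0, 100, ..., 1000
            draw.text((gx + 2, 2), label, fill=(0, 255, 0))
            draw.text((2, gy + 2), label, fill=(0, 255, 0))
        return img

    def grid_to_px(gx: int, gy: int, w: int, h: int) -> tuple[int, int]:
        """Convert the model's 0-1000 grid answer back to pixel coordinates."""
        return gx * (w - 1) // 1000, gy * (h - 1) // 1000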
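
A minimal sketch of the step 3 planning call, again assuming the google-generativeai client; the action schema and prompt wording are simplified stand-ins for our real planner:

    import json
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel(
        "gemini-3",  # placeholder model name
        generation_config={"response_mime_type": "application/json"},
    )

    PLANNER = """You control a computer through a keyboard and an absolute mouse.
    The screenshot carries a 0-1000 coordinate grid. Return JSON of the form
    {"actions": [{"type": "click", "x": ..., "y": ...} or
                 {"type": "type", "text": "..."} or
                 {"type": "key", "combo": "ctrl+alt+del"}]}.
    Goal: %s"""

    def plan(frame, goal: str) -> list[dict]:
        resp = model.generate_content([PLANNER % goal, frame])
        return json.loads(resp.text)["actions"]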
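
The "Hands" from step 4 then reduce to packing a 5-byte report (one button byte plus two little-endian 16-bit coordinates) and writing it to the gadget's device node. The device path and screen size below are assumptions:

    import struct

    HIDG = "/dev/hidg0"  # created once the gadget above is bound to the UDC

    def click(x_px: int, y_px: int, screen_w: int = 1920, screen_h: int = 1080):
        # Scale pixel coordinates into the descriptor's 0..32767 logical range.
        x = x_px * 32767 // (screen_w - 1)
        y = y_px * 32767 // (screen_h - 1)
        with open(HIDG, "wb", buffering=0) as hid:
            hid.write(struct.pack("<BHH", 0x01, x, y))  # move + left button down
            hid.write(struct.pack("<BHH", 0x00, x, y))  # left button up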
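
Finally, a stripped-down sketch of the step 5 backend; the frame source here is a file-based stand-in for the real ffmpeg MJPEG pipeline:

    from fastapi import FastAPI, WebSocket
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    def mjpeg_frames():
        """Stand-in generator: yield multipart chunks from the capture feed."""
        while True:
            with open("/tmp/latest_frame.jpg", "rb") as f:
                jpeg = f.read()
            yield b"--frame\r\nContent-Type: image/jpeg\r\n\r\n" + jpeg + b"\r\n"

    @app.get("/stream")
    def stream():
        return StreamingResponse(
            mjpeg_frames(),
            media_type="multipart/x-mixed-replace; boundary=frame",
        )

    @app.websocket("/ws")
    async def ws(websocket: WebSocket):
        await websocket.accept()
        while True:
            cmd = await websocket.receive_text()  # instruction from the dashboard
            await websocket.send_json({"log": f"queued: {cmd}"})  # echo to the UI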

Architecture

+---------------------+          +-----------------------------+          +---------------------+
|    USER DASHBOARD   |          |   KAIVM (RASPBERRY PI 4)    |          |    GOOGLE CLOUD     |
|                     |          |                             |          |                     |
|  [ Web Interface ]  |  Instr.  |    [ FastAPI Server ]       |  Image   |   [ Gemini 3 API ]  |
|                     |--------->|                             |--------->|                     |
|  < View Stream >    |<---------| (State, Logs, WebSockets)   |<---------| (Visual Reasoning)  |
|  < Send cmds >      |  Stream  |            |   ^            |   Plan   |                     |
+---------------------+          +------------|---|------------+          +---------------------+
                                              |   |
                                      Key/Mouse   | Raw Video
                                      Commands    |
                                              v   |
                                 +-----------------------------+
                                 |      HARDWARE LAYER         |
                                 |                             |
                                 |  [USB Gadget]   [HDMI Cap]  |
                                 +--------+-------------^------+
                                          |             |
                                      USB |        HDMI |
                                          v             |
                                 +-----------------------------+
                                 |                             |
                                 |       TARGET COMPUTER       |
                                 |      (BIOS / OS / Any)      |
                                 |                             |
                                 +-----------------------------+

Challenges we ran into

  • The "Relative Mouse" Problem: Standard mice send "delta" movements (+10px), which are terrible for AI because the model can't "feel" acceleration or know where the cursor actually ended up. We had to dig into the Linux USB gadget stack to present KaiVM as a digitizer tablet, allowing absolute (x, y) positioning (see the descriptor sketch under "How we built it").
  • Hardware & 3D Printing: We had absolutely no previous CAD experience, so designing the enclosure for the Pi and capture card was a massive challenge. We went through many failed prints and design iterations to get everything to fit just right.
  • Latency vs. Accuracy: Streaming high-res video to an API takes time. We had to optimize our ffmpeg pipeline and image compression to ensure Gemini saw the "fresh" state of the screen, otherwise it would react to old frames.
  • Safety & Loops: AI agents love to get stuck in loops (e.g., pressing "Enter" repeatedly). We built a "state hash" system that pauses execution if the screen hasn't changed after several actions, preventing the agent from spiraling (a minimal version is sketched below).
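
A minimal version of that loop guard; downscaling and grayscaling before hashing is an illustrative way to keep analog capture noise from defeating the comparison:

    import hashlib
    from PIL import Image

    def state_hash(frame: Image.Image) -> str:
        # Downscale + grayscale first so capture noise doesn't change the hash.
        small = frame.convert("L").resize((64, 36))
        return hashlib.sha256(small.tobytes()).hexdigest()

    last_hash, stalled = None, 0

    def loop_guard(frame: Image.Image, limit: int = 3) -> bool:
        """Return True if the agent should pause because the screen stopped changing."""
        global last_hash, stalled
        h = state_hash(frame)
        stalled = stalled + 1 if h == last_hash else 0
        last_hash = h
        return stalled >= limit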

Accomplishments that we're proud of

  • Hardware Independence: We successfully controlled a computer purely through hardware interfaces. No software was installed on the target machine.
  • Pixel-Perfect Aim: Achieving the "Absolute Mouse" implementation was a breakthrough. Watching Gemini look at a button and click it instantly—without dragging or guessing—feels magical.
  • Recovering a Dead PC: We successfully tested KaiVM navigating a BIOS menu, something no standard software agent (like Claude Computer Use or OpenAI Operator) can do.
  • Robust Architecture: The system is resilient. If the target PC reboots, KaiVM stays alive, watches the boot sequence, and logs back in.

What we learned

  • Multimodal is Ready: Gemini 3's ability to "read" a messy, low-resolution BIOS screen or a cluttered desktop is staggering. It requires very little "prompt engineering" to understand UI elements.
  • Hardware is Hard: Managing USB descriptors, video capture devices, and Linux kernel modules adds a layer of complexity that pure software apps don't face, but the payoff is immense.
  • Context is King: Providing the AI with the previous frame and a history of its last actions drastically improved its reasoning and reduced repetitive mistakes (see the sketch below).
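
As a rough sketch of what that context looks like in practice (the prompt wording and helper are illustrative; model is the client object from the earlier sketches):

    def plan_with_context(model, prev_frame, cur_frame, history: list[str]):
        """Ask for the next action given the previous frame and recent action log."""
        prompt = [
            "Your last actions:\n" + "\n".join(history[-5:]),
            "Screen BEFORE your last action:", prev_frame,
            "Screen NOW:", cur_frame,
            "If the screen did not change as expected, try a different approach.",
            "Return the next single action as JSON.",
        ]
        return model.generate_content(prompt).text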

What's next for KaiVM

  • Open Source & Hardware Kits: The project is fully open source. However, we are planning to sell pre-assembled, plug-and-play KaiVM units for users who don't want to source components and print their own cases.
  • Pro Version: We are exploring a paid "Pro" software tier that includes advanced fleet management for controlling multiple KaiVM units from a single dashboard.
  • Cellular / LoRa Backhaul: Adding a 4G modem so KaiVM can be dropped into a remote data center and controlled even if the building's internet goes down.
  • Custom PCB: Shrinking the form factor from a Raspberry Pi to a dedicated "dongle" size device.

Built With

  • python
  • fastapi
  • ffmpeg
  • gemini-api
  • linux-usb-gadget
  • raspberry-pi
  • websockets