Inspiration:

On the first night of TreeHacks, we struggled to come up with an idea. At one point, a teammate was about to order The Melt on DoorDash when someone else suggested ordering directly from The Melt website instead. That sparked a conversation. We realized that while many businesses encourage direct ordering, their websites often provide a clunky user experience, even if they outsource delivery to platforms like DoorDash or Uber Eats. As a group, we started thinking about how useful it would be to have a tool that quickly checks across platforms and tells you the cheapest option.

From there, we expanded the idea. What if this worked not just for food delivery, but also for ride sharing, hotel bookings, and other services where pricing varies across platforms? That led us to a broader vision: a computer-use agent for iPhone. Instead of building separate integrations for every service, an agent could navigate apps and websites directly on your behalf. Price comparison would be just one use case. The same agent could create and send a Partiful invite to your contacts, transfer money through Venmo, or handle other everyday digital tasks automatically.

What it does:

Pilot is a hardware-augmented computer-use agent that operates your smartphone the same way a human would: by seeing the screen, reasoning about context, and executing physical input on the interface.

It functions as a closed-loop autonomous system: voice intent becomes real on-screen action, results are verified through vision, and the agent adapts until the task is complete.

Hardware

  • Bluetooth HID controller (ESP32) – Emulates a keyboard and mouse to perform real taps, keystrokes, swipes, scrolling, and system shortcuts anywhere on the device.
  • Dual connectivity architecture – Maintains stable BLE input while communicating with a cloud relay server over WiFi for low-latency command delivery.
  • Precision input execution – Calibrated keystroke and gesture timing ensures reliable interaction across dynamic UI states, including lock screen and system-level input.
  • Custom 3D-printed phone case housing – Securely integrates the ESP32 into a practical, daily-use form factor with seamless attachment and detachment.

AI-powered software

  • Always-on speech recognition (Apple Speech + VPIO) – Converts natural voice into structured intent with hardware echo cancellation for true full-duplex interaction.
  • System-level screen perception (ReplayKit Broadcast Extension) – Captures the entire live device display in real time, across all apps and system screens, without jailbreak.
  • Multimodal reasoning + tool orchestration (Gemini 3 Flash) – Interprets voice intent and visual screen context together, generates multi-step actions, invokes control tools, and adapts dynamically based on real-time outcomes.
  • Visual grounding (Moondream API) – Identifies UI elements from screenshots and computes pixel-accurate coordinates for precise, reliable interaction (see the coordinate-mapping sketch after this list).
  • Cloud relay infrastructure (FastAPI + Cloudflare Tunnel) – Routes command sequences to the ESP32 through a persistent endpoint.
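
To make the grounding step concrete, the sketch below maps a normalized point from a grounding model to device pixels and wraps it in a tap command. The response shape, the screen dimensions, and the TAP command format are assumptions for illustration, not our exact code.

```python
# Sketch: map a grounding model's normalized (x, y) point to device pixels
# and build a tap command for the relay. The point format and the "TAP x y"
# command string are assumptions for illustration.

SCREEN_W, SCREEN_H = 1179, 2556  # example iPhone-class resolution

def normalized_point_to_pixels(point: dict, width: int = SCREEN_W, height: int = SCREEN_H) -> tuple[int, int]:
    """Convert a normalized point in [0, 1] x [0, 1] to integer pixel coordinates."""
    x = round(point["x"] * width)
    y = round(point["y"] * height)
    # Clamp so a slightly out-of-range prediction never produces an invalid tap.
    return min(max(x, 0), width - 1), min(max(y, 0), height - 1)

def tap_command(point: dict) -> str:
    """Build the compact command string the relay forwards to the ESP32."""
    x, y = normalized_point_to_pixels(point)
    return f"TAP {x} {y}"

# Example: a grounding result for an "Add to cart" button
print(tap_command({"x": 0.52, "y": 0.81}))  # -> "TAP 613 2070"
```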

Pilot replaces brittle app-specific integrations with universal screen understanding and real physical control. It works across arbitrary apps, shifting interfaces, and multi-step workflows—anywhere a human can tap, type, or scroll.

How we built it:

Pilot is a computer-use agent that runs on your iPhone. It sees your screen, reasons about it with a vision-language model, and acts through an ESP32 that presents itself to the phone as a BLE HID keyboard and mouse. There’s no per-app API integration: the agent uses the same interface you do, screens and input.

We built an iOS app in Swift. It uses ReplayKit to capture the screen and shares the latest frame whenever the agent needs a screenshot. Voice input uses on-device speech recognition; we send the transcript to OpenRouter (Gemini 3 Flash), which can call tools to take a screenshot or send HID commands. We use Cartesia for voice replies.
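
To make the reasoning step concrete, here is a rough sketch of one turn of that loop using an OpenAI-compatible client pointed at OpenRouter, with two tools the model may call. The tool names, model slug, and prompts are placeholders we chose for illustration (and our app makes this request from Swift rather than Python).

```python
# Sketch: one turn of the voice -> reasoning step through OpenRouter's
# OpenAI-compatible API. Tool names, the model slug, and the system prompt
# are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "take_screenshot",
            "description": "Capture the current phone screen and return it as an image.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_hid_command",
            "description": "Send a compact HID command (tap, type, swipe) to the relay.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    },
]

def plan_next_action(transcript: str):
    """Ask the model what to do next given the user's spoken request."""
    response = client.chat.completions.create(
        model="google/gemini-3-flash",  # placeholder slug; check OpenRouter's catalog
        messages=[
            {"role": "system", "content": "You control an iPhone by calling tools."},
            {"role": "user", "content": transcript},
        ],
        tools=TOOLS,
    )
    return response.choices[0].message  # may contain tool_calls to execute
```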

An ESP32 pairs with the iPhone as a Bluetooth keyboard and mouse, joins the same WiFi, and connects to our relay over WebSocket. The app sends commands (e.g. type, tap, swipe) via HTTP; the relay forwards them to the ESP32, which turns them into keypresses and taps. The agent controls the phone through this bridge.
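
A stripped-down version of that bridge is sketched below: the ESP32 keeps one WebSocket open to the relay, and an HTTP endpoint forwards each command string from the app down that socket. The endpoint paths and the single-device assumption are simplifications of our actual server.

```python
# Sketch of the relay: the ESP32 keeps a WebSocket open, the iOS app POSTs
# commands over HTTP, and the relay forwards them down the socket.
# Endpoint names and the single-device assumption are simplifications.
from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
from pydantic import BaseModel

app = FastAPI()
esp32_socket: WebSocket | None = None  # the one connected controller

class Command(BaseModel):
    command: str  # e.g. "TAP 613 2070", "TYPE hello", "SWIPE UP"

@app.websocket("/esp32")
async def esp32_endpoint(ws: WebSocket):
    """The ESP32 connects here once it joins WiFi and stays connected."""
    global esp32_socket
    await ws.accept()
    esp32_socket = ws
    try:
        while True:
            await ws.receive_text()  # keepalives / acks from the board
    except WebSocketDisconnect:
        esp32_socket = None

@app.post("/command")
async def send_command(cmd: Command):
    """Called by the iOS app; forwards the command string to the ESP32."""
    if esp32_socket is None:
        raise HTTPException(status_code=503, detail="ESP32 not connected")
    await esp32_socket.send_text(cmd.command)
    return {"status": "sent"}
```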

The relay runs in the cloud so the app and ESP32 can both reach it. A /plan endpoint takes a task, uses OpenRouter with app-specific prompts (Uber, DoorDash, etc.), and returns a step-by-step plan. We keep short text files that describe each app’s typical UI flow.
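
A simplified sketch of that endpoint is shown below; it loads the per-app guide from a text file, asks the model for a numbered plan, and returns it. The file layout, prompt wording, and model slug are placeholders rather than our exact code.

```python
# Sketch of the /plan endpoint: load the short per-app UI guide, ask the model
# for a numbered step-by-step plan, and return it. Shown standalone for brevity;
# in the real relay it lives alongside the /command endpoint.
from pathlib import Path
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")
GUIDES_DIR = Path("app_guides")  # e.g. app_guides/doordash.txt, app_guides/uber.txt

class PlanRequest(BaseModel):
    task: str       # "order a burger from The Melt"
    app_name: str   # "doordash"

@app.post("/plan")
def make_plan(req: PlanRequest):
    guide_path = GUIDES_DIR / f"{req.app_name}.txt"
    guide = guide_path.read_text() if guide_path.exists() else ""
    response = client.chat.completions.create(
        model="google/gemini-3-flash",  # placeholder slug
        messages=[
            {"role": "system",
             "content": "Produce a numbered step-by-step plan for operating the app.\n"
                        f"Typical UI flow for {req.app_name}:\n{guide}"},
            {"role": "user", "content": req.task},
        ],
    )
    return {"steps": response.choices[0].message.content}
```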

BLE HID on iOS needed tuning for pairing, reconnection, and keypress timing. We used a single relay so the app and the ESP32 don’t need hardcoded IP addresses, and we had to coordinate the broadcast extension with the main app and handle failed tool-call chains carefully.

Challenges we ran into

Building Pilot required solving tightly coupled hardware, iOS, and infrastructure constraints. BLE cannot transmit rich data and HID payloads are size-constrained, so we separated reasoning from execution: all inference runs server-side, and the relay forwards only compact command strings to the ESP32 over a persistent WebSocket.
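
As an illustration of what we mean by compact command strings, the sketch below shows the kind of short ASCII encoding that fits easily into small payloads; the exact grammar is an assumption, not our wire protocol.

```python
# Sketch of a compact command encoding: the server does all the reasoning and
# only short ASCII strings like these cross the WebSocket to the ESP32.
# The grammar here is illustrative, not our exact wire protocol.

def encode_tap(x: int, y: int) -> str:
    return f"TAP {x} {y}"

def encode_type(text: str) -> str:
    # Keep payloads short; longer text is split into several TYPE commands.
    return f"TYPE {text[:64]}"

def encode_swipe(direction: str) -> str:
    assert direction in {"UP", "DOWN", "LEFT", "RIGHT"}
    return f"SWIPE {direction}"

print(encode_tap(613, 2070))    # -> "TAP 613 2070"  (about a dozen bytes)
print(encode_type("the melt"))  # -> "TYPE the melt"
```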

Full-duplex audio introduced feedback loops where the speech recognizer transcribed Pilot's own voice. We resolved this by routing recognition and TTS through a single shared AVAudioEngine, keeping the session alive across cycles so Pilot can be interrupted mid-sentence without tearing down the audio pipeline.

Screen capture on non-jailbroken iOS required a ReplayKit Broadcast Extension running in a separate process with no direct memory access to the main app. We bridged this with a shared App Group filesystem where frames are compressed, written atomically, and consumed on demand by the vision pipeline.
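
The part of that bridge worth spelling out is the atomic write: each compressed frame goes to a temporary file first and is then renamed into place, so the consumer never reads a half-written frame. The sketch below shows the pattern in Python for brevity; our real implementation is Swift, and the container path is a placeholder.

```python
# Sketch of the atomic frame handoff (shown in Python for brevity; the real
# implementation is Swift writing into the shared App Group container).
import os
from pathlib import Path

SHARED_DIR = Path("/path/to/app-group-container")  # placeholder path
FRAME_PATH = SHARED_DIR / "latest_frame.jpg"

def publish_frame(jpeg_bytes: bytes) -> None:
    """Write the compressed frame to a temp file, then rename it into place.
    os.replace is atomic on the same filesystem, so a reader opening
    latest_frame.jpg always sees a complete frame, never a partial write."""
    tmp_path = FRAME_PATH.with_suffix(".jpg.tmp")
    tmp_path.write_bytes(jpeg_bytes)
    os.replace(tmp_path, FRAME_PATH)

def read_latest_frame() -> bytes | None:
    """The vision pipeline reads the most recent frame on demand."""
    return FRAME_PATH.read_bytes() if FRAME_PATH.exists() else None
```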

Action reliability was the hardest challenge. BLE HID caps cursor movement at 127 pixels per packet, iOS drops keystrokes under aggressive timing, and UI states vary unpredictably. We built a recursive tool-calling loop that executes, captures a screenshot, verifies the result, and adapts if the screen doesn't match expectations.
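
That 127-pixel cap reflects the signed 8-bit relative deltas used in standard HID mouse reports, so long moves have to be split into clamped steps. The sketch below shows the idea; the MOVE command format and helper names are illustrative rather than our exact protocol.

```python
# Sketch: split a long relative cursor move into steps that respect the
# signed 8-bit delta limit (+/-127) of a standard HID mouse report.
# The "MOVE dx dy" command format is illustrative.

MAX_DELTA = 127

def chunk_move(dx: int, dy: int) -> list[tuple[int, int]]:
    """Break (dx, dy) into a sequence of per-packet deltas, each within +/-127."""
    steps = []
    while dx != 0 or dy != 0:
        step_x = max(-MAX_DELTA, min(MAX_DELTA, dx))
        step_y = max(-MAX_DELTA, min(MAX_DELTA, dy))
        steps.append((step_x, step_y))
        dx -= step_x
        dy -= step_y
    return steps

def move_commands(dx: int, dy: int) -> list[str]:
    return [f"MOVE {sx} {sy}" for sx, sy in chunk_move(dx, dy)]

# Moving 300 px right and 500 px down becomes four packets:
# MOVE 127 127, MOVE 127 127, MOVE 46 127, MOVE 0 119
print(move_commands(300, 500))
```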

Accomplishments that we're proud of

  • Built a complete hardware + AI phone agent from scratch in one weekend.
  • Full-duplex voice interaction — Pilot listens, speaks, and controls the phone simultaneously.
  • End-to-end autonomous loop: voice → AI reasoning → screen capture → HID execution → visual verification.
  • Integrated Gemini, Cartesia TTS, ReplayKit, and ESP32 BLE into a unified real-time pipeline on a physical iPhone.

Pilot is a fully operational, voice-controlled phone agent that works end-to-end on a real device.

What we learned

  • Embedded systems + BLE — Managing dual-radio coexistence on ESP32 and WebSocket keepalive over unstable connections.
  • iOS platform constraints — Screen capture requires a separate system process with filesystem-based IPC, and iOS blocks all forms of absolute mouse positioning from external devices.
  • Full-duplex audio — Routing TTS playback alongside live speech recognition on a shared audio engine without feedback loops.
  • Autonomous agent design — Building recursive tool-calling loops where AI reasons, acts, verifies, and adapts (a skeleton of this loop is sketched below).
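
For context, the skeleton below shows the shape of that loop: act, capture a fresh screenshot, check whether the screen matches the expected state, and retry with a corrective action when it doesn't. The helper functions are placeholder stand-ins for the real tools (HID relay call, ReplayKit frame read, and a model-backed verification prompt).

```python
# Skeleton of the execute -> capture -> verify -> adapt loop. The three helper
# functions below are placeholder stand-ins for the real tools.
from dataclasses import dataclass

@dataclass
class Verdict:
    matches: bool
    suggested_action: str = ""

def send_command(command: str) -> None:
    print(f"[relay] {command}")      # placeholder: POST the command to the relay

def capture_screenshot() -> bytes:
    return b""                       # placeholder: read the latest shared frame

def model_verify(screenshot: bytes, expectation: str) -> Verdict:
    return Verdict(matches=True)     # placeholder: ask the model to compare

MAX_ATTEMPTS = 3

def run_step(action: str, expectation: str) -> bool:
    """Execute one action and keep adapting until the screen matches."""
    for _ in range(MAX_ATTEMPTS):
        send_command(action)
        verdict = model_verify(capture_screenshot(), expectation)
        if verdict.matches:
            return True
        action = verdict.suggested_action  # model proposes a corrective action
    return False

def run_plan(steps: list[tuple[str, str]]) -> bool:
    """Run (action, expected_state) pairs end to end, stopping on failure."""
    return all(run_step(action, expected) for action, expected in steps)

print(run_plan([("TAP 613 2070", "cart screen is visible")]))
```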

What's next for Pilot

Our vision for Pilot’s technical evolution focuses on two key areas: expanding autonomous capabilities and broadening platform reach. We are developing real-time intent prediction using patterns in a user’s daily routines, calendar events, and usage history to proactively suggest and execute actions before they are even requested. Imagine Pilot recognizing you have a flight in two hours and automatically opening your boarding pass, checking for gate changes, and notifying your ride of your updated arrival time, all without a prompt.

On the platform side, we are architecting Pilot to be truly cross-platform, extending the same core architecture to Android, macOS, and Windows. The relay server and AI reasoning layer are already device agnostic. Any platform capable of screen capture and Bluetooth HID input can integrate into the same execution pipeline. We are also advancing visual grounding models that identify and localize UI elements directly from screenshots, eliminating reliance on prewritten app descriptions and enabling autonomous navigation across unfamiliar interfaces.

We are particularly excited about enabling multi-app workflows end to end, chaining actions across Messages, Calendar, Maps, and other applications within a single natural language request. This brings Pilot closer to a universal, device-level agent that operates seamlessly across every screen a person uses.

Built With

ESP32 · Swift (iOS) · ReplayKit · Apple Speech · Bluetooth LE HID · FastAPI · Cloudflare Tunnel · OpenRouter (Gemini 3 Flash) · Moondream · Cartesia