Inspiration

42.5 million Americans live with a disability. For many, smart devices (like smartwatches) and automations (like Apple Shortcuts) have become a critical avenue for reclaiming independence. They enable users to communicate with loved ones, purchase items digitally, or control their home and workspace remotely.

Still, automations and existing voice assistants (like Siri and Google Assistant) are limited in what they can do, which makes piloting a device challenging for the people who depend on these solutions. Simply put, users cannot translate their intentions into actions.

AnyWear is a powerful, mobile-first autonomous agent that lets users communicate their intent in natural language. Powered by devices people already love (smartphones and smartwatches), AnyWear is a first step towards reinventing how all users interact with their devices.

What it does

With natural language alone, users can direct AnyWear to perform multi-step mobile actions, navigate the web, and interact with virtually any smart device's user interface.

By leveraging the latest AI models (Google Gemini 1.5, GPT-4 Vision) alongside a custom-trained user interface recognition model, AnyWear can interpret almost any screen with high accuracy. Combined with a companion input emulator, AnyWear is not only able to "see" (understand) a page, but also to act on it based on the user's goals.

How we built it

AnyWear is composed of 3 critical components (more details further below):

  1. A custom-trained user interface recognition model
  2. An orchestrator server managing our AI models (Computer Vision, Large Action Models, LLMs) and our emulators
  3. An input emulator allowing our AI to perform actions using a Raspberry Pi

CV Model: Our computer vision model was trained for 50 epochs on 1,200 user interface images from the VINS dataset, achieving a mean precision of 85%.
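
To give a concrete sense of how the orchestrator consumes this model, here is a minimal TypeScript sketch of posting a screenshot to the detection endpoint and parsing the returned UI elements. The endpoint URL, port, and response shape are illustrative assumptions, not our exact API.

```typescript
// Minimal sketch: query the UI-recognition model over HTTP and parse detections.
// The endpoint URL, port, and response shape are illustrative assumptions.
import { readFile } from "node:fs/promises";

interface UIDetection {
  label: string;                          // e.g. "Button", "TextField" (assumed class names)
  confidence: number;                     // model score in [0, 1]
  box: [number, number, number, number];  // x, y, width, height in pixels
}

export async function detectUIElements(screenshotPath: string): Promise<UIDetection[]> {
  const image = await readFile(screenshotPath);
  // POST the raw screenshot bytes to the (hypothetical) inference endpoint.
  const res = await fetch("http://localhost:5000/detect", {
    method: "POST",
    headers: { "Content-Type": "application/octet-stream" },
    body: image,
  });
  if (!res.ok) throw new Error(`CV model returned ${res.status}`);
  return (await res.json()) as UIDetection[];
}
```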

Orchestrator: Our orchestrator is a Node server that handles WebSocket connections to the companion watch app, HTTP POST requests to our computer vision model, and function calls to our LLMs (Google Gemini, GPT).
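
A stripped-down sketch of that loop is below, assuming the `ws` and `@google/generative-ai` packages; the message shapes exchanged between watch, server, and emulator, and the prompt format, are assumptions for illustration rather than our exact protocol.

```typescript
// Minimal sketch of the orchestrator: accept a watch connection, plan an action
// with Gemini based on the detected UI elements, and reply with the action.
// Message shapes and the prompt format are illustrative assumptions.
import { WebSocketServer } from "ws";
import { GoogleGenerativeAI } from "@google/generative-ai";
import { detectUIElements } from "./cv"; // hypothetical module holding the sketch above

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const gemini = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  socket.on("message", async (raw) => {
    // Assumed message from the watch: { command: string, screenshotPath: string }
    const { command, screenshotPath } = JSON.parse(raw.toString());
    const elements = await detectUIElements(screenshotPath);

    const prompt =
      `User goal: "${command}".\n` +
      `Visible UI elements: ${JSON.stringify(elements)}.\n` +
      `Reply with JSON {"action": "tap", "x": <px>, "y": <px>} for the next step.`;
    const result = await gemini.generateContent(prompt);

    // Reply with the planned action; in the full system this is routed to the
    // Raspberry Pi emulator rather than back to the watch.
    socket.send(result.response.text());
  });
});
```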

Watch App: The companion watch app is built with Kotlin for Wear OS.

Emulator: We handle input emulation with a Raspberry Pi over Bluetooth; the Pi connects to the orchestrator server via WebSockets.
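
On the Pi side, a minimal client could look like the sketch below. The Bluetooth HID emulation itself is stubbed out with a placeholder function, and the hostname is an assumption; the real implementation depends on the Pi's HID setup.

```typescript
// Minimal sketch of the Raspberry Pi emulator client: listen for planned actions
// and replay them as input events. sendBluetoothTap is a placeholder for our
// actual Bluetooth HID emulation; the hostname is an assumption.
import WebSocket from "ws";

async function sendBluetoothTap(x: number, y: number): Promise<void> {
  // A real implementation would emit a Bluetooth HID report; here we only log.
  console.log(`tap at (${x}, ${y})`);
}

const socket = new WebSocket("ws://orchestrator.local:8080");

socket.on("message", async (raw) => {
  const action = JSON.parse(raw.toString());
  if (action.action === "tap") {
    await sendBluetoothTap(action.x, action.y);
  }
});
```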

Architecture diagram of AnyWear

Challenges we ran into

Setting up Bluetooth connectivity in Kotlin on Wear OS was exceptionally difficult. None of us were proficient Wear OS developers, nor were we familiar with Kotlin.

Ensuring application stability across the WebSocket connections and HTTP requests to our AI models was also challenging.

Accomplishments that we're proud of

  1. Built a Wear OS app with Kotlin
  2. Leveraged powerful new technologies like Gemini 1.5 Pro
  3. Managed to go swimming

What's next for AnyWear

We believe autonomous agents will revolutionize how we use our electronic devices. AnyWear is a first step towards a future where natural language can be mapped onto any action. We aim to implement stronger accessibility features and a wider set of capabilities for AnyWear so that it can become a universally capable agent.
