Inspiration
Millions of users cannot reliably use a mouse or keyboard due to motor impairments, repetitive strain injuries, or temporary conditions. Most existing accessibility tools rely on a single input modality, forcing users to adapt to rigid interaction styles.
At the same time, voice assistants lack awareness of what's happening on screen: saying "click that" means nothing without visual grounding.
We wanted to build a system that adapts to the user, not the other way around, by combining vision, voice, and alternative input methods into a single, remote interaction loop.
What it does
Our system enables remote computer control by combining real-time screen understanding with natural user commands.
- Captures the current screen and detects UI-like regions (buttons, inputs, links) using computer vision
- Overlays short, deterministic visual hints on detected regions
- Allows users to:
  - Move the cursor using eye tracking
  - Select targets using hand gestures
  - Issue commands using natural voice input
- Interprets voice commands in the context of what’s on screen to decide what action to take and where
- Runs entirely locally, with no cloud calls required for detection or selection
How we built it
- Python + Pillow for fast screen capture (rough sketches of the main pieces follow this list)
- OpenCV (Canny edge detection, contour extraction, geometric filtering) for UI region proposals
- Deterministic hint generation using compact Cartesian products to keep hints short and predictable
- Two overlay implementations:
  - A lightweight GTK/Cairo overlay for minimal overhead
  - A pygame fullscreen fallback to avoid heavy dependencies in constrained environments
- pynput for mouse control and execution
- Clean data models (enums + dataclasses) to represent UI elements, states, and actions
- Tunable geometric filters (size, aspect ratio, caps) to reduce noise and stabilize region ordering
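
For reference, here is a minimal sketch of the capture → detect → hint path, assuming Pillow's `ImageGrab` and OpenCV 4.x. The thresholds, the hint alphabet, and the function names are illustrative stand-ins, not the exact values we shipped.

```python
import itertools

import cv2
import numpy as np
from PIL import ImageGrab


def capture_screen() -> np.ndarray:
    """Grab the current screen with Pillow and convert it to an OpenCV BGR image."""
    frame = ImageGrab.grab().convert("RGB")
    return cv2.cvtColor(np.array(frame), cv2.COLOR_RGB2BGR)


def propose_regions(image: np.ndarray, max_regions: int = 80) -> list[tuple[int, int, int, int]]:
    """Propose UI-like rectangles via Canny edges, contour extraction, and geometric filters."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    boxes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        aspect = w / max(h, 1)
        # Tunable filters: drop tiny specks, huge panels, and extreme aspect ratios.
        if 12 <= w <= 600 and 12 <= h <= 200 and 0.2 <= aspect <= 15:
            boxes.append((x, y, w, h))

    # Deterministic ordering (top-to-bottom, then left-to-right) keeps hint labels
    # stable across frames, and the cap keeps the overlay from flooding the screen.
    boxes.sort(key=lambda b: (b[1], b[0]))
    return boxes[:max_regions]


def generate_hints(count: int, alphabet: str = "asdfghjkl") -> list[str]:
    """Short, predictable hint labels from a compact Cartesian product: aa, as, ad, ..."""
    pairs = itertools.product(alphabet, repeat=2)
    return ["".join(pair) for pair, _ in zip(pairs, range(count))]


if __name__ == "__main__":
    screen = capture_screen()
    regions = propose_regions(screen)
    for hint, (x, y, w, h) in zip(generate_hints(len(regions)), regions):
        print(f"{hint}: ({x}, {y}) {w}x{h}")
```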
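
The pygame fallback overlay is roughly this shape: blit the captured screenshot as a backdrop, stamp the hint labels on top, and wait for input. The styling and event handling here are assumptions for illustration.

```python
import pygame
from PIL import Image


def show_hint_overlay(screenshot: Image.Image,
                      hinted_regions: list[tuple[str, tuple[int, int, int, int]]]) -> None:
    """Fullscreen pygame overlay: screenshot as backdrop, hint labels over each region."""
    pygame.init()
    surface = pygame.display.set_mode((0, 0), pygame.FULLSCREEN)
    # Assumes the screenshot is a PIL image in RGB mode.
    backdrop = pygame.image.frombuffer(screenshot.tobytes(), screenshot.size, "RGB")
    surface.blit(backdrop, (0, 0))

    font = pygame.font.SysFont("monospace", 18, bold=True)
    for hint, (x, y, _w, _h) in hinted_regions:
        # Yellow badge with black text at the region's top-left corner.
        label = font.render(hint, True, (0, 0, 0), (255, 220, 0))
        surface.blit(label, (x, y))
    pygame.display.flip()

    # Block until any key press (or window close) dismisses the overlay.
    while True:
        event = pygame.event.wait()
        if event.type in (pygame.KEYDOWN, pygame.QUIT):
            break
    pygame.quit()
```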
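
And the data models plus the pynput execution step look roughly like this; the specific enum members and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

from pynput.mouse import Button, Controller


class ElementKind(Enum):
    """Kinds of UI-like regions the detector proposes (illustrative set)."""
    BUTTON = auto()
    INPUT = auto()
    LINK = auto()
    UNKNOWN = auto()


class ActionType(Enum):
    CLICK = auto()
    DOUBLE_CLICK = auto()
    FOCUS = auto()


@dataclass(frozen=True)
class UIElement:
    hint: str          # short deterministic label shown in the overlay, e.g. "as"
    kind: ElementKind
    x: int             # bounding box in screen coordinates
    y: int
    width: int
    height: int

    @property
    def center(self) -> tuple[int, int]:
        return self.x + self.width // 2, self.y + self.height // 2


@dataclass(frozen=True)
class Action:
    type: ActionType
    target: UIElement


def execute(action: Action, mouse: Optional[Controller] = None) -> None:
    """Move the cursor to the target's center, then perform the requested click."""
    mouse = mouse or Controller()
    mouse.position = action.target.center
    if action.type is ActionType.CLICK:
        mouse.click(Button.left, 1)
    elif action.type is ActionType.DOUBLE_CLICK:
        mouse.click(Button.left, 2)
    # FOCUS just parks the cursor on the element without clicking.
```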
Challenges we ran into
- Dependency conflicts between NumPy and OpenCV in shared and pre-configured environments
- The speech-to-text (STT) model did not pick up our voices reliably
Accomplishments that we're proud of
- A complete end-to-end local interaction loop: capture → detect → overlay → select → click
- No reliance on cloud services for the core accessibility pipeline
- Deterministic hint ordering and limits, which made the system reliable during live demos
- A modular design that allows voice, vision, and gesture components to evolve independently
What we learned
- Accessibility tools must prioritize predictability and trust, not just raw intelligence
- Multimodal systems are far more usable when each input reinforces the others
What's next for Triad
- Add OCR over detected regions and combine transcript + region text for semantic targeting (a rough sketch of this follows)
- Integrate voice triggers to enter selection mode and pick hints verbally
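
A rough sketch of how the planned OCR + transcript matching could look, assuming pytesseract over the regions the detector already produces; the word-overlap scoring is a placeholder, not a finished design.

```python
import cv2
import numpy as np
import pytesseract


def ocr_region(image: np.ndarray, box: tuple[int, int, int, int]) -> str:
    """OCR a single detected region from the BGR screenshot."""
    x, y, w, h = box
    gray = cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray).strip().lower()


def pick_semantic_target(transcript: str, image: np.ndarray,
                         boxes: list[tuple[int, int, int, int]]):
    """Pick the region whose OCR text shares the most words with the voice transcript."""
    spoken = set(transcript.lower().split())
    best_box, best_score = None, 0
    for box in boxes:
        score = len(spoken & set(ocr_region(image, box).split()))
        if score > best_score:
            best_box, best_score = box, score
    return best_box  # None means no confident match; fall back to hint selection
```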


