Inspiration
Millions of users cannot reliably use a mouse or keyboard due to motor impairments, repetitive strain injuries, or temporary conditions. Most existing accessibility tools rely on a single input modality, forcing users to adapt to rigid interaction styles.
At the same time, voice assistants lack awareness of what's happening on screen: saying "click that" means nothing without visual grounding.
We wanted to build a system that adapts to the user, not the other way around, by combining vision, voice, and alternative input methods into a single, remote interaction loop.
What it does
Our system enables remote computer control by combining real-time screen understanding with natural user commands.
- Captures the current screen and detects UI-like regions (buttons, inputs, links) using computer vision
- Overlays short, deterministic visual hints on detected regions
- Allows users to:
  - Move the cursor using eye tracking
  - Select targets using hand gestures
  - Issue commands using natural voice input
- Interprets voice commands in the context of what’s on screen to decide what action to take and where
- Runs entirely locally, with no cloud calls required for detection or selection
How we built it
- Python + Pillow for fast screen capture (rough sketches of the main pieces follow this list)
- OpenCV (Canny edge detection, contour extraction, geometric filtering) for UI region proposals
- Deterministic hint generation using compact Cartesian products to keep hints short and predictable
- Two overlay implementations:
  - A lightweight GTK/Cairo overlay for minimal overhead
  - A pygame fullscreen fallback to avoid heavy dependencies in constrained environments
- pynput for mouse control and execution
- Clean data models (enums + dataclasses) to represent UI elements, states, and actions
- Tunable geometric filters (size, aspect ratio, caps) to reduce noise and stabilize region ordering
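
For reference, here is a minimal sketch of the capture → detect → hint path, assuming Pillow's `ImageGrab` and OpenCV 4.x. The thresholds, the hint alphabet, and the function names are illustrative stand-ins, not the exact values we shipped.

```python
import itertools

import cv2
import numpy as np
from PIL import ImageGrab


def capture_screen() -> np.ndarray:
    """Grab the current screen with Pillow and convert it to an OpenCV BGR image."""
    frame = ImageGrab.grab().convert("RGB")
    return cv2.cvtColor(np.array(frame), cv2.COLOR_RGB2BGR)


def propose_regions(image: np.ndarray, max_regions: int = 80) -> list[tuple[int, int, int, int]]:
    """Propose UI-like rectangles via Canny edges, contour extraction, and geometric filters."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    boxes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        aspect = w / max(h, 1)
        # Tunable filters: drop tiny specks, huge panels, and extreme aspect ratios.
        if 12 <= w <= 600 and 12 <= h <= 200 and 0.2 <= aspect <= 15:
            boxes.append((x, y, w, h))

    # Deterministic ordering (top-to-bottom, then left-to-right) keeps hint labels
    # stable across frames, and the cap keeps the overlay from flooding the screen.
    boxes.sort(key=lambda b: (b[1], b[0]))
    return boxes[:max_regions]


def generate_hints(count: int, alphabet: str = "asdfghjkl") -> list[str]:
    """Short, predictable hint labels from a compact Cartesian product: aa, as, ad, ..."""
    pairs = itertools.product(alphabet, repeat=2)
    return ["".join(pair) for pair, _ in zip(pairs, range(count))]


if __name__ == "__main__":
    screen = capture_screen()
    regions = propose_regions(screen)
    for hint, (x, y, w, h) in zip(generate_hints(len(regions)), regions):
        print(f"{hint}: ({x}, {y}) {w}x{h}")
```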
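
The pygame fallback overlay is roughly this shape: blit the captured screenshot as a backdrop, stamp the hint labels on top, and wait for input. The styling and event handling here are assumptions for illustration.

```python
import pygame
from PIL import Image


def show_hint_overlay(screenshot: Image.Image,
                      hinted_regions: list[tuple[str, tuple[int, int, int, int]]]) -> None:
    """Fullscreen pygame overlay: screenshot as backdrop, hint labels over each region."""
    pygame.init()
    surface = pygame.display.set_mode((0, 0), pygame.FULLSCREEN)
    # Assumes the screenshot is a PIL image in RGB mode.
    backdrop = pygame.image.frombuffer(screenshot.tobytes(), screenshot.size, "RGB")
    surface.blit(backdrop, (0, 0))

    font = pygame.font.SysFont("monospace", 18, bold=True)
    for hint, (x, y, _w, _h) in hinted_regions:
        # Yellow badge with black text at the region's top-left corner.
        label = font.render(hint, True, (0, 0, 0), (255, 220, 0))
        surface.blit(label, (x, y))
    pygame.display.flip()

    # Block until any key press (or window close) dismisses the overlay.
    while True:
        event = pygame.event.wait()
        if event.type in (pygame.KEYDOWN, pygame.QUIT):
            break
    pygame.quit()
```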
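
And the data models plus the pynput execution step look roughly like this; the specific enum members and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

from pynput.mouse import Button, Controller


class ElementKind(Enum):
    """Kinds of UI-like regions the detector proposes (illustrative set)."""
    BUTTON = auto()
    INPUT = auto()
    LINK = auto()
    UNKNOWN = auto()


class ActionType(Enum):
    CLICK = auto()
    DOUBLE_CLICK = auto()
    FOCUS = auto()


@dataclass(frozen=True)
class UIElement:
    hint: str          # short deterministic label shown in the overlay, e.g. "as"
    kind: ElementKind
    x: int             # bounding box in screen coordinates
    y: int
    width: int
    height: int

    @property
    def center(self) -> tuple[int, int]:
        return self.x + self.width // 2, self.y + self.height // 2


@dataclass(frozen=True)
class Action:
    type: ActionType
    target: UIElement


def execute(action: Action, mouse: Optional[Controller] = None) -> None:
    """Move the cursor to the target's center, then perform the requested click."""
    mouse = mouse or Controller()
    mouse.position = action.target.center
    if action.type is ActionType.CLICK:
        mouse.click(Button.left, 1)
    elif action.type is ActionType.DOUBLE_CLICK:
        mouse.click(Button.left, 2)
    # FOCUS just parks the cursor on the element without clicking.
```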
Challenges we ran into
- Dependency conflicts between NumPy and OpenCV in shared and pre-configured environments
- The speech-to-text (STT) model did not pick up our voices reliably
Accomplishments that we're proud of
- A complete end-to-end local interaction loop: capture → detect → overlay → select → click
- No reliance on cloud services for the core accessibility pipeline
- Deterministic hint ordering and limits, which made the system reliable during live demos
- A modular design that allows voice, vision, and gesture components to evolve independently
What we learned
- Accessibility tools must prioritize predictability and trust, not just raw intelligence
- Multimodal systems are far more usable when each input reinforces the others
What's next for Triad
- Add OCR over detected regions and combine transcript + region text for semantic targeting (a rough sketch of this follows)
- Integrate voice triggers to enter selection mode and pick hints verbally
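
A rough sketch of how the planned OCR + transcript matching could look, assuming pytesseract over the regions the detector already produces; the word-overlap scoring is a placeholder, not a finished design.

```python
import cv2
import numpy as np
import pytesseract


def ocr_region(image: np.ndarray, box: tuple[int, int, int, int]) -> str:
    """OCR a single detected region from the BGR screenshot."""
    x, y, w, h = box
    gray = cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray).strip().lower()


def pick_semantic_target(transcript: str, image: np.ndarray,
                         boxes: list[tuple[int, int, int, int]]):
    """Pick the region whose OCR text shares the most words with the voice transcript."""
    spoken = set(transcript.lower().split())
    best_box, best_score = None, 0
    for box in boxes:
        score = len(spoken & set(ocr_region(image, box).split()))
        if score > best_score:
            best_box, best_score = box, score
    return best_box  # None means no confident match; fall back to hint selection
```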


