Proxi

Proxi - your trusted work execution engine

Inspiration

AI has become excellent at reasoning, summarizing, and suggesting.
But when it’s time to act—open a CRM, navigate a legacy UI, verify pricing, prepare a presentation—execution is still trapped behind keyboards and screens.

While many API-driven and RPA systems try to solve this, much work happens in tools that expose no API or follows paths that cannot be scripted in advance.

The inspiration for Proxi came from real moments: negotiating pricing in a meeting without a laptop, knowing the data existed on an office computer but being unable to reach it in time.

We realized the real gap wasn’t intelligence—it was trustworthy execution.
AI could already think. What was missing was a way for AI to act on real systems without lying, breaking things, or bypassing human control.

About the Project

Proxi is a Gemini-powered, OS-level execution platform that performs verified actions on real computers.

Unlike typical agents that claim success, Proxi proves what actually happened.

From a phone or browser, a user can delegate work such as:

Navigating real desktop applications and browsers without prior hardcoded instructions
Exploring system, application, and internal tools without APIs
Collecting information across multiple screens and reasoning about next steps
Preparing artifacts and taking real actions (e.g., modifying existing PPTs, sending emails)
Receiving screenshot-based evidence for every critical step

Proxi is designed for the real world:

Detects failures and self-corrects instead of hallucinating success
Enforces safety through non-bypassable execution policies
Adapts to OS constraints such as locked desktops
Keeps humans in control of all impactful actions

Proxi is not an autonomous replacement for people.
It is a control plane for trustworthy AI execution.

How We Built It

Proxi is built as a hybrid reasoning + execution system.

Gemini 3 is used as the reasoning engine to:

Understand user intent and create an actionable plan
Interpret unfamiliar desktop interfaces
Decide navigation paths dynamically
Recover from incorrect assumptions
Synthesize information across multiple screens

Additional components:

A backend execution agent runs directly on the target system
UI interaction uses real mouse, keyboard, and scrolling events
Screenshots are continuously captured and returned as evidence
When a desktop is locked, Proxi automatically falls back to terminal-only execution

Gemini is not used merely to generate scripts.
It is used to reason inside real environments.
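
The locked-desktop fallback described above can be sketched roughly as follows. This is a minimal illustration, not Proxi's actual code: the names `Mode`, `Step`, `choose_mode`, and `runnable_steps` are hypothetical, the lock state is passed in as a plain boolean because the detection method is not described here, and deferring UI-only steps until the desktop unlocks is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DESKTOP = "desktop"    # real mouse, keyboard, and scroll events
    TERMINAL = "terminal"  # shell commands only


@dataclass
class Step:
    description: str
    needs_ui: bool  # True if the step can only be done through the GUI


def choose_mode(desktop_locked: bool) -> Mode:
    """Fall back to terminal-only execution when the desktop is locked."""
    return Mode.TERMINAL if desktop_locked else Mode.DESKTOP


def runnable_steps(steps, mode):
    """Split a plan into (steps that can run now, steps that must wait)."""
    if mode is Mode.DESKTOP:
        return list(steps), []
    # Assumption: in terminal mode, UI-only steps are deferred, not faked.
    run_now = [s for s in steps if not s.needs_ui]
    deferred = [s for s in steps if s.needs_ui]
    return run_now, deferred


plan = [
    Step("Read the export folder listing", needs_ui=False),
    Step("Click 'Export' in the CRM window", needs_ui=True),
]
run_now, deferred = runnable_steps(plan, choose_mode(desktop_locked=True))
```

With a locked desktop, only the first step lands in `run_now`; the click is held back rather than reported as done.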

Trust by Design

Most agents decide success themselves. Proxi does not.

Human-in-the-Loop Feedback

Users can request additional data to guide next steps
Plans adapt based on user feedback
Users can execute safe system commands directly (e.g., !pwd)
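
As an illustration of the direct-command path, here is a hedged sketch of how a `!`-prefixed message might be parsed and checked. The allowlist contents and the function name `parse_direct_command` are assumptions made for this example; only the `!pwd` form comes from the description above.

```python
import shlex
from typing import List, Optional

# Hypothetical allowlist of read-only commands; the real list is not specified.
SAFE_COMMANDS = {"pwd", "ls", "whoami", "date"}


def parse_direct_command(message: str) -> Optional[List[str]]:
    """Return argv for an allowed `!` command, or None if it is rejected."""
    if not message.startswith("!"):
        return None  # ordinary message, not a direct command
    try:
        argv = shlex.split(message[1:])
    except ValueError:
        return None  # unbalanced quotes or similar malformed input
    if not argv or argv[0] not in SAFE_COMMANDS:
        return None
    # The caller should run argv without a shell, e.g.
    # subprocess.run(argv), so shell metacharacters are never interpreted.
    return argv
```

Returning an argument list instead of a string keeps chained input such as `!ls; rm -rf /` from slipping through: the first token becomes `ls;`, which is not on the allowlist.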

Verified Execution

Command outputs and screenshots as evidence
Visual confirmation returned to the user
No “agent said it worked”
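
One way to express "evidence, not claims" in code is to derive task status from the collected evidence alone. The sketch below is an assumption about how that rule could look; `Evidence`, `StepResult`, and `task_status` are hypothetical names, and the agent's own success claim is deliberately not an input.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    kind: str     # "command_output" or "screenshot"
    payload: str  # captured text, or a path to the saved image


@dataclass
class StepResult:
    description: str
    critical: bool
    evidence: List[Evidence] = field(default_factory=list)


def task_status(results: List[StepResult]) -> str:
    """Return "verified" only if every critical step carries evidence."""
    if not results:
        return "unverified"  # nothing ran, so nothing can be proven
    for result in results:
        if result.critical and not result.evidence:
            return "unverified"
    return "verified"
```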

Safety & Control

Safe actions: auto-allowed
Sensitive actions: require human approval
Destructive actions: permanently blocked
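
The three tiers above map naturally onto a small policy gate. The following is a sketch under stated assumptions: the action names in `TIERS` are invented examples, `is_allowed` is a hypothetical function, and treating unknown actions as sensitive is a design choice made here, not something the project states.

```python
from enum import Enum
from typing import Callable


class Tier(Enum):
    SAFE = "safe"
    SENSITIVE = "sensitive"
    DESTRUCTIVE = "destructive"


# Hypothetical classification table, for illustration only.
TIERS = {
    "read_file": Tier.SAFE,
    "take_screenshot": Tier.SAFE,
    "send_email": Tier.SENSITIVE,
    "modify_presentation": Tier.SENSITIVE,
    "format_disk": Tier.DESTRUCTIVE,
}


def is_allowed(action: str, ask_human: Callable[[str], bool]) -> bool:
    """Apply the tier policy; ask_human is called only for sensitive actions."""
    # Assumption: unknown actions are treated as sensitive, so nothing
    # unclassified runs without review.
    tier = TIERS.get(action, Tier.SENSITIVE)
    if tier is Tier.DESTRUCTIVE:
        return False  # blocked outright; approval is never requested
    if tier is Tier.SENSITIVE:
        return ask_human(action)
    return True  # safe actions are auto-allowed
```

Because the destructive branch returns before `ask_human` is consulted, a human approval cannot unlock a blocked action, which is what makes the policy non-bypassable in this sketch.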

Proxi never decides success. Reality does.

Challenges We Faced

Preventing hallucinated success
Many agents report completion even when execution fails. We solved this by requiring real-world evidence before marking tasks complete.

Balancing autonomy and safety
Instead of requiring approval for every click, we require it only for impactful actions.

Operating under real OS constraints
Desktop automation requires active sessions. Proxi detects locked states and adapts instead of breaking.

Reliability over flash
We prioritized correctness, transparency, and recovery over speed or visual polish.

What We Learned

Execution without verification cannot be trusted
Humans trust proof, not explanations
AI agents must adapt to constraints, not ignore them
Gemini excels at multimodal reasoning, not just text generation

Why Gemini 3 Matters

This project would not be possible without Gemini’s multimodal reasoning capabilities.

Gemini enables Proxi to:

Translate user intent into system tasks
Navigate unfamiliar desktop UIs
Explore multiple paths dynamically
Detect and recover from failures
Reason across visual context instead of relying on brittle automation

Proxi demonstrates Gemini as an execution brain, not merely a conversational model.

What’s Next

Proxi is intentionally focused on trustworthy execution first.

Multi-System Orchestration

Coordinated execution across multiple systems while preserving verification and safety.

UI-Level Safety Policies

Extending safety checks and approvals to high-impact UI actions (e.g., form submissions, record changes, email sending), without introducing per-click friction.

Secure Execution on Locked or Unattended Desktops

Enabling safe, policy-controlled execution when desktops are locked or unattended, while maintaining verification, auditability, and human control.

API + UI Hybrid Execution

Dynamically choosing the safest execution path: using APIs when they exist and falling back to UI execution when they don’t.

Unscriptable Workflow Support

Designed for workflows that are non-deterministic, UI-driven, exception-heavy, and context-dependent.

Vertical Specialization

Desktop & IT Support

Guided troubleshooting
Evidence-backed diagnostics
Safe remediation under approval

Security Operations

Incident triage
Evidence collection
Non-destructive containment
Human-in-the-loop enforcement

Verifier & Audit Layers

Independent verification, audit trails, and replay for compliance-sensitive environments.

Proxi’s roadmap is not about making agents more autonomous.
It is about making execution reliable, provable, and governable.

Built With

docker
gemini-aistudio
gemini3
node.js
opencv
pyautogui
python
pywinauto

Updates

Manoj Verma started this project — Feb 09, 2026 10:56 AM EST
