Inspiration

AI has become excellent at reasoning, summarizing, and suggesting.
But when it’s time to act—open a CRM, navigate a legacy UI, verify pricing, prepare a presentation—execution is still trapped behind keyboards and screens.

API-driven and RPA systems try to solve this, but much real work happens in tools that expose no API, or follows steps that cannot be scripted in advance.

The inspiration for Proxi came from real moments, such as negotiating pricing in a meeting without a laptop: we knew the data existed on an office computer, but we could not reach it in time.

We realized the real gap wasn’t intelligence—it was trustworthy execution.
AI could already think. What was missing was a way for AI to act on real systems without lying, breaking things, or bypassing human control.


About the Project

Proxi is a Gemini-powered, OS-level execution platform that performs verified actions on real computers.

Unlike typical agents that claim success, Proxi proves what actually happened.

From a phone or browser, a user can delegate work such as:

  • Navigating real desktop applications and browsers without prior hardcoded instructions
  • Exploring system tools, applications, and internal tools that have no APIs
  • Collecting information across multiple screens and reasoning about next steps
  • Preparing artifacts and taking real actions (e.g., modifying existing PPTs, sending emails)
  • Receiving screenshot-based evidence for every critical step

Proxi is designed for the real world:

  • Detects failures and self-corrects instead of hallucinating success
  • Enforces safety through non-bypassable execution policies
  • Adapts to OS constraints such as locked desktops
  • Keeps humans in control of all impactful actions

Proxi is not an autonomous replacement for people.
It is a control plane for trustworthy AI execution.


How We Built It

Proxi is built as a hybrid reasoning + execution system.

Gemini 3 is used as the reasoning engine to:

  • Understand user intent and create an actionable plan
  • Interpret unfamiliar desktop interfaces
  • Decide navigation paths dynamically
  • Recover from incorrect assumptions
  • Synthesize information across multiple screens
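
The reasoning steps above can be pictured as an observe, decide, act loop. The following is a minimal sketch under stated assumptions, not Proxi's actual code: `Action`, `run_loop`, and the `observe`, `decide`, and `act` callables are hypothetical names, and `decide` stands in for whatever call is made to the reasoning model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll", or "done"
    detail: str = ""   # free-form target or text for the action


def run_loop(
    goal: str,
    observe: Callable[[], str],
    decide: Callable[[str, str, List[Action]], Action],
    act: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Observe the screen, ask the reasoner for one action, perform it, repeat.

    All three callables are assumptions made for this sketch:
    observe() describes the current screen, decide() is the reasoning step,
    and act() performs the chosen action on the target system.
    """
    history: List[Action] = []
    for _ in range(max_steps):  # hard cap so the loop cannot run forever
        observation = observe()
        action = decide(goal, observation, history)
        history.append(action)
        if action.kind == "done":
            break
        act(action)
    return history
```

Because each decision is made from a fresh observation, a wrong assumption in one step can be corrected in the next instead of being baked into a script.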

Additional components:

  • A backend execution agent runs directly on the target system
  • UI interaction uses real mouse, keyboard, and scrolling events
  • Screenshots are continuously captured and returned as evidence
  • When a desktop is locked, Proxi automatically falls back to terminal-only execution
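
The locked-desktop fallback can be sketched as a small mode selector. This is an illustration only: `Mode`, `choose_mode`, and the `is_desktop_locked` probe are hypothetical names, and the probe itself is OS-specific and left to the caller.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    DESKTOP = "desktop"    # mouse, keyboard, and screenshots
    TERMINAL = "terminal"  # shell commands only


def choose_mode(is_desktop_locked: Callable[[], bool]) -> Mode:
    """Pick an execution mode from a lock probe supplied by the caller."""
    try:
        locked = is_desktop_locked()
    except Exception:
        # If the probe itself fails, assume there is no usable desktop session.
        locked = True
    return Mode.TERMINAL if locked else Mode.DESKTOP
```

Treating a failed probe as "locked" is a design choice made for this sketch: it prefers the more limited mode over sending UI events into a session that may not exist.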

Gemini is not used merely to generate scripts.
It is used to reason inside real environments.


Trust by Design

Most agents decide success themselves. Proxi does not.

Human-in-the-Loop Feedback

  • Users can request additional data to guide next steps
  • Plans adapt based on user feedback
  • Users can execute safe system commands directly (e.g., !pwd)
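
One way to handle direct commands such as !pwd is to route each user message by its prefix. The sketch below is an assumption about how this could work, not a description of Proxi's implementation; `SAFE_COMMANDS` and `parse_message` are hypothetical names, and the allowlist contents are illustrative.

```python
import shlex
from typing import List, Optional, Tuple

# Hypothetical allowlist of read-only commands, for illustration only.
SAFE_COMMANDS = {"pwd", "ls", "whoami", "date"}


def parse_message(message: str) -> Tuple[str, Optional[List[str]]]:
    """Route a user message.

    Returns ("direct", argv) for an allowlisted "!" command,
    ("rejected", argv) for any other "!" command,
    and ("feedback", None) for ordinary text meant for the planner.
    """
    text = message.strip()
    if not text.startswith("!"):
        return ("feedback", None)
    try:
        argv = shlex.split(text[1:])
    except ValueError:
        # Unbalanced quotes and similar input cannot be parsed safely.
        return ("rejected", [])
    if argv and argv[0] in SAFE_COMMANDS:
        return ("direct", argv)
    return ("rejected", argv)
```

With this routing, "!pwd" is run directly, "!rm -rf /" is rejected, and a message without the "!" prefix is treated as feedback for the plan.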

Verified Execution

  • Command outputs and screenshots as evidence
  • Visual confirmation returned to the user
  • No “agent said it worked”
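
The idea of evidence-gated completion can be sketched as follows. This is a minimal illustration under assumptions: `Evidence`, `Task`, and `run_with_evidence` are hypothetical names, and screenshot capture is left as an optional field rather than implemented.

```python
import subprocess
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evidence:
    command: List[str]
    exit_code: int
    stdout: str
    stderr: str
    screenshot_path: Optional[str] = None  # filled in by a capture step, if any


@dataclass
class Task:
    description: str
    evidence: List[Evidence] = field(default_factory=list)

    def is_complete(self) -> bool:
        # Complete only when evidence exists and every recorded step exited cleanly.
        return bool(self.evidence) and all(e.exit_code == 0 for e in self.evidence)


def run_with_evidence(task: Task, command: List[str]) -> Evidence:
    """Run a command and attach its real output to the task as evidence."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    record = Evidence(command, result.returncode, result.stdout, result.stderr)
    task.evidence.append(record)
    return record
```

A task with no evidence is never complete in this sketch, so a model's claim of success has no effect on task state.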

Safety & Control

  • Safe actions: auto-allowed
  • Sensitive actions: require human approval
  • Destructive actions: permanently blocked
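
The three tiers above can be sketched as a small policy check. The rule sets and the names `Decision`, `classify`, and `may_run` are assumptions made for illustration; matching on the first token of a command is deliberately naive, and a real policy would need to inspect arguments, pipes, and wrappers.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"              # safe: runs without asking
    NEEDS_APPROVAL = "approval"  # sensitive: waits for a human
    BLOCK = "block"              # destructive: never runs


# Hypothetical rule sets, for illustration only.
DESTRUCTIVE = {"rm", "mkfs", "dd", "shutdown"}
SAFE = {"pwd", "ls", "whoami", "date"}


def classify(command: str) -> Decision:
    """Map a shell command to a policy tier by its first token."""
    tokens = command.strip().split()
    if not tokens:
        return Decision.BLOCK  # nothing to run
    program = tokens[0]
    if program in DESTRUCTIVE:
        return Decision.BLOCK
    if program in SAFE:
        return Decision.ALLOW
    # Anything not explicitly known is treated as sensitive.
    return Decision.NEEDS_APPROVAL


def may_run(command: str, approved_by_human: bool = False) -> bool:
    """Return True only if policy permits the command to execute now."""
    decision = classify(command)
    if decision is Decision.BLOCK:
        return False  # human approval cannot override a block
    if decision is Decision.ALLOW:
        return True
    return approved_by_human
```

In this sketch, may_run("rm -rf /", approved_by_human=True) still returns False, which is what makes the block tier non-bypassable, while unknown commands default to requiring approval rather than running.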

Proxi never decides success. Reality does.


Challenges We Faced

Preventing hallucinated success
Many agents report completion even when execution fails. We addressed this by requiring evidence (command output or screenshots) before a task is marked complete.

Balancing autonomy and safety
Instead of asking users to approve every click, we require approval only for impactful actions.

Operating under real OS constraints
Desktop automation requires active sessions. Proxi detects locked states and adapts instead of breaking.

Reliability over flash
We prioritized correctness, transparency, and recovery over speed or visual polish.


What We Learned

  • Execution without verification cannot be trusted
  • Humans trust proof, not explanations
  • AI agents must adapt to constraints, not ignore them
  • Gemini excels at multimodal reasoning, not just text generation

Why Gemini 3 Matters

This project would not be possible without Gemini’s multimodal reasoning capabilities.

Gemini enables Proxi to:

  • Translate user intent into system tasks
  • Navigate unfamiliar desktop UIs
  • Explore multiple paths dynamically
  • Detect and recover from failures
  • Reason across visual context instead of relying on brittle automation

Proxi demonstrates Gemini as an execution brain, not merely a conversational model.


What’s Next

Proxi is intentionally focused on trustworthy execution first.

Multi-System Orchestration

Coordinated execution across multiple systems while preserving verification and safety.

UI-Level Safety Policies

Extending safety checks and approvals to high-impact UI actions (e.g., form submissions, record changes, email sending) without introducing per-click friction.

Secure Execution on Locked or Unattended Desktops

Enabling safe, policy-controlled execution when desktops are locked or unattended, while maintaining verification, auditability, and human control.

API + UI Hybrid Execution

Dynamically choosing the safest execution path when APIs exist—or falling back to UI execution when they don’t.

Unscriptable Workflow Support

Support for workflows that are non-deterministic, UI-driven, exception-heavy, and context-dependent.

Vertical Specialization

Desktop & IT Support

  • Guided troubleshooting
  • Evidence-backed diagnostics
  • Safe remediation under approval

Security Operations

  • Incident triage
  • Evidence collection
  • Non-destructive containment
  • Human-in-the-loop enforcement

Verifier & Audit Layers

Independent verification, audit trails, and replay for compliance-sensitive environments.

Proxi’s roadmap is not about making agents more autonomous.
It is about making execution reliable, provable, and governable.
