Inspiration
Two weeks ago, someone’s OpenClaw incurred a $3k charge on his credit card after he gave it full access to his computer. Six months ago, a Replit agent deleted an entire company’s database and then proceeded to lie about it. Stories like these (prompt injection attacks, data leaks, misplaced authority) made us realize that handing your own machine to an AI is probably a bad idea.
Naturally, we wondered: what if each agent didn’t have to run locally? We decided to give each agent its own secure, sandboxed computing environment via cloud VMs. This adds a hard security boundary and a way to operate with sudo access in an isolated environment.
The next step was to make them run in parallel via an orchestrator agent. We could then break down complex tasks into independent actions that each agent could run concurrently.
The result was advanced task management with full visibility into every click, keystroke, and window. Spawn an entire task force of agents equipped with their own computers at will.
What it does
Opticon lets users submit a prompt that orchestrates multiple agents, each with access to its own VM. The platform’s orchestrator automatically breaks the prompt into independent subtasks and assigns each one to an AI agent. Every agent gets its own full virtual machine with a browser, file system, and internet access, and can click, type, scroll, and navigate the computer the way a human would.
All of these tasks happen simultaneously, so while one agent researches in Chrome, another writes a document and a third pulls API data. Users watch everything from a single dashboard: every agent’s screen in real time, a sidebar of all agent activity, and a shared whiteboard space where agents post findings and coordinate actions with each other.
From end to end:
- User submits a prompt
- K2 Think reasons over the prompt, decomposes it into several independent subtasks, and distributes them across worker subagents (see the sketch after this list)
- Each subagent spins up its own VM using E2B, up to four Linux cloud desktops
- We stream each machine's live feed to our web app using E2B, while also enabling real-time user interaction
- Agents coordinate through a shared whiteboard that doubles as a simple interprocess communication channel
- Sessions are fully recorded for transparency, visibility, and reliability, and saved to a Neon PostgreSQL database
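The decomposition step is easiest to picture as a structured plan. Here is a minimal sketch, assuming a JSON shape like the one below; the field names are illustrative, not our exact schema:

```python
# Hypothetical plan format the orchestrator asks K2 Think to return; field names
# are illustrative, not the exact schema we use.
import json

plan = json.loads("""
{
  "subtasks": [
    {"id": 1, "goal": "Research competitor pricing in Chrome", "depends_on": []},
    {"id": 2, "goal": "Draft a summary document", "depends_on": []},
    {"id": 3, "goal": "Pull usage metrics from the public API", "depends_on": []}
  ]
}
""")

# Subtasks with no dependencies can be dispatched to workers immediately,
# which is what lets agents run in parallel instead of waiting on each other.
ready = [t for t in plan["subtasks"] if not t["depends_on"]]
for task in ready:
    print(f"assign subtask {task['id']} to a worker: {task['goal']}")
```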
How we built it
Dedalus Custom Computer-Use Module: We rebuilt the computer-use framework to work with Dedalus via smart tool calling. Each agent runs an observation-based loop: it captures the VM's screen, reasons about its next step relative to its task, performs a tool call (click(x, y), scroll(y), etc.), and repeats.
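A minimal sketch of that loop, assuming an OpenAI-compatible chat client and a thin desktop wrapper around the E2B sandbox; the method names (screenshot, left_click, write) and the one-JSON-action-per-turn protocol are illustrative, not the exact Dedalus or E2B calls:

```python
# Sketch of the observe-think-act loop; `client` is an OpenAI-compatible chat
# client, `desktop` wraps the E2B sandbox. Names are illustrative, not exact SDK calls.
import base64
import json

def run_agent(client, desktop, task: str, model: str, max_steps: int = 40):
    messages = [{"role": "system",
                 "content": f"You control a Linux desktop. Task: {task}. "
                            "Reply with exactly one JSON action per turn."}]
    for _ in range(max_steps):
        # Observe: capture the VM screen and attach it as an image content block,
        # so the model actually sees it (see the fix under Challenges).
        b64 = base64.b64encode(desktop.screenshot()).decode()
        messages.append({"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}]})
        # Think: ask the model for the next action.
        reply = client.chat.completions.create(model=model, messages=messages)
        text = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": text})
        # Act: translate the chosen action into a desktop tool call, then loop.
        action = json.loads(text)
        if action["name"] == "click":
            desktop.left_click(action["x"], action["y"])
        elif action["name"] == "type":
            desktop.write(action["text"])
        elif action["name"] == "done":
            break
```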
Our architecture consists of two parts:
- Next.js handles the frontend, API routes, Socket.io server, and orchestration. It uses a custom server.ts with Socket.io wired into the HTTP server.
- Python agent workers are spawned as separate processes by the backend. Each worker boots an E2B cloud desktop sandbox and runs the vision loop.
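As a rough sketch of the worker boot step, assuming the e2b_desktop Python SDK's Sandbox and stream helpers (exact call names may differ between SDK versions):

```python
# Rough shape of a worker boot; Sandbox/stream names follow the e2b_desktop
# Python SDK but may differ by version, so treat this as a sketch.
from e2b_desktop import Sandbox

def boot_worker():
    desktop = Sandbox()                # isolated Linux cloud desktop for this agent
    desktop.stream.start()             # built-in live stream of the desktop
    stream_url = desktop.stream.get_url()
    # The stream URL is reported back to the Next.js backend over Socket.io so the
    # dashboard can embed the live view; the vision loop then takes over the sandbox.
    return desktop, stream_url
```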
Some of the tech we used:
- K2 Think for core LLM orchestration, using advanced reasoning to break down tasks
- E2B Desktop SDK for cloud Linux sandboxes with built-in streaming
- Dedalus Labs API for spawning subagents
- Socket.io for all real-time communication (frontend <-> backend <-> Python workers)
- Next.js 16 App Router with Tailwind CSS and shadcn/ui for the frontend
- Neon PostgreSQL for session persistence
- Flowglad for billing and subscription management
- NextAuth with Google OAuth for authentication
Challenges we ran into
Our first challenge was getting the agent to actually see what is going on in the VM. We first used the DedalusRunner high-level agent loop, which stringifies all tool results, including screenshots. This meant screenshots reached the model as raw base64 text instead of actual images, which it could not interpret visually. We had to drop down to the Dedalus API directly and build a custom agentic loop where screenshots are injected as “image_url” content blocks in user messages.
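Roughly, the fix looks like this; the helper name and prompt text are ours, not part of any SDK:

```python
# Wrap the raw screenshot as an image_url content block in a user message so the
# model receives an image, not stringified base64 text from the tool-result path.
import base64

def screenshot_as_user_message(png_bytes: bytes) -> dict:
    b64 = base64.b64encode(png_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": "Current screen after your last action:"},
            # An image block a vision model can interpret; the same bytes returned
            # as a stringified tool result would just be unreadable text.
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }
```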
We initially built our orchestration on GPT-4.1, but found it lackluster at breaking complex tasks into independent ones that multiple agents could run in parallel, which meant agents had to wait serially for others to finish before starting their own task. We pivoted to K2 Think for its advanced reasoning, which cleanly decomposes tasks into structured, independent actions for every agent.
Accomplishments that we're proud of
The vision loop architecture: Building a custom observe-think-act loop that sends screenshots as actual images through the Dedalus API was non-obvious and required deep research into how the SDK handles tool results vs. user messages.
Real-time interaction: Streaming via E2B lets the user interact with the VM while the agent is working, which is perfect for quick redirects or manual speed-ups.
Replay sessions: We capture a screenshot at every tool call and save it to a buffer that is converted into a timelapse at the end of the session.
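A minimal sketch of that buffer-to-timelapse step, rendered here as a GIF with Pillow for simplicity (the output format and paths are illustrative):

```python
# Sketch of the replay capture: buffer one frame per tool call, then write a
# timelapse at session end. Pillow GIF output is used here for simplicity.
import io
from PIL import Image

frames: list[Image.Image] = []

def record_frame(png_bytes: bytes) -> None:
    frames.append(Image.open(io.BytesIO(png_bytes)).convert("RGB"))

def save_timelapse(path: str = "session_replay.gif", ms_per_frame: int = 400) -> None:
    if frames:
        frames[0].save(path, save_all=True, append_images=frames[1:],
                       duration=ms_per_frame, loop=0)
```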
Whiteboard coordination: Socket.io captures every agent state change, and agents collaborate through a shared whiteboard space, posting findings, intermediate results, and conclusions so they can coordinate with each other and execute tasks efficiently.
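From a worker's side, posting to the whiteboard is just an event emit; a sketch assuming a python-socketio client and an event name we define ourselves:

```python
# Illustrative worker-side whiteboard post; the "whiteboard:post" event name is
# ours, handled by the Socket.io server inside the Next.js custom server.
import socketio

sio = socketio.Client()
sio.connect("http://localhost:3000")   # the custom server.ts Socket.io endpoint

def post_finding(agent_id: str, text: str) -> None:
    # Findings land on the shared whiteboard where other agents and the user read them.
    sio.emit("whiteboard:post", {"agent": agent_id, "text": text})
```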
Push-based task assignment: To avoid race conditions, the backend pushes tasks to agents rather than having each agent poll tasks from a single source.
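Seen from the worker, push-based assignment is just an event handler instead of a polling loop; a sketch with a hypothetical task:assign event:

```python
# Push-based assignment from the worker's point of view; "task:assign" is a
# hypothetical event name. The backend emits each task to exactly one worker,
# so no two workers can race on the same queue entry.
import socketio

sio = socketio.Client()

def run_task(task_id: str, goal: str) -> None:
    print(f"starting {task_id}: {goal}")   # placeholder for the vision loop

@sio.on("task:assign")
def on_task(data):
    # Only this worker receives the event, so it starts immediately,
    # with no polling and no shared task list to lock.
    run_task(data["task_id"], data["goal"])

sio.connect("http://localhost:3000")
sio.wait()
```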
What we learned
Multi-agent coordination is hard. Even with a shared whiteboard file, getting agents to meaningfully build on each other's work requires well-structured architecture and careful prompting.
Agent SDKs have tradeoffs. Dedalus gives us flexibility to swap between Claude, OpenAI, and other models, but it also means we can't use provider-specific features like Anthropic's native computer_use tool type, a capability that we had to build ourselves.
What's next for Opticon
- Agent-to-agent communication beyond the whiteboard: agents should be able to hand off URLs, file paths, credentials, etc. directly to each other
- Smarter orchestration logic: dynamic replanning throughout the session, so what agents discover mid-task lets the orchestrator adapt the task breakdown
- Improve our reconnect logic: right now, when an agent goes down it is difficult to recover
- Currently local-only; the next step is a hosted deployment
Built With
- dedalus
- e2b
- flowglad
- k2
- neon
- next.js
- nextauth
- postgresql
- socket.io
