Inspiration
More than 1.8 million business owners with disabilities in the U.S. often must overcome unique barriers to entrepreneurship as they strive to compete alongside their non-disabled counterparts in a business world that does not fully recognize their abilities and resilience.
I personally don't put much stock in the labels "ability" and "disability": people who are blind or "disabled" simply need a curated tool, or pool of tools, to enable themselves to do anything. I would not call someone "disabled" just because they cannot drive a nail into a wall without a hammer. People in general need tools to do tasks.
"Disabled" people, and blind people specifically, just need more customized tools. That's why I developed Agent Unblind: it uses Text-to-Speech and Speech-to-Text to interact with a tool-calling agent that uses an ASI LLM as its reasoning brain and invokes a chain of tools to complete tasks.
I built customized tools for signing up with credentials, logging in, fetching my customer list, emailing a customer, generating an invoice, and sending an email with an invoice attached. The agent can also open the camera, take a photo, and analyze it.
I also built a theft-detection system with Free-WILi: accelerometer data is streamed into our dashboard app through WASM (Rust -> wasm-bindgen) and visualized in real time as x, y, and z acceleration.
What it does
The agent handles sending emails to customers, generating invoices, and taking and analyzing a photo along with a prompt. It also includes an automatic photo-capture system that snaps a photo of a suspected thief when motion is detected.
Detects motion in a local WASM web app (port 7001) and snaps a photo to media/.
Analyzes the most recent frame with Gemini to describe clothing, visible traits, and scene context (“1 person; black hoodie; white sneakers; backpack; near the register”).
Speaks alerts using ElevenLabs TTS (“Motion detected. Theft alert!”).
Agentic workflows via ASI: you can say “sign me up”, “log in”, “add a customer”, “email the incident”, or “analyze the latest frame” and the agent calls our FastAPI tools automatically.
Customer & auth layer with MongoDB: sign up, log in (JWT), then create/list your customers—scoped per user.
Utilities: generate PDF invoices, send emails with attachments, run an ARM SME2 static analyzer (demo of domain tools).
How we built it
Architecture
Frontends
localhost:7001 (Flask/WASM) — motion sensing, live capture, media hosting, and a /capture_now endpoint.
“Agent Un-blind” UI — a simple Bootstrap page with mic capture, uploads audio to /transcribe_and_run, shows transcript + agent reply (and plays TTS).
Services
localhost:8000 (FastAPI “Tools API”) — endpoints for:
/tools/signup, /tools/login, /tools/get_myid (JWT → user scope)
/tools/put_customer, /tools/get_customer
/tools/make_invoice, /tools/send_email, /tools/send_invoice_email
/tools/camera_analyze (webcam → Gemini)
/tools/analyze_thief_most_recent (latest snapshot → Gemini)
/tools/arm_sme2/analyze (mounted SME2 analyzer)
JWT auth: HS256 with iss/aud claims, plus a helper that derives user_id from the Bearer token (a minimal sketch follows this list).
MongoDB: users (bcrypt hashes, unique indexes), customers (scoped by user_id + email).
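To make the JWT scoping concrete, here is a minimal sketch of one scoped tool endpoint. The helper name, claim layout, and request/response shapes are illustrative rather than our exact code; it assumes FastAPI, PyJWT, and pymongo.

# Hypothetical sketch of a JWT-scoped tool endpoint (names and shapes are illustrative).
import os
import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from pymongo import MongoClient

app = FastAPI()
bearer = HTTPBearer()
db = MongoClient(os.environ["MONGO_URI"])["agent_unblind"]

def get_current_user_id(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> str:
    """Decode the HS256 JWT (checking iss/aud) and return the user-id claim."""
    try:
        payload = jwt.decode(
            creds.credentials,
            os.environ["JWT_SECRET"],
            algorithms=["HS256"],
            issuer="agent-unblind",   # assumed issuer value
            audience="tools-api",     # assumed audience value
        )
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid token")
    return payload["sub"]             # assumed user-id claim

@app.get("/tools/get_customer")
def get_customer(user_id: str = Depends(get_current_user_id)):
    # Customers are scoped per user: only documents owned by this user_id come back.
    docs = db.customers.find({"user_id": user_id}, {"_id": 0})
    return {"customers": list(docs)}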
Agent layer
agent/asi_toolcaller.py — sends chat + OpenAI-style tool schema to ASI, executes returned tool calls against :8000, persists JWT in session, auto-injects JWT for scoped tools, exposes a shim on :9100 for the voice UI.
Uses uAgents to run locally on port :9200, with Agentverse inspector enabled.
AI integrations
Gemini (google-genai) for image understanding (JSON schema when possible).
ElevenLabs for STT (scribe_v1) and TTS (multilingual v2).
Port map:
5000 — Our main "Agent Unblind" web app
7001 — Free-WILi WASM dashboard app
8000 — Tools API (FastAPI)
9200 — uAgents internal server
9100 — Local “shim” HTTP (optional)
A tiny FastAPI endpoint we added (POST /run_task) so you can trigger the agent's ASI tool-calling loop via curl without wiring up agent-to-agent messages. It is just a dev convenience; you can change or disable it.
Use of uAgents
Runs a local agent (asi-toolcaller) that:
Maintains conversation state and tool history.
Stores JWT + user_id after login and auto-injects for later scoped calls (get_customer, put_customer, analyze_thief_most_recent).
Exposes a message handler and a lightweight HTTP shim for the voice UI.
Clean separation of concerns: the agent focuses on policy + tool orchestration, while the FastAPI service focuses on capabilities.
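A minimal sketch of that agent skeleton, assuming the uAgents Python SDK; the Message model, seed, and handler body are illustrative (the real agent also keeps conversation/tool history and runs the :9100 shim):

# Hypothetical minimal uAgents agent (illustrative; not our exact code).
from uagents import Agent, Context, Model

class Message(Model):
    text: str

agent = Agent(name="asi-toolcaller", seed="asi-toolcaller-seed", port=9200,
              endpoint=["http://127.0.0.1:9200/submit"])

@agent.on_message(model=Message)
async def handle(ctx: Context, sender: str, msg: Message):
    # In the real agent this is where the ASI tool-calling loop runs and the
    # tool results are summarized back to the user.
    ctx.logger.info(f"Received: {msg.text}")
    await ctx.send(sender, Message(text=f"Handled: {msg.text}"))

if __name__ == "__main__":
    agent.run()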
Use of Agentverse
The local uAgents agent runs with the Agentverse inspector enabled, so it can be inspected and monitored from Agentverse while it handles requests.
Use of ASI
We call https://api.asi1.ai/v1/chat/completions with our tools schema (OpenAI-style) and tool_choice="auto".
The model: asi1-mini
Interprets natural language requests (“Sign me up”, “Log in”, “Analyze the latest scene”).
Returns structured tool calls; we execute them and feed results back as role:"tool" turns.
We added missing tools (signup, login, get_myid, put_customer, get_customer, analyze_thief_most_recent) to the schema so ASI reliably invokes them.
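Roughly, the tool-calling loop looks like the sketch below. The response fields (choices, tool_calls, function.arguments) follow the OpenAI-style convention ASI exposes, and run_tool is a hypothetical helper that POSTs to our Tools API on :8000; exact routes and payloads may differ.

# Hypothetical sketch of the ASI tool-calling loop (OpenAI-style schema assumed).
import json
import os
import requests

ASI_URL = "https://api.asi1.ai/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['ASI_API_KEY']}"}

def run_tool(name: str, args: dict) -> dict:
    """Hypothetical helper: execute one tool call against the local Tools API."""
    resp = requests.post(f"http://127.0.0.1:8000/tools/{name}", json=args, timeout=60)
    return resp.json()

def chat(messages: list, tools: list) -> str:
    """Keep calling ASI until it answers in plain text instead of requesting tools."""
    while True:
        body = {"model": "asi1-mini", "messages": messages,
                "tools": tools, "tool_choice": "auto"}
        resp = requests.post(ASI_URL, headers=HEADERS, json=body, timeout=60)
        msg = resp.json()["choices"][0]["message"]
        messages.append(msg)
        if not msg.get("tool_calls"):
            return msg["content"]                 # final natural-language answer
        for call in msg["tool_calls"]:            # execute each requested tool
            result = run_tool(call["function"]["name"],
                              json.loads(call["function"]["arguments"]))
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": json.dumps(result)})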
Use of Free-WILi
Motion Sensor + WASM Dashboard + Voice Alert Overview
I built a browser-based theft-detection dashboard that reads live accelerometer data from a FREE-WILi device over Web Serial, processes the signal in a Rust wasm-bindgen module for fast/consistent detection, visualizes the data (m/s² + EMA), and triggers an ElevenLabs TTS alert that says: “Motion detected, theft alert!” whenever movement crosses a threshold.
Click the image to see a demo of the theft-prevention system powered by Free-WILi and WASM.
Architecture (end-to-end)
FREE-WILi streams raw accelerometer counts (x, y, z) and the current LSB-per-g scale over serial USB.
Browser (Web Serial API) reads those lines in real time.
Rust → WebAssembly (wasm-bindgen) converts counts → m/s², maintains an Exponential Moving Average (EMA) of |a|, and raises an alarm when |a| deviates from 1g by ≥ threshold.
Dashboard (HTML/JS) plots ax, ay, az, |a|, EMA, updates status badges, and logs events.
ElevenLabs voice alert fires via a small /speak-motion endpoint (so the API key stays server-side) and plays back in the browser.
Motion detection logic (why it works)
Gravity at rest is ~1 g. Any significant motion changes the magnitude |a| = √(ax² + ay² + az²).
Convert counts → g using the device’s LSB per g, then g → m/s² by multiplying 9.80665.
Smooth with EMA to reduce noise but keep responsiveness:
ema(t) = α * |a|(t) + (1-α) * ema(t-1)
Alarm if | |a| - 9.80665 | ≥ thresh (e.g., 2.0 m/s²).
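For reference, here is the same detection math as a small Python sketch; the hot loop actually lives in the Rust/WASM Detector, and the smoothing factor and threshold below are illustrative defaults:

# Python sketch of the detection math implemented in the Rust/WASM Detector.
import math

G = 9.80665          # m/s^2 per g
ALPHA = 0.2          # EMA smoothing factor (illustrative)
THRESH = 2.0         # alarm threshold in m/s^2 (illustrative)

ema = G              # start the EMA at rest (~1 g)

def process_counts(x: int, y: int, z: int, lsb_per_g: float):
    """Convert raw counts to m/s^2, update the EMA of |a|, and decide on alarm."""
    global ema
    ax, ay, az = (c / lsb_per_g * G for c in (x, y, z))
    a_mag = math.sqrt(ax * ax + ay * ay + az * az)
    ema = ALPHA * a_mag + (1 - ALPHA) * ema
    alarm = abs(a_mag - G) >= THRESH
    return a_mag, ema, alarm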
Rust/WASM core
Rust gives deterministic math and great performance in the hot loop.
Build:
cargo install wasm-pack
cd accel_wasm
wasm-pack build --release --target web -d web/pkg

Run:
cd accel_wasm
python app.py --port 7001 --web-root web
Browser data pipeline
User clicks “Connect” → Web Serial prompts for the FREE-WILi port.
We read newline-delimited frames like: x,y,z,lsb_per_g.
Feed each frame into Detector.process_counts(...).
Update the chart + status; if alarm=true, call /speak-motion?text=Motion%20detected%2C%20theft%20alert!.
Use of ElevenLabs
The theft-detection voice alert is done with ElevenLabs, and the Text-to-Speech and Speech-to-Text in the main Agent Unblind app are also done with ElevenLabs.
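A minimal sketch of the server-side TTS call behind /speak-motion, using the ElevenLabs REST API directly; the voice ID is a placeholder, and in the real app this sits inside a small Flask endpoint so the API key never reaches the browser:

# Hypothetical sketch of the server-side TTS call (placeholder voice ID).
import os
import requests

def speak(text: str, voice_id: str = "YOUR_VOICE_ID") -> bytes:
    """Return MP3 audio for `text` using the eleven_multilingual_v2 model."""
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content  # audio bytes, streamed back to the browser

# Example: speak("Motion detected. Theft alert!")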
Use of ARM

SME2 static analyzer: We built a FastAPI microservice that compiles tiny AArch64 kernels with LLVM clang (Armv9-A target) and evaluates them with llvm-mca to get IPC and block throughput. It compares a naïve GEMM baseline against an SME/SME2 streaming-mode variant, returning a JSON report plus optional raw MCA output for deep dives.
Neoverse/Cortex targeting: The service takes -mcpu (e.g., neoverse-v2, neoverse-n2, cortex-x4) so we can see how the same kernel schedules across Arm cores. This helped us reason about expected gains and bottlenecks before touching real hardware.
Armv9-A & SVE2 path: Alongside the baseline and SME2 stub, we added an autovectorized SVE2 variant (using -march=armv9-a+sve2 and optional -msve-vector-bits=) to compare baseline vs SVE2 vs SME2 side-by-side.
Streaming mode + ZA plumbing: We explicitly enter/exit SME streaming mode and toggle ZA (smstart sm/za … smstop za/sm) to validate state handling and ensure our future SME2 micro-kernels will model correctly.
Developer workflow on macOS/Linux: We used Homebrew LLVM (clang/llvm-mca 21.x) with -target aarch64-linux-gnu for cross-assembly generation, making it easy to test Arm scheduling locally without an Arm board.
Tooling as an agent tool: The analyzer is exposed as /tools/arm_sme2/analyze and our uAgents agent can call it in chains (e.g., “run SME2 analyzer mcpu=neoverse-v2 n=256”), turning Arm perf exploration into a composable tool inside the broader agent system.
Actionable output: The endpoint returns:
ipc, block_rthroughput, and (optionally) top resource pressures
A computed predicted speedup vs. baseline
Optional raw llvm-mca text for profiling/port pressure inspection
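Under the hood the analyzer boils down to two subprocess calls, roughly as in this simplified sketch; the file names and the report parsing are illustrative, not our exact implementation:

# Simplified sketch of the analyzer's compile-then-analyze pipeline.
import re
import subprocess

def analyze_kernel(src: str, mcpu: str = "neoverse-v2",
                   march: str = "armv9-a+sve2") -> dict:
    """Compile a C kernel to AArch64 assembly and run llvm-mca on it."""
    subprocess.run(
        ["clang", "-target", "aarch64-linux-gnu", f"-march={march}",
         "-O3", "-S", src, "-o", "kernel.s"],
        check=True,
    )
    mca = subprocess.run(
        ["llvm-mca", f"-mcpu={mcpu}", "kernel.s"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Pull IPC and block reciprocal throughput out of the llvm-mca summary.
    ipc = float(re.search(r"IPC:\s+([\d.]+)", mca).group(1))
    rthroughput = float(re.search(r"Block RThroughput:\s+([\d.]+)", mca).group(1))
    return {"ipc": ipc, "block_rthroughput": rthroughput, "raw": mca}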
Example call:
curl -s -X POST http://127.0.0.1:8000/tools/arm_sme2/analyze \
  -H "Content-Type: application/json" \
  -d '{"n":256,"mcpu":"neoverse-v2","march":"armv9-a+sve2","include_raw_mca":false}' \
  | python -m json.tool
TLDR
This tool is comparing different implementations of GEMM (General Matrix Multiply):
Baseline (naïve C code) — simple nested loops, scalar multiply-add.
SVE2 variant — auto-vectorized or hand-tuned for scalable vector extensions.
SME/SME2 variant — uses Arm’s Streaming Matrix Extension, which provides a 2D register file (ZA) and streaming mode for block-matrix math.
This lets you see how much faster SME2 can theoretically execute GEMM compared to the baseline or SVE2 versions.
This tool can be called from our agent.

Use of Gemini
We built a tool using Gemini 2.5 that takes a prompt along with an image and returns Gemini's response. When the user's prompt asks for image analysis, the agent calls this tool, which opens the camera, takes a photo, converts it to base64, and sends it to Gemini.
We also built a thief_analyzer tool that uses the most recent photo captured when the Free-WILi sensor detected motion/acceleration. The Free-WILi-powered theft detection promptly opens the camera and takes a photo when it detects motion; thief_analyzer then sends that most recent photo to Gemini for analysis, so you can give a prompt such as "describe the person in the thief photo and his physical appearance".
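A sketch of the Gemini call inside these tools, using the google-genai client; the model name, file path, and prompt are illustrative, and the real tools also request a JSON response schema when possible:

# Hypothetical sketch of the image-analysis tool (google-genai client).
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def analyze_image(image_path: str, prompt: str) -> str:
    """Send an image plus a prompt to Gemini and return the text answer."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            prompt,
        ],
    )
    return response.text

# Example: analyze_image("media/latest.jpg",
#                        "Describe the person in the thief photo and their appearance.")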
Use of Gemini 2.5 Computer Use

MODEL = "gemini-2.5-computer-use-preview-10-2025"
🧠 What We Built
We experimented with Google Gemini 2.5 Computer Use, a new agentic model that can literally see and control user interfaces. Our goal was to let Gemini autonomously interact with our FREE-WILi Theft Detection WASM Dashboard (served at http://localhost:7001/).
In short:
🧩 We wanted Gemini 2.5 Computer Use to open the dashboard, click “Connect Device”, select the correct g-range, read live motion data, and trigger a voice alert via ElevenLabs when motion was detected — all automatically.
⚙️ How It Was Set Up
We wrote two Python entry points:
main.py
A full Computer Use agent loop built from Google’s reference code. It used:
google-genai client to talk to the model gemini-2.5-computer-use-preview-10-2025
Playwright to actually perform the clicks, typing, and screenshot capture
Flask backend serving the local dashboard and /speak-motion endpoint (for TTS alerts)
Loop flow:
Send Gemini a goal prompt and a screenshot of the web UI.
Receive the model’s function_call (e.g., click_at, type_text_at).
Execute those actions in Playwright.
Capture a new screenshot and feed it back to the model.
Repeat until Gemini completes the task.
Essentially, main.py let Gemini become a test engineer for our IoT dashboard.
tiny.py
A minimal proof-of-concept version that runs a single iteration — the smallest possible working example of Computer Use. It only asked Gemini to:
“Open http://localhost:7001/ and click Connect Device.”
No complex safety or confirmation handling, just one round-trip to show the API flow.
🚧 What Happened
The code executed correctly up to the API call.
But our Google AI Studio project had no free quota for the Computer Use Preview model, so the API returned 429 RESOURCE_EXHAUSTED.
Because of that, Gemini never generated real UI actions — we could only test the Playwright execution layer locally.
🧩 Fallback Solution
While waiting for access, we built a local simulation using Playwright only (no API). It clicks the same buttons and reads the same DOM elements, mimicking what Gemini 2.5 Computer Use would do once quota is enabled.
💡 Summary
We integrated Google’s new Gemini 2.5 Computer Use model to let an AI agent control our FREE-WILi WebAssembly motion-sensor dashboard in real time. Using Python (Flask + Playwright) and Gemini’s Computer Use API, the agent was designed to open the dashboard, connect the sensor, monitor live motion data, and trigger ElevenLabs voice alerts. Although our current project ran into preview-quota limits, the full loop demonstrates how Gemini can autonomously interact with real web interfaces — an important step toward hands-free IoT control agents.
🧩 What We Built with auto_connect.py
We developed an automation script using Playwright to demonstrate the same capabilities that the Gemini 2.5 Computer Use API provides — but running locally and for free.
Our goal was to show how an intelligent agent could control a real browser to interact with our FREE-WILi WebAssembly Motion Sensor Dashboard, a live page served at http://localhost:7001/.
⚙️ What the Script Does
auto_connect.py automatically:
Launches Google Chrome in a controlled environment
Uses a persistent Chrome profile so that WebSerial permissions (for the hardware sensor) are remembered across sessions.
This simulates the secure sandboxed environment required by Gemini Computer Use.
Opens the local dashboard (http://localhost:7001/)
The same page where the motion sensor streams real-time data from the FREE-WILi board.
Clicks the “Connect Device” button automatically
The first time you run it, you manually select the serial port (DisplayCPU v62) once.
After that, the permission is remembered and the script connects without showing the Chrome serial-port chooser again.
Selects g_range = 16384 (± 2 g) from the dropdown
Ensures the accelerometer scale is correct for motion detection.
Reads the live sensor status directly from the page DOM
Prints values such as:
Connection Status
Motion Status
EMA (m/s²)
|a| (m/s²)
X/Y/Z (m/s²)
This verifies that the browser and the sensor are communicating correctly.
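A condensed sketch of what auto_connect.py does; the element selectors are illustrative and may not match the dashboard's real IDs:

# Condensed sketch of auto_connect.py (selectors are illustrative).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Persistent profile so the WebSerial permission for the FREE-WILi is remembered.
    ctx = p.chromium.launch_persistent_context(
        "./chrome-profile", channel="chrome", headless=False
    )
    page = ctx.new_page()
    page.goto("http://localhost:7001/")

    # Click "Connect Device"; the very first run still needs a manual port pick.
    page.get_by_role("button", name="Connect Device").click()

    # Select the ±2 g range (value 16384), assuming a <select id="g_range">.
    page.select_option("#g_range", "16384")

    # Read live status straight from the DOM and print it.
    for label, selector in [("Connection", "#conn-status"),
                            ("Motion", "#motion-status"),
                            ("EMA (m/s²)", "#ema"),
                            ("|a| (m/s²)", "#amag")]:
        print(label, page.text_content(selector))

    ctx.close()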
🧠 What We Demonstrated
This script simulates Gemini 2.5 Computer Use locally:
Instead of the Gemini API issuing click_at or type_text_at actions, Playwright performs those actions on a real Chrome browser.
We showed how an autonomous agent can see, act, and read back from a GUI — the same closed-loop logic described in Google’s official Computer Use documentation.
Challenges we ran into
One struggle was the lack of quota for Gemini 2.5 Computer Use on the Gemini free tier. I implemented code using "gemini-2.5-computer-use-preview-10-2025" to auto-press the "Connect Device" button of the theft-detection WASM dashboard app, since clicking would be burdensome for our target users, who are blind.
We provided an alternative using Playwright to automate the button click, but because Playwright does not drive the user's native Chrome, it could only auto-click the button; it could not select the serial port.
We also ran out of free credit on ElevenLabs and had to create another account.
Another challenge: I had given my ESP camera to someone in need. Shortly after, I realized that my theft-detection WASM app uses my webcam and that my main Agent Unblind app also has a tool that opens the camera, takes a photo, and sends it to Gemini. The two apps cannot share the camera, so I have to shut one down before starting the other.
Accomplishments that we're proud of
We are proud to have implemented 1 of the 9 Arm learning paths that were available for our macOS setup. We are also proud to have implemented Gemini 2.5 Computer Use, since it is brand new, and to have put Free-WILi to use.
What we learned
We learned a lot, especially from the Arm learning path; we implemented 1 of the 9 learning paths that were available for our macOS setup.
What's next for Agent Unblind
We need to build more customized tools for our tool-calling agent so it can do even more to benefit blind people.
It would also be good to develop a mobile version.

