About LocalGem — Gemini Hackathon Submission
A short story: what inspired this project, what I learned, how I built it, and the challenges I faced.
About the Project
LocalGem was built both for the Gemini Hackathon and for personal use, as a single assistant that can see, reason, and act in the real world without shipping personal data to the cloud. The agent plans multi-step tasks, executes them step by step, verifies outcomes, retries when things fail, and pauses for user confirmation at trust boundaries such as logins and CAPTCHAs.
What Inspired Me
I wanted one place to offload the kind of work I do every day: quick research, “remind me in 10 minutes,” “summarise this PDF,” “open Chrome and find X,” or “what’s on my calendar today?”—without juggling multiple apps or writing one-off scripts. Projects like Moltbot showed how powerful AI feels when it actually does things instead of just chatting, and that pushed me toward building a true agent rather than a prompt wrapper.
When the Gemini Hackathon came along, it was the right opportunity to build this assistant with Gemini at the core, not as a swappable model but as the engine that handles reasoning, multimodality, tool calling, computer use, and search grounding. The goal was to see how far a single, unified model could go—one API, one mental model, and one agent that feels like a coherent system rather than a patchwork of integrations.
Telegram became the natural interface. I already use it daily, and it supports text, voice, and file uploads, which allowed the agent to stay multimodal from the very first message.
What I Learned
Agent behaviour matters more than feature count. What makes a system feel intelligent is not how many tools it has, but whether it plans, executes, verifies, and recovers. Making this behaviour explicit—and implementing step-by-step execution, verification, retries, and pausing at trust boundaries—was more important than adding new skills.
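To make that concrete, here is a stripped-down sketch of the plan → execute → verify → retry loop. It is illustrative rather than LocalGem's actual code; the Step fields and the execute/verify/confirm callbacks are placeholders for the real orchestrator pieces.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    description: str
    needs_confirmation: bool = False  # trust boundary: login, CAPTCHA, payment, ...

def run_plan(
    steps: List[Step],
    execute: Callable[[Step], str],      # perform one action (tool call, click, ...)
    verify: Callable[[Step, str], bool], # check that the action actually worked
    confirm: Callable[[Step], bool],     # ask the user before crossing a trust boundary
    max_retries: int = 2,
) -> List[str]:
    """Run steps in order, verifying each one and retrying a bounded number of times."""
    results: List[str] = []
    for step in steps:
        if step.needs_confirmation and not confirm(step):
            results.append(f"skipped (user declined): {step.description}")
            continue
        for attempt in range(max_retries + 1):
            outcome = execute(step)
            if verify(step, outcome):
                results.append(outcome)
                break
        else:
            results.append(f"gave up after {max_retries + 1} attempts: {step.description}")
    return results
```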
Unified tool design simplifies orchestration. Exposing many capabilities through a single execute_skill(skill_name, action, params) interface kept the orchestrator prompt and schema small. The model selects the skill and parameters, while the registry and permission layer handle execution. This made the system easier to extend without constantly rewriting prompts.

Multimodality changes the architecture. Because Gemini natively handles text, images, audio, screenshots, and PDFs, there is no separate speech-to-text or vision pipeline. Voice messages are transcribed by Gemini and then routed through the same orchestrator as text. This reduced complexity and made the system more consistent.
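To illustrate the unified-tool point above: the whole orchestration surface collapses into one dispatch function. The skills and handlers below are made up for the example; the real registry covers browser control, documents, scheduling, files, media generation, system control, and memory.

```python
# Toy handlers for the example; the real skill set is listed above.
SKILL_HANDLERS = {
    ("reminder", "create"): lambda p: f"reminder set for {p['when']}: {p['text']}",
    ("files", "read"): lambda p: open(p["path"], encoding="utf-8").read(),
}

def execute_skill(skill_name: str, action: str, params: dict) -> str:
    """Single entry point: the model picks skill_name/action/params, code does the rest."""
    handler = SKILL_HANDLERS.get((skill_name, action))
    if handler is None:
        return f"unknown skill or action: {skill_name}.{action}"
    return handler(params)

# The orchestrator would call this with arguments chosen by Gemini, for example:
print(execute_skill("reminder", "create", {"when": "in 10 minutes", "text": "stand up"}))
```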
Safety and guardrails are part of the product. Local-first execution, pausing at logins, logging high-impact actions, and avoiding irreversible steps unless explicitly requested are not afterthoughts. They are what make an autonomous agent acceptable to run on a personal machine and trustworthy in practice.
How I Built It
Telegram + Router + Orchestrator
The bot receives text, voice, or documents via Telegram. Voice and audio are transcribed using Gemini and treated the same as text. A lightweight router checks whether a request is a simple web lookup (for example, “stock price of X” or “research on Y”). If so, it uses Gemini with search grounding and responds directly without opening a browser. Everything else goes to the orchestrator.
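A rough sketch of that routing decision is below. The keyword patterns and the grounded-answer helper are placeholders: the real router can ask the model to classify the request, and the real helper calls Gemini with search grounding enabled.

```python
import re

def answer_with_search_grounding(query: str) -> str:
    """Placeholder for a Gemini call with search grounding enabled."""
    return f"[grounded answer for: {query}]"

# Illustrative patterns only; the real router is smarter about classification.
LOOKUP_PATTERNS = [r"\bstock price\b", r"\bresearch on\b", r"\bweather\b", r"\bwho is\b"]

def route(message: str):
    """Answer simple lookups directly; hand everything else to the orchestrator."""
    if any(re.search(p, message, re.IGNORECASE) for p in LOOKUP_PATTERNS):
        return "direct", answer_with_search_grounding(message)
    return "orchestrator", None

print(route("stock price of NVDA"))
print(route("open Chrome and find the cheapest flight to Lisbon"))
```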
Single-Tool Orchestrator
The orchestrator uses Gemini with one tool: execute_skill(skill_name, action, params). The system prompt describes all available skills and when to use them. For browser tasks, it invokes Gemini Computer Use, passing screenshots to the model and executing returned actions. For compound workflows (such as research → document creation → delivery), the orchestrator chains multiple skills across turns and carries context forward.
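The tool surface exposed to Gemini is roughly the declaration below, shown as a plain dictionary in JSON-Schema style. The exact field names and SDK types depend on the Gemini client library version, so treat this as a sketch rather than the project's literal schema.

```python
# Single function declaration the orchestrator registers with the model.
EXECUTE_SKILL_TOOL = {
    "name": "execute_skill",
    "description": "Run one action from a named skill with the given parameters.",
    "parameters": {
        "type": "object",
        "properties": {
            "skill_name": {"type": "string", "description": "e.g. browser, documents, scheduler"},
            "action": {"type": "string", "description": "action within the skill, e.g. open_url"},
            "params": {"type": "object", "description": "action-specific arguments"},
        },
        "required": ["skill_name", "action", "params"],
    },
}
```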
Skills and Registry
Each capability (browser control, document analysis, scheduling, file operations, image and video generation, system control, and memory) is implemented as a skill with a clear schema and permission level. A central registry discovers skills and builds the tool interface for the orchestrator. Sensitive skills can require explicit user approval, while a debug mode allows faster iteration.
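A simplified version of the registry and its permission check might look like the following; the Permission levels and class names are illustrative, not LocalGem's exact API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict

class Permission(Enum):
    SAFE = "safe"          # runs without prompting
    CONFIRM = "confirm"    # requires explicit user approval
    DEBUG_ONLY = "debug"   # only available when debug mode is on

@dataclass
class Skill:
    name: str
    actions: Dict[str, Callable[[dict], str]]
    permission: Permission = Permission.SAFE

class SkillRegistry:
    """Discovers skills and enforces their permission level before dispatching."""

    def __init__(self, debug: bool = False):
        self._skills: Dict[str, Skill] = {}
        self.debug = debug

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def execute(self, skill_name: str, action: str, params: dict,
                user_approved: bool = False) -> str:
        skill = self._skills[skill_name]
        if skill.permission is Permission.CONFIRM and not user_approved:
            return f"'{skill_name}.{action}' needs explicit approval before it runs."
        if skill.permission is Permission.DEBUG_ONLY and not self.debug:
            return f"'{skill_name}.{action}' is only available in debug mode."
        return skill.actions[action](params)
```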
Memory and Context
Long-term user preferences and facts are stored locally in SQLite. The orchestrator receives a compact memory summary (preferences plus recent episode context) so it can personalise behaviour and avoid repeating mistakes. Short-term session context tracks recent turns and files for follow-up actions like “edit that image” or “summarise that PDF.”
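A minimal sketch of the local memory store is below; the table layout and method names are assumptions rather than the project's actual schema.

```python
import sqlite3

class MemoryStore:
    """Local key/value store for user preferences and facts."""

    def __init__(self, path: str = "localgem.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT)"
        )

    def remember(self, key: str, value: str) -> None:
        self.conn.execute(
            "INSERT INTO memory (key, value) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
            (key, value),
        )
        self.conn.commit()

    def summary(self, limit: int = 20) -> str:
        rows = self.conn.execute(
            "SELECT key, value FROM memory LIMIT ?", (limit,)
        ).fetchall()
        return "\n".join(f"- {k}: {v}" for k, v in rows)

store = MemoryStore(":memory:")
store.remember("preferred_language", "English")
print(store.summary())  # compact block that gets prepended to the orchestrator prompt
```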
Safety and UX
Responses are formatted for Telegram and chunked to respect message limits. Before running browser automation that may encounter logins or CAPTCHAs, the agent pauses and asks for confirmation. All actions are logged for transparency. Aside from API calls to Gemini and Telegram, all data remains on the local machine.
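The chunking step is small but easy to get wrong: Telegram caps a single message at 4096 characters, so long replies are split on paragraph boundaries where possible and hard-split only when a single paragraph is too long. A simplified version:

```python
TELEGRAM_LIMIT = 4096  # Telegram's per-message character cap

def chunk_message(text: str, limit: int = TELEGRAM_LIMIT) -> list[str]:
    """Split a long reply on paragraph boundaries, hard-splitting only when forced."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        while len(para) > limit:  # a single paragraph can still exceed the cap
            chunks.append(para[:limit])
            para = para[limit:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```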
Challenges I Faced
Computer use in real browsers. Reliable browser control meant handling focus issues, timing, misclicks, and layout differences. Even with the official Computer Use API, retries and verification were essential. Pausing at trust boundaries like logins and CAPTCHAs was critical for safety and judge confidence.
Orchestrator reliability. Under API hiccups or transient failures, the system needed to retry intelligently, fall back when appropriate, and still respond coherently to the user instead of failing silently or surfacing raw errors.
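The pattern that ended up everywhere was a small retry wrapper with exponential backoff and an optional fallback. The version below is a generic sketch; the real code also distinguishes retryable API errors from fatal ones.

```python
import random
import time

def with_retries(call, *, attempts: int = 3, base_delay: float = 1.0, fallback=None):
    """Retry a flaky call with exponential backoff; optionally fall back instead of raising."""
    for i in range(attempts):
        try:
            return call()
        except Exception as exc:          # in practice: only retryable API errors
            if i == attempts - 1:
                if fallback is not None:
                    return fallback(exc)  # e.g. a plain-language apology to the user
                raise
            time.sleep(base_delay * (2 ** i) + random.uniform(0, 0.5))
```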
Telling the story. The system already had planning, verification, and recovery built in, but that wasn’t obvious from the README at first. Adding explicit sections on Agent Behavior and Safety and Guardrails helped align the written story with what the system actually does—and with what judges look for in agentic projects.
The hardest parts were dealing with real-world messiness—browsers crashing, misclicks, CAPTCHAs appearing, asynchronous tasks failing, and deciding when to trust the agent versus when to loop the user in. Rather than trying to work around these issues, I designed the agent to pause when needed, verify what actually happened, retry intelligently, and ask for help when things became uncertain.
Integration was challenging as well. I used Antigravity during development and Gemini 3.0 as the model. Some features produced errors during setup, and it took time to correctly route both built-in search and computer-use functionality through the same system. Designing effective guardrails was another important part of the process.
LocalGem is built for the Gemini Hackathon and for personal use. All data stays on the user’s machine. Use at your own risk, and keep API keys and tokens in .env only.