ACEA — Autonomous Code Engineering Agent

Live Demo: https://acea-a-checkpointed-agentic-system.vercel.app/ (TRY IT YOURSELF!)

GitHub Repository: https://github.com/saumya-200/ACEA-A-Checkpointed-Agentic-System-for-Long-Horizon-Autonomous-Codebase-Improvement

Inspiration

Modern software development faces a critical paradox: AI tools can generate code rapidly, but developers remain fully responsible for validating correctness, running tests, debugging failures, fixing regressions, and ensuring security. Existing tools like GitHub Copilot and ChatGPT operate as suggestion engines, not autonomous executors. They assist with code creation but do not own outcomes.

This creates persistent challenges, including debugging overhead, context switching fatigue, test failures after AI edits, fragile integrations, lack of decision traceability, and no recovery from partial failures.

ACEA was inspired by a fundamental question: How can we move from AI code generation to AI-driven software engineering execution? We envisioned a system where developers provide high-level objectives and receive complete, tested, secure solutions rather than code snippets that require extensive manual validation and debugging.


What It Does

ACEA is a multi-agent autonomous software engineering platform that accepts a Git repository and a high-level objective, then autonomously plans, executes, validates, and delivers complete engineering outcomes.

Core Capabilities

Intelligent Planning ACEA uses advanced AI reasoning to analyze repository structure, interpret goals, generate structured plans, and assess risk per step. It operates at the objective level rather than the code level, understanding "reduce test flakiness" or "add authentication safely" as complete missions.

Multi-Agent Execution Six specialized agents coordinate tasks through an orchestration pipeline:

  • Architect Agent — Transforms user intent into structured blueprints, detects missing configurations, and injects necessary setup files
  • Virtuoso Agent — Handles code generation, batch file creation, scaffolding, and framework-aware implementation
  • Sentinel Agent — Performs security scanning using Bandit, Semgrep, and npm audit to identify vulnerabilities
  • Testing Agents — Validate functional correctness through Pytest execution and multi-framework test rotation
  • Watcher Agent — Provides browser-based verification using Playwright and vision AI for visual analysis
  • Release Agent — Handles packaging and artifact creation, including ZIP archives and downloadable builds
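The handoff between these agents can be pictured as a sequential pipeline over shared mission state. The sketch below is illustrative only; the class and field names are not from the ACEA codebase:

```python
from dataclasses import dataclass, field

# Illustrative only: class and field names are invented for this sketch.
@dataclass
class MissionState:
    objective: str
    findings: list = field(default_factory=list)

class ArchitectAgent:
    def run(self, state):
        # The real agent would call the reasoning model here.
        state.findings.append(f"plan: {state.objective}")
        return state

class SentinelAgent:
    def run(self, state):
        state.findings.append("security scan: clean")
        return state

def run_pipeline(agents, state):
    for agent in agents:
        state = agent.run(state)  # strict handoff: each sees the prior output
    return state

result = run_pipeline([ArchitectAgent(), SentinelAgent()],
                      MissionState("add authentication safely"))
```

Because every agent consumes and returns the same state object, failures stay isolated to one stage and agents can be tested or swapped independently.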

Tool-Driven Verification ACEA integrates real developer tools, including Pytest, Playwright, security scanners, and Git operations. It validates using actual execution rather than simulating correctness, ensuring outputs are production-ready.

Self-Healing Loop When validation fails, ACEA autonomously detects failures, diagnoses root causes, generates repair strategies, and retries with bounded iterations. It behaves like an engineer performing iterative debugging, attempting up to three repair cycles before escalating.
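A minimal sketch of such a bounded repair loop, with hypothetical `validate` and `repair` callables standing in for the real agents:

```python
MAX_REPAIRS = 3  # matches the three-cycle bound described above

def self_heal(validate, repair):
    """Validate; on failure, apply a repair and re-validate, with bounded retries."""
    for attempt in range(MAX_REPAIRS + 1):
        ok, error = validate()
        if ok:
            return {"status": "passed", "attempts": attempt}
        if attempt < MAX_REPAIRS:
            repair(error)  # a real repair strategy would be LLM-generated
    return {"status": "escalated", "attempts": MAX_REPAIRS}

# Toy harness: the "bug" is fully fixed after the second repair.
state = {"bug_level": 2}
result = self_heal(
    validate=lambda: (state["bug_level"] == 0, "TypeError"),
    repair=lambda err: state.update(bug_level=state["bug_level"] - 1),
)
```

The key property is the hard upper bound: the loop can never repair forever, so escalation to a human is guaranteed after three failed cycles.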

Long-Horizon Continuity ACEA supports checkpoint persistence, resume after interruption, and step-level state tracking, enabling multi-hour or multi-day autonomous missions.
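One way to picture step-level checkpoint persistence is a small store that records the last completed step so an interrupted mission can resume mid-run. This is a JSON-file sketch only; ACEA's actual persistence layer differs:

```python
import json, pathlib, tempfile

# Illustrative store; the real system's storage layer differs.
class CheckpointStore:
    def __init__(self, path):
        self.path = pathlib.Path(path)

    def save(self, step, state):
        # One JSON document per mission: last completed step plus its state.
        self.path.write_text(json.dumps({"step": step, "state": state}))

    def resume(self):
        # Fresh mission if no checkpoint exists; otherwise pick up mid-run.
        if not self.path.exists():
            return 0, {}
        data = json.loads(self.path.read_text())
        return data["step"], data["state"]

ckpt = CheckpointStore(pathlib.Path(tempfile.mkdtemp()) / "mission.json")
ckpt.save(3, {"last_agent": "sentinel"})
step, state = ckpt.resume()
```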

Evidence and Traceability Every action produces audit-grade artifacts, including test logs, browser screenshots, Git diffs, artifact bundles, and thought signatures containing intent, rationale, and confidence levels.


How We Built It

Technical Architecture

Backend Infrastructure Built on FastAPI for high-performance async operations, Socket.IO for real-time bidirectional communication, and SQLite for lightweight state management. The system leverages Python 3.11+ with modern async/await patterns.
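The async/await pattern underpinning this stack can be shown with a stdlib-only sketch: independent verification steps awaited concurrently, the same shape the FastAPI backend uses for concurrent agent operations (the check names are illustrative):

```python
import asyncio

# Stdlib-only sketch of the async/await pattern: independent verification
# steps run concurrently rather than serially.
async def run_check(name, seconds):
    await asyncio.sleep(seconds)  # stands in for a real tool invocation
    return f"{name}: ok"

async def verify_concurrently():
    # gather() awaits both checks at once, so total time is the max,
    # not the sum, of the individual checks.
    return await asyncio.gather(
        run_check("pytest", 0.01),
        run_check("bandit", 0.01),
    )

results = asyncio.run(verify_concurrently())
```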

AI Integration Google Gemini serves as the primary reasoning engine for architectural planning, code modification, debug guidance, and self-healing strategies. Gemini Vision provides visual UI analysis from browser screenshots.

Agent Orchestration Custom multi-agent graph system with state machine transitions, conditional routing based on success or failure, iterative retry logic bounded at three attempts, and agent-to-agent messaging protocols.

Verification Stack

  • Testing: Pytest with multi-framework detection and parallel execution
  • Browser Automation: Playwright with Chromium, screenshot capture, and vision-based analysis
  • Security: Bandit for Python SAST, Semgrep for multi-language scanning, npm audit for Node.js dependencies
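These tools are invoked as real processes rather than simulated. A hedged sketch of such wrappers, using each tool's standard CLI flags (`pytest -q`, `bandit -r -f json`) with a presence check for graceful degradation:

```python
import subprocess, json, shutil

# Illustrative wrappers around the real tools named above; ACEA's actual
# invocation layer may differ.
def run_pytest(path="."):
    # -q keeps output terse; returncode 0 means all tests passed.
    return subprocess.run(["pytest", path, "-q"], capture_output=True, text=True)

def run_bandit(path="."):
    # Graceful degradation: skip the scan when the scanner is absent.
    if shutil.which("bandit") is None:
        return None
    out = subprocess.run(["bandit", "-r", path, "-f", "json"],
                         capture_output=True, text=True)
    return json.loads(out.stdout or "{}")
```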

Sandbox Services

  • E2B VSCode Integration for cloud-based development environments with file synchronization and live preview URLs
  • Desktop Service providing a full IDE-like experience with noVNC remote sessions
  • Preview Proxy for semantic URL generation and session management

Development Process

The system was built in five phases:

  • Blueprint to Reality — establishing the orchestrator and communication layer
  • Agent Development — creating each specialized agent with clear responsibilities
  • Self-Healing Loop — implementing autonomous error recovery
  • Sandbox Integration — connecting to execution environments
  • Polish and Artifacts — adding logging and evidence generation

The self-healing mechanism required intelligent error parsing, context-aware fix generation, incremental repair strategies, and state management across retry cycles. Each agent was developed iteratively, starting with basic functionality and evolving to handle edge cases and integrate with the broader system.
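As an example of the kind of error parsing involved, a simple parser might lift structured failure context out of raw pytest output. The regex and field names here are illustrative, not ACEA's actual parser:

```python
import re

# Hypothetical failure parser: matches pytest's short summary lines.
FAILURE_RE = re.compile(r"FAILED (\S+?)::(\S+) - (\w+Error)")

def parse_failures(pytest_output):
    """Turn raw pytest summary lines into structured repair context."""
    return [
        {"file": f, "test": t, "error": e}
        for f, t, e in FAILURE_RE.findall(pytest_output)
    ]

sample = "FAILED tests/test_auth.py::test_login - AssertionError"
failures = parse_failures(sample)
```

Structured context like this is what lets a repair step target a specific file and test instead of regenerating code blindly.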


Challenges We Ran Into

Environment Inconsistency Code that passed locally would fail in sandbox environments. Tests succeeding in E2B crashed in noVNC, and browser automation working on Mac broke on Linux. We solved this through comprehensive containerization, platform detection, environment-specific test matrices, and graceful degradation strategies.
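A sketch of what such platform detection and graceful degradation can look like (the capability names are invented for illustration):

```python
import platform, shutil

# Capability names here are illustrative, not ACEA's actual probe.
def detect_capabilities():
    caps = {
        "os": platform.system().lower(),  # e.g. "linux", "darwin"
        "has_node": shutil.which("node") is not None,
        "has_chromium": any(shutil.which(b) for b in
                            ("chromium", "chromium-browser", "google-chrome")),
    }
    # Graceful degradation: skip browser verification where no browser exists.
    caps["browser_mode"] = "full" if caps["has_chromium"] else "headless-skip"
    return caps

caps = detect_capabilities()
```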

Agent Coordination Early versions created conflicts where agents would generate code that broke existing tests or triggered security flags in each other's output. We implemented strict agent handoff protocols, validation checkpoints between agents, and feedback loops enabling agents to learn from failures.

Infinite Retry Prevention Initial self-healing logic sometimes created progressively worse code while attempting fixes. We capped retry loops at three attempts, added improvement detection to assess whether errors were getting better or worse, implemented rollback checkpoints, and created fix diversity to try fundamentally different approaches.
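The improvement-detection idea reduces to a simple trend check over error counts between attempts; a minimal sketch:

```python
# Illustrative improvement detector: repairs continue only while the error
# count trends downward; otherwise the system rolls back to a checkpoint.
def is_improving(history):
    """history: error counts observed after each repair attempt."""
    return len(history) < 2 or history[-1] < history[-2]

trend_good = is_improving([5, 3])  # fewer errors: keep repairing
trend_bad = is_improving([3, 4])   # regression: roll back instead
```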

Visual Verification Accuracy Browser screenshots initially captured loading spinners, blank pages, and 404 errors, with vision analysis incorrectly reporting success. We added intelligent wait conditions for specific elements, retry logic for screenshots, expected state validation, and screenshot diffing against baselines.
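Using Playwright's public sync API, the hardened capture flow described above might be sketched like this; the selector, default paths, and 404 check are illustrative:

```python
def capture_ready_page(url, ready_selector="#app", path="shot.png"):
    # Imported inside the function so the module loads without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait out pending requests
        page.wait_for_selector(ready_selector)    # wait for the app to render
        if "404" in page.title():                 # expected-state validation
            browser.close()
            raise RuntimeError(f"unexpected page state at {url}")
        page.screenshot(path=path, full_page=True)
        browser.close()
```

Waiting on both network idleness and a concrete selector is what keeps the screenshot from capturing spinners or blank pages.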

State Management Complexity Checkpoint persistence faced race conditions with SQLite in our async architecture. Database tables were defined but underutilized due to async safety concerns, leading to heavy filesystem reliance and occasional state corruption during concurrent writes. We strategically deferred comprehensive checkpoint persistence for future development.

Security vs Speed Balance The Sentinel agent initially flagged too many false positives, while Virtuoso prioritized speed over security patterns. We created security profiles for different contexts, added context-aware scanning, implemented whitelist patterns, and built agent negotiation protocols.

Cross-Origin Resource Sharing Preview URLs and desktop previews worked independently but failed when orchestrated through the proxy due to CORS restrictions. We implemented a centralized proxy handling CORS internally, routed all external services through one gateway, and created dynamic origin whitelisting based on session tokens.
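Dynamic origin whitelisting keyed on session tokens can be sketched as a pure function that decides which CORS headers the gateway echoes back (the tokens and URLs below are placeholders):

```python
# Placeholder session table: token -> the preview origin it may use.
ACTIVE_SESSIONS = {"tok123": "https://preview-tok123.example.dev"}

def cors_headers(origin, session_token):
    """Return CORS headers only when the origin matches the session's preview URL."""
    allowed = ACTIVE_SESSIONS.get(session_token)
    if origin and origin == allowed:
        return {
            "Access-Control-Allow-Origin": origin,
            "Access-Control-Allow-Credentials": "true",
            "Vary": "Origin",  # keep caches from reusing the header cross-origin
        }
    return {}  # no CORS headers: the browser blocks the cross-origin response

ok = cors_headers("https://preview-tok123.example.dev", "tok123")
blocked = cors_headers("https://evil.example", "tok123")
```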


Accomplishments That We're Proud Of

Autonomous Engineering Execution We successfully built a system that completes full engineering objectives autonomously rather than just generating code snippets. ACEA delivers working, tested, secure features end-to-end.

Functional Self-Healing The self-healing loop works reliably, analyzing failures intelligently, generating targeted fixes, applying repairs autonomously, re-verifying after every fix, and trying multiple approaches before escalating.

Coordinated Multi-Agent System Six specialized agents work in harmony with clean handoffs, isolated failure domains, independent testability, and the ability to swap models per agent. This architecture enables continuous improvement of individual components without system-wide changes.

Vision-Based Browser Verification Integration of Playwright browser automation with Gemini Vision enables semantic validation beyond functional correctness, verifying visual layout, styling consistency, and UI element positioning.

Security-First Development Security scanning runs during development rather than after deployment, with every code change receiving SAST scanning, dependency vulnerability checks, pattern matching for anti-patterns, and automated fix suggestions.

Evidence-Grade Output Every decision and action is backed by a complete audit trail, including test output logs, before/after code diffs, screenshot evidence, security scan reports, and step-by-step execution traces.

Production-Grade Verification Real execution using actual developer tools rather than simulation ensures outputs are genuinely production-ready and thoroughly validated.


What We Learned

AI Engineering Insights LLMs excel at reasoning but struggle with precision, making specialized agents the optimal architecture. Context windows constrain how much code can be processed at once, requiring intelligent chunking and semantic search. Prompt engineering significantly impacts agent reliability and output quality. AI agents require strict guardrails, including bounded retry loops, validation checkpoints, improvement metrics, and rollback mechanisms.

Architectural Lessons Multi-agent architecture proves superior to monolithic approaches for debugging, iteration, and independent improvement. Real-time communication enables live debugging, immediate failure detection, and distributed coordination. State management in distributed async systems requires careful planning for transactional consistency and concurrent access patterns.

Testing and Verification Browser-based visual testing catches issues that unit and integration tests miss, including CSS rendering problems and responsive design failures. Automated security scanning must be integrated into the development pipeline by default. Flaky tests create more problems than consistently failing tests and require immediate attention.

Development Process Building core functionality before the user interface enables deep focus on complex backend logic. Comprehensive documentation clarifies thinking and reveals edge cases. Demo-driven development ensures continuous deliverability. Ruthless scope management enables shipping excellent core features rather than a broad but unfinished feature set.

Technical Decisions FastAPI's async-first architecture enabled concurrent agent operations. Socket.IO provided essential real-time observability. Gemini's reasoning capabilities proved effective for architectural planning. Playwright's modern async API simplified browser automation. Multi-agent graphs allowed independent agent evolution.


What's Next for ACEA

User Interface Development Building a comprehensive web dashboard with live mission progress visualization, side-by-side code diff viewer, expandable test output console, screenshot timeline gallery, artifact browser, and searchable mission history.

Robust Checkpoint Persistence Migrating from SQLite to PostgreSQL for async-friendly operations, implementing proper transaction management, adding incremental state snapshots, and building reliable resume logic for long-running missions.

Multi-Repository Operations Enabling ACEA to orchestrate changes across microservices, monorepos, and dependency chains with dependency graph analysis, cross-repo testing, atomic commits, and coordinated rollback strategies.

Enhanced Self-Healing Intelligence Implementing learning from failure patterns, building debugging playbooks, enabling collaborative debugging with consensus-based fixes, and developing predictive repair to detect fragile patterns before failures occur.

Advanced Security Features Adding automatic vulnerability patching beyond detection, supply chain analysis for dependency trees, runtime security monitoring, and automated rollback for suspicious changes.

Proactive Mission Planning Developing capabilities for ACEA to suggest missions based on codebase analysis, monitor code quality metrics, identify technical debt hotspots, and predict maintenance overhead.

Human-in-the-Loop Mode Creating collaborative workflows with approval gates for mission plans, interactive debugging for ambiguous situations, and combining AI speed with human judgment.

Advanced Analytics Generating detailed mission reports with time breakdowns and complexity scoring, providing codebase intelligence on change patterns and blast radius analysis, and tracking team metrics for continuous improvement.

Plugin Ecosystem Enabling custom agent development, tool integrations with project management and monitoring platforms, and expanded language support beyond Python and JavaScript.

Production Deployment Architecting for Kubernetes deployment with Redis job queuing, PostgreSQL persistence, distributed agent execution, multi-tenancy, SSO authentication, role-based access control, and comprehensive audit logging.


Summary

ACEA represents a fundamental shift from AI code assistance to autonomous software engineering execution. By coordinating specialized agents through a verification-first pipeline with self-healing capabilities, ACEA delivers complete engineering outcomes rather than code snippets requiring manual validation.

The system demonstrates that AI can operate at the objective level rather than the code level, handling full development lifecycles including planning, implementation, testing, security scanning, and artifact generation. This reduces developer debugging burden, eliminates manual verification overhead, minimizes regression risks, and increases engineering velocity while maintaining high reliability and traceability standards.

ACEA proves that long-horizon autonomous software engineering is achievable today, setting the foundation for a future where AI systems deliver verified, resilient engineering outcomes as standard practice.
