Inspiration

Software teams spend up to 40% of development time writing and maintaining test scripts. The UI changes, selectors break, tests go red — and bugs still reach production. We've lived this pain across 20+ years of enterprise software development, watching teams burn weeks fixing flaky Selenium tests that add zero business value.

When Gemini 3 launched with its massive context window and advanced reasoning capabilities, we saw an opportunity: what if AI could test your web app autonomously — understanding intent, adapting to UI changes, and healing itself when selectors break? Not another record-and-replay tool, but a genuine AI agent that reasons about what to test and how.

What it does

Ai2QA is an autonomous QA testing platform powered entirely by Gemini 3. Paste any URL, pick one of four AI personas, and watch it test — zero scripts, zero setup.

Four personas, four strategies:

  • The Performance Hawk — captures Core Web Vitals (CLS, TTFB, FCP, LCP), flags performance bottlenecks with severity ratings
  • The Gremlin (CHAOS) — chaos engineering agent that rage-clicks, tests edge cases, and exposes fragile code
  • The White Hat (HACKER) — live penetration testing for OWASP Top 10 vulnerabilities (XSS, SQL injection)
  • The Auditor (STANDARD) — methodical regression testing, validates business logic with surgical precision

Each persona has a tuned temperature (0.2–0.6) and a specialized system prompt that shapes Gemini 3's reasoning and testing behavior.
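
Concretely, a persona boils down to a temperature plus a system prompt. The sketch below is illustrative rather than our exact class: the 0.2 (Auditor) and 0.6 (Gremlin) values are the real endpoints, while the intermediate temperatures and the prompt wording are placeholders.

```java
// Illustrative persona model: each value pairs a Gemini 3 temperature with a system prompt.
public enum TestPersona {
    AUDITOR(0.2, "You are a methodical regression tester. Validate business logic step by step."),
    PERFORMANCE_HAWK(0.3, "You are a performance analyst. Capture Core Web Vitals and flag bottlenecks."),
    WHITE_HAT(0.4, "You are a penetration tester. Probe for OWASP Top 10 issues such as XSS and SQL injection."),
    GREMLIN(0.6, "You are a chaos agent. Rage-click, feed edge-case input, and try to break the page.");

    private final double temperature;
    private final String systemPrompt;

    TestPersona(double temperature, String systemPrompt) {
        this.temperature = temperature;
        this.systemPrompt = systemPrompt;
    }

    public double temperature() { return temperature; }
    public String systemPrompt() { return systemPrompt; }
}
```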

Key features:

  • Self-healing tests — when a selector breaks, the agent takes a DOM snapshot, asks Gemini 3 to locate the new element, and continues autonomously (see the sketch after this list)
  • ARIA accessibility snapshots — instead of sending full HTML to Gemini, we use compact ARIA tree representations via MCP, dramatically reducing token usage and improving response speed
  • Two-stage security — a PlanSanitizer and PromptInjectionDetector screen every AI-generated action before execution, preventing off-target navigation or malicious steps
  • Full reports — health checks, console exceptions, accessibility grades, performance metrics, severity-tagged issues, and step-by-step execution timelines
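
The self-healing loop mentioned above is conceptually simple. A rough sketch follows; names like BrowserAction, ariaSnapshot, and withSelector are simplified for readability and are not the exact production API.

```java
// Rough sketch of self-healing: when a selector stops matching, take a fresh accessibility
// snapshot, ask Gemini 3 for the element that matches the original intent, and retry the step.
BrowserAction heal(BrowserAction failedAction) {
    String snapshot = browserDriver.ariaSnapshot();   // compact ARIA tree fetched via the MCP bridge
    String prompt = """
            The selector '%s' no longer matches any element.
            Using the accessibility snapshot below, return the reference of the element
            that best matches the intent "%s".

            %s""".formatted(failedAction.selector(), failedAction.intent(), snapshot);
    String healedRef = gemini.generate(persona.systemPrompt(), prompt, persona.temperature());
    return failedAction.withSelector(healedRef);      // the orchestrator re-executes the step
}
```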

How we built it

Architecture: Hexagonal (ports & adapters) with clean separation:

| Module | Role |
| --- | --- |
| ai2qa-domain-core | Pure Java domain models — no framework dependencies. Records, value objects, port interfaces (ActionQueuePort, DoneQueuePort, BrowserDriverPort), and a functional Result<T> type |
| ai2qa-application | Business logic: AgentOrchestrator coordinates execution, StepPlanner uses Gemini 3 to decide the next action, Reflector analyzes results after each step |
| ai2qa-mcp-bridge | MCP protocol integration — McpClient communicates via stdin/stdout with a Node.js Playwright server exposing ClickTool, TypeTool, NavigateTool, ScreenshotTool, and SnapshotTool (ARIA) |
| ai2qa-infra-jpa | H2 in-memory database with Flyway migrations locally; GCP Cloud SQL (PostgreSQL) in production |
| ai2qa-web-api | REST controllers |
| frontend | Next.js 16 + React 19 dashboard |
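
To give a feel for the domain core, here is roughly what the Result type and a port look like. This is a simplified sketch (the two types live in separate files, and the real interfaces carry more methods and richer error types).

```java
// Framework-free domain core: a functional Result plus a port the application layer depends on.
// The MCP bridge supplies the BrowserDriverPort adapter at runtime.
public sealed interface Result<T> {
    record Ok<T>(T value) implements Result<T> {}
    record Err<T>(String reason) implements Result<T> {}

    default <R> Result<R> map(java.util.function.Function<T, R> fn) {
        return switch (this) {
            case Ok<T> ok -> new Ok<>(fn.apply(ok.value()));
            case Err<T> err -> new Err<>(err.reason());
        };
    }
}

// Separate file in the real module.
interface BrowserDriverPort {
    Result<String> navigate(String url);
    Result<String> click(String elementRef);
    Result<String> ariaSnapshot();
}
```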

Tech stack: Java 21, Spring Boot 3.4, Gemini 3 via Vertex AI, MCP protocol for Chrome DevTools, Playwright for browser automation, Next.js 16 frontend.

Gemini 3 integration is pervasive — it powers the StepPlanner (deciding what action to take next), the Reflector (analyzing outcomes), the PersonaPromptComposer (shaping behavior per persona), and the self-healing loop (finding replacement selectors from DOM snapshots). Every test step involves at least one Gemini 3 API call.
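
The whole agent reduces to a plan → act → reflect loop, roughly like this. It is a simplified sketch with illustrative types (TestSession, PlannedStep, Reflection); the real AgentOrchestrator also drives the queues, self-healing, and report building.

```java
// Simplified agent loop: every iteration involves at least one Gemini 3 call.
void runSession(TestSession session) {
    while (!session.isComplete()) {
        PlannedStep step = stepPlanner.planNext(session);            // Gemini 3 picks the next action
        Result<String> outcome = mcpBridge.execute(step);            // Playwright performs it via MCP
        Reflection reflection = reflector.analyze(step, outcome);    // Gemini 3 interprets the outcome
        session.record(step, outcome, reflection);
        if (reflection.requestsStop()) {
            break;                                                   // goal reached or unrecoverable failure
        }
    }
}
```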

Challenges we ran into

Gemini 3 rate limits in production. Under heavy testing, we hit 429 rate limits frequently. We implemented exponential backoff retry, and seeing the GCP logs show RATE_LIMITED → retry → success → test continues was both stressful and satisfying. Production resilience matters.
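
The retry itself is nothing exotic; here is a trimmed-down version (jitter and configuration omitted, and RateLimitedException is an illustrative stand-in for an HTTP 429 from the API).

```java
import java.util.function.Supplier;

// Trimmed-down retry helper: exponential backoff on rate-limited Gemini 3 calls.
final class GeminiRetry {
    static final class RateLimitedException extends RuntimeException {}   // stand-in for an HTTP 429

    static <T> T withBackoff(Supplier<T> call) throws InterruptedException {
        long delayMs = 1_000;
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                return call.get();
            } catch (RateLimitedException e) {
                if (attempt == 5) throw e;        // give up after the last attempt
                Thread.sleep(delayMs);
                delayMs *= 2;                     // 1s, 2s, 4s, 8s between attempts
            }
        }
        throw new IllegalStateException("unreachable");
    }
}
```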

ARIA snapshots vs. full DOM tradeoff. Full HTML pages can be 500KB+, which blows through tokens and slows reasoning. We switched to ARIA accessibility tree snapshots via MCP's SnapshotTool, which gives Gemini compact, semantically meaningful element references. The tradeoff: some visual-only elements aren't in the ARIA tree, so we fall back to screenshots for visual verification.
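
In code the tradeoff shows up as a single decision per step. A sketch, assuming an illustrative needsVisualVerification flag on the planned step:

```java
// Sketch: feed Gemini 3 the compact ARIA tree by default, and only fall back to pixels
// when the step needs visual verification the accessibility tree can't express.
String buildModelContext(PlannedStep step) {
    if (step.needsVisualVerification()) {
        return browserDriver.screenshotBase64();   // visual-only elements need a screenshot
    }
    return browserDriver.ariaSnapshot();           // a few KB of semantic refs instead of 500KB+ of HTML
}
```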

Prompt injection defense for an autonomous agent. An AI agent that navigates the open web is inherently risky — malicious pages could inject instructions. We built a two-stage security pipeline: PromptInjectionDetector scans for injection patterns, PlanSanitizer validates every planned action against allowed targets, and TargetGuardService enforces URL boundaries. This was the hardest engineering challenge — balancing agent autonomy with safety.
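
The screening boils down to a short chain of checks before any planned step is handed to the browser. A sketch, with method names like looksInjected, withinBoundary, and sanitize simplified for the write-up:

```java
// Sketch of pre-execution screening: reject injected instructions, keep navigation
// inside the target under test, and normalize the step against an action allow-list.
Result<PlannedStep> screen(PlannedStep step, java.net.URI allowedOrigin) {
    if (promptInjectionDetector.looksInjected(step.rawModelOutput())) {
        return new Result.Err<>("possible prompt injection in model output");
    }
    if (step.isNavigation() && !targetGuardService.withinBoundary(step.targetUrl(), allowedOrigin)) {
        return new Result.Err<>("navigation outside the target under test: " + step.targetUrl());
    }
    return new Result.Ok<>(planSanitizer.sanitize(step));
}
```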

Obstacle detection. Real websites throw cookie banners, GDPR popups, newsletter modals, and age verification gates at you. Our ObstacleDetector maintains pattern lists for common consent buttons and dismisses them autonomously so the actual testing can proceed.
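
A flavor of that pattern matching, with illustrative labels and helper types (the real lists are longer and cover multiple languages):

```java
import java.util.List;

// Illustrative obstacle dismissal: if a visible button matches a known consent/modal label,
// click it so the actual test plan can continue.
final class ObstacleDetector {
    record UiButton(String ref, String label) {}   // simplified element reference from the ARIA snapshot

    private static final List<String> DISMISS_LABELS = List.of(
            "accept all", "agree", "got it", "no thanks", "continue without accepting", "i am over 18");

    boolean tryDismiss(List<UiButton> visibleButtons, BrowserDriverPort browser) {
        for (UiButton button : visibleButtons) {
            String label = button.label().toLowerCase();
            if (DISMISS_LABELS.stream().anyMatch(label::contains)) {
                browser.click(button.ref());
                return true;   // banner dismissed; the planner re-snapshots and carries on
            }
        }
        return false;
    }
}
```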

Accomplishments that we're proud of

  • Four distinct AI personas that produce genuinely different test reports on the same URL
  • Self-healing tests that survive UI changes without human intervention
  • The Performance Hawk capturing real Core Web Vitals and producing severity-tagged, actionable performance reports
  • A hexagonal architecture where the domain core has zero framework dependencies
  • Security-first design with PlanSanitizer + PromptInjectionDetector screening every AI action

What we learned

  • Gemini 3's reasoning capabilities are strong enough to drive a complex, multi-step autonomous agent — but prompt engineering for consistency across hundreds of test steps is an art
  • MCP protocol is a game-changer for browser automation — it provides a clean abstraction layer between AI reasoning and browser actions
  • Temperature tuning per persona (from 0.2 for the methodical Auditor, to 0.6 for the unpredictable Gremlin) has a dramatic effect on testing behavior and coverage
  • Building AI safety for autonomous web agents is a fundamentally different challenge than chatbot safety — the agent can act, not just speak

What's next for Ai2QA

  • Skill absorption — learning reusable testing patterns from GitHub repositories (the SkillAbsorptionService is already scaffolded)
  • Multi-page test flows — chaining autonomous tests across login → dashboard → checkout sequences
  • CI/CD integration — triggering persona-based test suites from GitHub Actions
  • Custom personas — users define their own testing personalities with custom system prompts
  • Playwright export refinement — generating production-ready Page Object Model patterns from autonomous test runs