Inspiration

As AI agents become increasingly integrated into critical systems like banking and healthcare, the question of safety before deployment becomes paramount. I was inspired by the Datadog Challenge's focus on LLM observability and the real-world need to test AI agents at scale before they handle sensitive operations. Traditional software testing doesn't translate well to AI systems: you can't write unit tests for unpredictable LLM behavior. So I set out to build a platform that could battle-test multiple AI agent variants simultaneously against real-world security threats, with full observability into their decision-making.

What it does

Shadow Agent Tournament is an enterprise AI agent evaluation platform that tests 10 agent variants with different safety levels (STRICT, BALANCED, RELAXED) against three real-world banking compliance scenarios:

  • High-Value Wire Reversal ($125,000): Tests unauthorized access and amount limit breaches
  • KYC/PII Extraction Attempt: Simulates social engineering attacks
  • Unsupervised Refund Loop: Detects pattern-based fraud attempts

The platform provides:

  • Real-time Observability: Streams LLM telemetry to Datadog (RUM, Logs, Session Replay)
  • Intelligent Analysis: Uses Google Gemini AI to explain security blocks and suggest remediation
  • Multi-Agent Risk Detection: Tracks data sharing between agents and visualizes chained violations
  • Actionable Incidents: Automatically creates Datadog incidents with full context for AI engineers
  • Live Dashboard: Real-time monitoring of agent behavior, violation heatmaps, and decision trees

How we built it

I built this as a full-stack TypeScript application with a focus on real-time observability and AI integration.

Frontend (React + TypeScript + Vite)

  • Built a real-time dashboard using React with TypeScript for type safety
  • Implemented live tournament simulation with parallel agent execution
  • Created interactive visualizations: decision trees, violation heatmaps, data flow graphs
  • Integrated Datadog Browser SDK for Real User Monitoring (RUM) and Logs
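
A minimal sketch of that wiring, assuming the standard @datadog/browser-rum and @datadog/browser-logs packages; the IDs, service name, and the logAgentDecision helper are placeholders rather than the project's actual identifiers:

```typescript
import { datadogRum } from "@datadog/browser-rum";
import { datadogLogs } from "@datadog/browser-logs";

// RUM captures sessions and replays; Logs streams per-agent telemetry.
datadogRum.init({
  applicationId: "<DD_APPLICATION_ID>",
  clientToken: "<DD_CLIENT_TOKEN>",
  site: "datadoghq.com",
  service: "shadow-agent-tournament",
  sessionSampleRate: 100,
  sessionReplaySampleRate: 100,
  trackUserInteractions: true,
});
datadogRum.startSessionReplayRecording();

datadogLogs.init({ clientToken: "<DD_CLIENT_TOKEN>", site: "datadoghq.com" });

// One structured log plus one RUM action per agent decision keeps
// violations queryable and ties them to the session replay.
export function logAgentDecision(agentId: string, scenario: string, blocked: boolean) {
  datadogLogs.logger.info("agent_decision", { agentId, scenario, blocked });
  datadogRum.addAction("agent_decision", { agentId, scenario, blocked });
}
```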

Backend (Express.js)

  • Built a lightweight Express server to proxy Datadog API calls securely
  • Implemented incident creation endpoints that forward security blocks to Datadog
  • Added graceful error handling to prevent API failures from blocking the tournament
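
A sketch of that proxy, with the route path, payload shape, and port as assumptions: credentials stay server-side, a 5-second timeout fails fast, and fallback logging keeps a Datadog outage from blocking a run.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Forward security blocks to Datadog's incident API without exposing
// API keys to the browser. AbortSignal.timeout requires Node 18+.
app.post("/api/incidents", async (req, res) => {
  try {
    const dd = await fetch("https://api.datadoghq.com/api/v2/incidents", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "DD-API-KEY": process.env.DD_API_KEY!,
        "DD-APPLICATION-KEY": process.env.DD_APP_KEY!,
      },
      body: JSON.stringify(req.body),
      signal: AbortSignal.timeout(5_000), // fail fast instead of hanging the tournament
    });
    res.status(dd.status).json(await dd.json());
  } catch (err) {
    console.error("Datadog incident creation failed, continuing:", err);
    res.status(202).json({ forwarded: false }); // degrade gracefully
  }
});

app.listen(3001);
```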

Key Integrations:

  • Datadog Browser SDK: Streams LLM telemetry, logs, and session replays in real-time
  • Datadog API: Creates actionable incidents when security violations occur
  • Google Gemini API: Provides AI-powered forensic reasoning and remediation suggestions
  • Custom Rate Limiter: Implements a token bucket algorithm, capped at 8 RPM, to stay safely under Gemini API quotas (sketched below)
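
A sketch of the limiter and the Gemini call it guards, assuming the @google/generative-ai SDK; the class and function names are illustrative, not the project's actual code:

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";

// Token bucket: refill one token every 60000 / 8 ms, i.e. at most 8 requests/min.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();
  private waiters: Array<() => void> = [];

  constructor(private capacity: number, private refillIntervalMs: number) {
    this.tokens = capacity;
  }

  private refill(): void {
    const earned = Math.floor((Date.now() - this.lastRefill) / this.refillIntervalMs);
    if (earned > 0) {
      this.tokens = Math.min(this.capacity, this.tokens + earned);
      this.lastRefill += earned * this.refillIntervalMs;
    }
  }

  /** Resolves once a token is available; bursts beyond capacity wait in a queue. */
  async acquire(): Promise<void> {
    this.refill();
    // Take a token immediately only if nobody is queued ahead of us.
    if (this.waiters.length === 0 && this.tokens > 0) {
      this.tokens--;
      return;
    }
    await new Promise<void>((resolve) => {
      this.waiters.push(resolve);
      setTimeout(() => this.drain(), this.refillIntervalMs);
    });
  }

  private drain(): void {
    this.refill();
    while (this.tokens > 0 && this.waiters.length > 0) {
      this.tokens--;
      this.waiters.shift()!();
    }
    if (this.waiters.length > 0) setTimeout(() => this.drain(), this.refillIntervalMs);
  }
}

const geminiLimiter = new TokenBucket(8, 60_000 / 8);
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-2.5-flash-lite" });

// Every forensic-analysis call waits for the limiter before hitting the API.
export async function explainBlock(incident: string): Promise<string> {
  await geminiLimiter.acquire();
  const result = await model.generateContent(
    `Explain why this banking action was blocked and suggest remediation:\n${incident}`,
  );
  return result.response.text();
}
```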

Technical Highlights:

  • Permission-based guardrails using JSON Schema validation (see the sketch after this list)
  • Multi-agent data flow tracking to detect PII exposure and sharing
  • Chained violation detection for multi-hop security risks
  • Shadow replay mode to test policy improvements before/after
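
For the guardrail item above, a sketch using the Ajv JSON Schema validator; the schema, the $10,000 cap, and the tool-call shape are illustrative, not the project's actual policy:

```typescript
import Ajv from "ajv";

const ajv = new Ajv({ allErrors: true }); // report every violation, not just the first

// Each tool call is validated against a JSON Schema before execution.
// A STRICT agent might enforce an amount cap and require a human approver.
const wireTransferSchema = {
  type: "object",
  properties: {
    action: { const: "wire_transfer" },
    amountUsd: { type: "number", maximum: 10000 },
    approvedBy: { type: "string", minLength: 1 },
  },
  required: ["action", "amountUsd", "approvedBy"],
  additionalProperties: false,
};

const validateWire = ajv.compile(wireTransferSchema);

export function checkGuardrail(toolCall: unknown): { allowed: boolean; violations: string[] } {
  if (validateWire(toolCall)) return { allowed: true, violations: [] };
  return {
    allowed: false,
    violations: (validateWire.errors ?? []).map((e) => `${e.instancePath} ${e.message}`.trim()),
  };
}

// The $125,000 reversal scenario with no approver is blocked with two violations.
console.log(checkGuardrail({ action: "wire_transfer", amountUsd: 125000 }));
```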

Challenges we ran into

  1. Gemini API Quota Management: The free tier has strict limits (20 requests/day, 10 RPM). I built a custom token-bucket rate limiter, capped at 8 RPM with request queuing, to handle bursts gracefully, and conserved quota by calling Gemini only for high-risk incidents.
  2. Tournament Completion Logic: Initially, the tournament would hang because all 10 agents ran in parallel and completion detection was unreliable. I fixed this with a completion counter plus a safety timeout so a run can never hang indefinitely (see the first sketch after this list).
  3. Datadog API Integration: The backend's calls to Datadog would sometimes fail or time out, blocking the tournament. I implemented graceful degradation with 5-second timeouts and fallback logging, ensuring the tournament always completes even if Datadog is unavailable.
  4. Model Compatibility: I initially tried gemma-3-12b, which returned 404 errors; it turned out not to be available in the v1beta API. I switched to gemini-2.5-flash-lite, which works reliably.
  5. Multi-Agent Data Flow Visualization: Detecting and visualizing PII sharing between agents required a graph data structure that tracks data exposure and forwarding, with special handling for scenarios that don't involve payments (see the second sketch after this list).
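
A sketch of the completion fix from item 2; the names and the two-minute budget are assumptions:

```typescript
// Run all agents in parallel; a crashed agent still counts toward completion.
async function runTournament(agents: Array<() => Promise<void>>, timeoutMs = 120_000) {
  let completed = 0;
  const all = Promise.all(
    agents.map((run) =>
      run()
        .catch((err) => console.error("agent failed:", err))
        .finally(() => console.log(`progress: ${++completed}/${agents.length}`)),
    ),
  );
  const timeout = new Promise<"timeout">((resolve) =>
    setTimeout(() => resolve("timeout"), timeoutMs),
  );
  // The race guarantees the tournament ends even if an agent never settles.
  if ((await Promise.race([all, timeout])) === "timeout") {
    console.warn(`safety timeout: ${completed}/${agents.length} agents finished`);
  }
}
```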
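
And a sketch of the graph structure from item 5; the edge shape and agent names are illustrative:

```typescript
// Directed edges record "agent A forwarded field X to agent B".
type DataFlowEdge = { from: string; to: string; field: string; isPii: boolean };

class DataFlowGraph {
  private edges: DataFlowEdge[] = [];

  record(edge: DataFlowEdge): void {
    this.edges.push(edge);
  }

  /** Depth-first walk returning every multi-hop path a PII field travelled. */
  piiChains(): string[][] {
    const chains: string[][] = [];
    const walk = (node: string, path: string[]) => {
      for (const e of this.edges) {
        if (e.from === node && e.isPii && !path.includes(e.to)) {
          const next = [...path, e.to];
          if (next.length > 2) chains.push(next); // two or more hops: a chained violation
          walk(e.to, next);
        }
      }
    };
    for (const source of new Set(this.edges.filter((e) => e.isPii).map((e) => e.from))) {
      walk(source, [source]);
    }
    return chains;
  }
}

const graph = new DataFlowGraph();
graph.record({ from: "kyc-agent", to: "support-agent", field: "ssn", isPii: true });
graph.record({ from: "support-agent", to: "email-agent", field: "ssn", isPii: true });
console.log(graph.piiChains()); // [["kyc-agent", "support-agent", "email-agent"]]
```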

Accomplishments that we're proud of

  • Complete Datadog Integration: Successfully streams LLM telemetry, creates actionable incidents, and provides real-time observability—all requirements for the Datadog Challenge met.
  • Production-Ready Error Handling: The application gracefully handles API failures, quota exhaustion, and network issues without crashing, demonstrating real-world reliability.
  • Intelligent AI Analysis: Gemini provides contextual, compliance-focused explanations (AML, KYC, fraud prevention) rather than generic responses, making it actually useful for security engineers.
  • Multi-Agent Risk Detection: Built sophisticated logic to detect chained violations and visualize data flow between agents, something that traditional testing tools don't address.
  • Beautiful, Functional UI: Created a polished dashboard with real-time updates, interactive visualizations, and smooth animations, making complex security data easy to understand.

What we learned

  • LLM Observability is Critical: Understanding how AI agents make decisions in production requires specialized telemetry—standard application monitoring isn't enough.
  • Rate Limiting Strategies: Learned to implement token bucket algorithms and request queuing to handle API quotas efficiently, a skill that's essential for production AI applications.
  • Graceful Degradation: Building systems that work even when external services fail is crucial—users shouldn't see errors just because an API quota is exhausted.
  • Multi-Agent Security: Discovered that security risks in AI systems aren't just about single agents—data sharing between agents creates new attack vectors that need specialized detection.
  • Real-Time Data Visualization: Building live dashboards that update smoothly while processing complex multi-agent simulations requires careful state management and performance optimization.

What's next for Shadow Agent Tournament

  • Expanded Scenario Library: Add more compliance scenarios (HIPAA, PCI-DSS, GDPR) to test agents across different regulatory domains
  • Agent Training Mode: Allow users to upload custom agent configurations and test them against the scenario library
  • Automated Policy Generation: Use Gemini to automatically generate guardrail policies based on incident patterns
  • Integration with CI/CD: Create a GitHub Action that runs agent tournaments as part of the deployment pipeline
  • Advanced Analytics: Add machine learning models to predict which agents will fail before running the tournament
  • Team Collaboration: Add features for security teams to review incidents together and vote on policy changes
  • Production Deployment: Deploy as a SaaS platform so teams can evaluate their agents without setting up infrastructure

Built With

  • datadog
  • express.js
  • gemini
  • react
  • typescript
  • vite