Inspiration

Every SRE knows the pain: your dashboard shows "all green" - CPU fine, memory fine, no errors - but users are reporting the app is down. You spend the next 2 hours SSH-ing into servers, running netstat, tcpdump, strace, copying outputs into ChatGPT, and manually connecting the dots.

We saw production issues that took hours to debug manually, yet followed repeatable investigation patterns. What if we could encode expert SRE knowledge into autonomous workflows? What if AI agents could orchestrate kernel-level tools, correlate evidence across data sources, and diagnose root causes in seconds instead of hours?

That's why we built KAI - the first agentic system for autonomous infrastructure debugging.

What it does

KAI (Kernel Agentic Intelligence) is an autonomous investigation framework that orchestrates eBPF programs, system tools, and AI analysis through multi-step workflows. It provides:

  • Flow-based orchestration engine that executes multi-step investigation workflows defined in YAML
  • eBPF CO-RE integration for kernel-level observability (TCP tracing, syscall monitoring, network flows)
  • Multi-backend sensor system supporting system commands, eBPF programs, Cilium Hubble, and Tetragon
  • AI-powered correlation using Claude (with planned support for OpenAI, Gemini, and local Llama)
  • Autonomous investigation that collects data, passes context between steps, and uses AI to find root causes
  • Production-ready flows for network debugging, memory leak detection, CPU saturation analysis, and security monitoring

Instead of manually debugging, you define investigation workflows. KAI executes them autonomously - collecting kernel traces, running commands, and using AI to correlate evidence - delivering root cause analysis in seconds.

How we built it

KAI is built entirely in Go, designed as an agentic orchestration system:

  • Flow runner that executes multi-step investigation workflows with data passing between steps
  • Backend registry supporting system commands (netstat, ps), eBPF CO-RE programs, Hubble flows, and Tetragon security events
  • Agent framework wrapping Anthropic's Claude API with structured request/response schemas
  • eBPF integration using cilium/ebpf library for CO-RE program loading and event streaming
  • YAML-based recipes for sensors (data collectors), actions (automated responses), and flows (investigations)
  • Tool registry that discovers and validates sensors/actions from the recipes directory
  • Parameter templating allowing dynamic sensor configuration based on runtime values

We architected it to be extensible - new investigation workflows can be added through YAML without modifying core code. Each flow step's output becomes the next step's input, with AI seeing the full investigation context.

Challenges we ran into

  • Making AI actually useful: Moving beyond a "ChatGPT wrapper" to a true autonomous agent that orchestrates tools, preserves context across steps, and executes goal-directed investigations
  • eBPF reliability across kernels: Implementing CO-RE (Compile Once, Run Everywhere) to work on different kernel versions without recompilation
  • Prompt engineering for diagnosis: Crafting prompts that help Claude correlate evidence from disparate sources (TCP stats + packet drops + DNS tests) to reach accurate conclusions
  • Graceful degradation: Ensuring flows continue when individual sensors fail (e.g., missing /proc files on different distros)
  • Real-time data streaming: Reading eBPF ring buffer events efficiently without dropping kernel events
  • Platform compatibility: Making flows work on both Ubuntu and other Linux distributions with varying tool availability

Accomplishments that we're proud of

  • Real production debugging: Successfully diagnosed actual network issues (57% TCP failure rate due to DNS misconfiguration) in 7-16 seconds with 87-96% confidence
  • True autonomous investigation: Not just running commands - full multi-step workflows where each step builds on previous findings
  • Working eBPF integration: Loading real CO-RE programs, capturing kernel events, streaming them to userspace
  • AI correlation that works: Claude analyzing multiple data sources (TCP stats, interface status, DNS tests) and concluding "DNS problem, not network problem"
  • Extensible architecture: 12 production flows, 15 sensors across 7 domains, all added through YAML
  • Developer experience: a single command, kaictl run flow.network_latency_rootcause, executes a complex multi-step investigation

What we learned

  • Agentic AI requires orchestration: Real value comes from AI controlling tools and preserving context, not just answering questions
  • eBPF is incredibly powerful: Kernel-level visibility reveals issues traditional monitoring completely misses (threads blocked, DNS failures, connection patterns)
  • Prompt engineering matters: Well-structured prompts with clear diagnostic patterns (e.g., "if DNS fails but interface healthy → DNS problem") dramatically improve AI accuracy
  • Error handling is critical: Production systems have missing tools, unavailable /proc files, permission issues - graceful degradation is essential
  • Context accumulation wins: Each investigation step feeding the next creates exponentially more value than isolated tool execution
  • Schema validation saves time: Strong typing for sensors/flows caught bugs early and made the codebase maintainable

What's next for KAI

Immediate (v0.2 - January 2025):

  • Condition evaluation (skip steps based on results, e.g., "only flush conntrack if confidence > 90%")
  • Action execution (currently actions are logged only - make them actually execute)
  • Template variables ({{ step1.output.field }} for dynamic data passing)
  • Error handling and retry logic

Short-term (v0.3 - February 2025):

  • Incident memory system using vector database for similarity search
  • Learning from history - AI references past investigations for better diagnosis
  • More eBPF programs (lock contention, scheduler latency, heap profiling)
  • Kubernetes API backend for pod/deployment analysis

Medium-term (v0.4 - March 2025):

  • Trigger system (Prometheus alerts → auto-investigate)
  • Auto-remediation with approval workflows and safety policies
  • Parallel step execution (run multiple sensors concurrently)
  • Web UI for flow creation and incident visualization

Long-term (v1.0 - Q2 2025):

  • Multi-cluster support and distributed tracing
  • Cost optimization recommendations
  • Compliance monitoring
  • Self-learning from incident outcomes

BUILT WITH TAGS

Go
eBPF
Anthropic Claude
Linux Kernel
System Observability
Cilium
Tetragon
Network Debugging
AI Agents
Autonomous Systems
BPF CO-RE
YAML
Infrastructure Automation

"TRY IT OUT" LINKS

Link 1 Title: GitHub Repository
Link 1 URL: https://github.com/sameehj/kai

Link 2 Title: Documentation
Link 2 URL: https://sameehj.github.io/kai/

VIDEO DEMO LINK

https://youtu.be/rNhxZZqHf9A?si=w5EN3B8O-qrFohpi
