Inframind

The attached image represents the Incident Dashboard, which showcases all the incidents raised within the system.
A snapshot of the Grafana alert currently firing, triggered due to an increase in error log count over time for our application.
Correlation Analysis: Displays the logs fetched and correlated with the alert to identify related events, patterns, and contributing factors.
Metric Insight: Displays metrics fetched by the Correlation Agent, showing system performance, trends, and anomalies in real time.
The Correlation Summary showcases the aggregated results derived from correlated logs and metrics. Highlights relationships between system
Root Cause Analysis: Identifies the exact cause of the incident by correlating logs and metrics to pinpoint failures or performance issues.
Remediation Actions: Lists the corrective steps to resolve the incident, restore normal operations, and prevent similar issues in the future.
Workflow overview: an alert triggers incident creation; correlation fetches logs and metrics; RCA finds the cause; remediation suggests fixes.

About the Project

Inspiration

At our company, we operate a small cluster where the infrastructure and application layers are tightly coupled. As an application developer, I frequently get pulled into on-call rotations when applications go down, even when the root cause lies in the infrastructure rather than in application code. Spending hours at 3 AM correlating logs with metrics, trying to work out whether the problem was a memory leak in my code or resource exhaustion at the infrastructure level, became exhausting. I realized that much of this manual detective work could be automated. If a system could instantly correlate infrastructure metrics with application logs and tell me "this is a resource issue, not your code," it would save countless hours and reduce the on-call burden for developers like me. That frustration sparked the idea for Inframind: an intelligent observability layer that does the correlation work automatically.

What It Does

Inframind is an AI-powered observability platform that integrates seamlessly with Grafana to transform how teams respond to incidents. When an alert fires (like error_rate_per_minute), Inframind automatically:

Correlates logs and metrics to identify relationships between system behavior and failures
Performs root cause analysis by analyzing patterns across multiple data sources
Generates actionable remediation steps with both immediate fixes and long-term monitoring recommendations
Provides comprehensive incident summaries that explain what happened, why it happened, and how to fix it
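
The correlation step listed above can be illustrated with a minimal sketch. It assumes a simple time-window approach: gather error logs and metric samples that fall within a few minutes of the alert. The `LogEntry` and `MetricSample` types, the `correlate` function, and the five-minute window are hypothetical names and values chosen for illustration; they are not taken from Inframind's code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    timestamp: datetime
    level: str
    message: str


@dataclass
class MetricSample:
    timestamp: datetime
    name: str
    value: float


def correlate(alert_time, logs, metrics, window=timedelta(minutes=5)):
    """Collect error logs and metric peaks within `window` of the alert time."""
    start, end = alert_time - window, alert_time + window
    error_logs = [
        entry for entry in logs
        if start <= entry.timestamp <= end and entry.level == "ERROR"
    ]
    # Track the highest value seen for each metric inside the window.
    peaks = {}
    for sample in metrics:
        if start <= sample.timestamp <= end:
            peaks[sample.name] = max(peaks.get(sample.name, sample.value), sample.value)
    return {
        "error_count": len(error_logs),
        "error_logs": error_logs,
        "metric_peaks": peaks,
    }
```

A summary like this (error count plus metric peaks around the alert) is the kind of input a root cause step could reason over, for example to separate a memory spike from an application bug.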

How We Built It

We built Inframind using a modern tech stack focused on real-time data processing and AI-driven analysis:

Backend: Python-based correlation engine that ingests logs and metrics from various sources
Integration Layer: Grafana MCP servers to pull alert data and push insights back to dashboards
AI/ML: LangGraph-based agents that orchestrate the complete workflow
LLM: The llama-3.1-nemotron-nano-8B-v1 reasoning model, deployed as an NVIDIA NIM inference microservice on EKS, together with the llama-3.2-nv-embedqa-1b-v2 embedding model for historical analysis of incidents
UI/Frontend: React-based interface for displaying correlation summaries, root cause analysis, and remediation actions
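
To show how an embedding model can support historical analysis of incidents, here is a minimal sketch of similarity search over past incidents. It assumes each incident summary has already been converted to a vector by the embedding model; the function names, incident IDs, and the choice of cosine similarity are illustrative assumptions, not a description of how Inframind stores or searches its history.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_incidents(query_vec, history, top_k=2):
    """Rank past incidents by similarity to the current one.

    `history` is a list of (incident_id, embedding) pairs (hypothetical shape).
    """
    scored = [
        (incident_id, cosine_similarity(query_vec, vec))
        for incident_id, vec in history
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The closest past incidents, along with how they were resolved, could then be passed to the LLM as extra context for the current analysis.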

The architecture follows a pipeline approach: Alert Trigger → Data Collection → Correlation Analysis → Root Cause Identification → Remediation Generation → Dashboard Display.
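
The pipeline above can be sketched as a chain of stages that each take and return a shared state dictionary. This is a simplified illustration in plain Python rather than LangGraph; every function name, the hard-coded sample data, and the 90% memory threshold are hypothetical stand-ins for the real data collection and LLM-driven analysis.

```python
def collect_data(state):
    # Placeholder data: the real system fetches logs and metrics through Grafana MCP.
    return {**state, "logs": ["ERROR out of memory"], "metrics": {"memory_used_pct": 97.0}}


def run_correlation(state):
    error_logs = [line for line in state["logs"] if "ERROR" in line]
    return {**state, "correlation": {"error_logs": error_logs, "metrics": state["metrics"]}}


def find_root_cause(state):
    # Hypothetical rule standing in for the LLM-driven root cause analysis.
    high_memory = state["correlation"]["metrics"].get("memory_used_pct", 0) > 90
    return {**state, "root_cause": "resource_exhaustion" if high_memory else "unknown"}


def suggest_remediation(state):
    steps = {"resource_exhaustion": ["Increase memory limits", "Add a memory usage alert"]}
    fallback = ["Escalate for manual review"]
    return {**state, "remediation": steps.get(state["root_cause"], fallback)}


# Stages run in order; each one adds its result to the shared state.
PIPELINE = [collect_data, run_correlation, find_root_cause, suggest_remediation]


def handle_alert(alert):
    state = {"alert": alert}
    for stage in PIPELINE:
        state = stage(state)
    return state
```

Calling `handle_alert({"name": "error_rate_per_minute"})` returns a state holding the correlation, root cause, and remediation entries that a dashboard could then display.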

Challenges We Faced

1. Real-time Correlation at Scale
Processing and correlating logs with metrics in real-time was computationally intensive. We had to optimize our algorithms and implement smart caching strategies to ensure sub-minute correlation times.
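
The caching strategy is not detailed here, so the following is only one possible approach: a small time-to-live (TTL) cache that reuses a recent query result instead of fetching it again. The `TTLCache` class and its method names are hypothetical and shown purely for illustration.

```python
import time


class TTLCache:
    """Cache values for a fixed number of seconds before recomputing them."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh: skip the expensive call
        value = compute()
        self._entries[key] = (now, value)
        return value
```

With a cache like this, several alerts firing within the same short interval could share one set of fetched metrics, keyed for instance by metric name and time bucket.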

2. Grafana Integration Complexity
Building a seamless integration that felt native to Grafana's ecosystem required a deep understanding of its plugin architecture and API limitations. We iterated multiple times to achieve a smooth user experience. We relied on the Model Context Protocol (MCP) to fetch logs, metrics, and alert rules in our application.

What We Learned

The importance of context in observability—raw metrics and logs are valuable, but correlations tell the real story
AI-assisted operations can dramatically reduce MTTR (Mean Time To Resolution) when implemented thoughtfully
User experience matters in DevOps tools—even powerful features fail if they're not intuitive during high-pressure incidents
Building integrations with established platforms like Grafana provides immediate value and adoption potential
Reducing on-call burden for developers requires intelligent routing of incidents based on actual root causes

What's Next for Inframind

ML-Powered Anomaly Detection: Integrate machine learning classifier models directly on logs and metrics for real-time anomaly detection, moving beyond threshold-based alerting to intelligent pattern recognition that adapts to system behavior
Expanded Monitoring Integrations: Extend platform support to include SolarWinds and AWS CloudWatch, enabling organizations with diverse monitoring stacks to benefit from unified correlation analysis
Kubernetes MCP Integration: Integrate with a Kubernetes Model Context Protocol (MCP) server to execute basic diagnostic and remediation commands automatically, such as pod restarts, resource scaling, and log collection, directly from the remediation interface
Predictive Analytics: Use historical correlation patterns to predict incidents before they occur, shifting from reactive to proactive incident management
Automated Remediation Execution: Move beyond suggestions to automated fix implementation with human approval workflows, reducing manual intervention time
Collaborative Incident Management: Add team communication features for coordinated incident response, including runbook integration and post-mortem generation
Advanced AI Models: Incorporate more advanced LLMs for more nuanced root cause explanations and natural language queries like "why did the payment service fail last night?"
Smart On-Call Routing: Automatically route incidents to infrastructure or application teams based on root cause analysis, reducing unnecessary developer on-call burden

Built With

amazon-web-services
api
databases
ecr
eks
iam
kubernetes
langgraph
llm
rag

Updates

Rudra Aggarwal started this project — Nov 04, 2025 10:07 AM EST
