AgentOps - Multi-Agent DevOps Platform

Inspiration

DevOps teams spend 2+ hours manually analyzing logs and fixing incidents. We wanted to create autonomous AI agents that could collaborate to solve these problems in seconds, not hours—reducing downtime and letting engineers focus on innovation instead of firefighting.

What it does

AgentOps is a multi-agent DevOps platform that autonomously resolves infrastructure issues using NVIDIA NIMs on Amazon SageMaker.

Three Autonomous Agents:

Task Analyzer: Breaks down complex requests into actionable subtasks using NVIDIA LLaMA 3.1-Nemotron-Nano-8B Retrieval Agent: Searches past incidents using NVIDIA NV-Embed-v2 embeddings for context Task Executor: Executes solutions autonomously with confidence scoring and risk assessment Results: 99.7% time reduction (2 hours → 45 seconds) with 85-95% confidence scores.

How we built it

NVIDIA NIMs: Deployed LLaMA 3.1-Nemotron-Nano-8B and NV-Embed-v2 as SageMaker endpoints AWS Lambda: 3 serverless functions for agent logic with public HTTPS endpoints DynamoDB: Stores tasks, agent memory, and knowledge base RAG Implementation: Retrieval Agent searches 10 pre-seeded past incidents to provide AI with historical context Web Interface: Static site on GitHub Pages calling Lambda functions directly Multi-Agent Communication: Agents collaborate via HTTP, sharing context through DynamoDB

Challenges we ran into

SageMaker IAM permissions - Fixed S3 access for model artifacts DynamoDB float errors - Implemented Decimal conversion for all numeric data CORS conflicts - Resolved duplicate headers between Lambda Function URL and code Lambda handler mismatch - Corrected import paths causing 502 errors Missing dependencies - Packaged requests library with Lambda deployment Model compliance - Switched from generic LLaMA to required Nemotron Nano model

Accomplishments that we're proud of

Built a true multi-agent system where agents actually collaborate, not just a chatbot Successfully implemented Retrieval-Augmented Generation (RAG) with vector search Deployed both NVIDIA NIMs on SageMaker meeting all hackathon requirements Created a knowledge base with 10 past incidents that agents use for context Achieved real-time AI responses with unique analysis for every prompt Everything is production-ready and publicly accessible

What we learned

How to deploy NVIDIA NIMs as SageMaker endpoints with custom inference scripts Building multi-agent systems with HTTP-based agent communication Implementing RAG using NVIDIA embeddings for semantic similarity search AWS serverless architecture with Lambda Function URLs and DynamoDB Debugging complex distributed systems with CloudWatch logs Rapid iteration and prioritization under tight time constraints

What's next for AgentOps - Multi-Agent DevOps Platform

Short-term:

Integrate Rerank NIM for better retrieval accuracy Connect to real CloudWatch logs for live incident detection Implement safe execution of AWS API calls (auto-scaling, service restarts)

Long-term:

Self-learning system that stores successful resolutions automatically Predictive maintenance to prevent issues before they cause outages Enterprise features: multi-tenancy, RBAC, Slack/Teams integration Cross-service orchestration for complex distributed system issues

Built With

amazon-dynamodb
aws-lambda
boto3
github
javascript
lambda-function-urls
nvidia
nvidia-llama
nvidia-nim
python
sagemaker

Updates

Rakshith UK started this project — Nov 03, 2025 11:58 AM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.