Inspiration
DevOps teams can spend two or more hours manually analyzing logs and fixing a single incident. We wanted autonomous AI agents that collaborate to solve these problems in seconds rather than hours, reducing downtime and letting engineers focus on innovation instead of firefighting.
What it does
AgentOps is a multi-agent DevOps platform that autonomously resolves infrastructure issues using NVIDIA NIMs on Amazon SageMaker.
Three Autonomous Agents:
- Task Analyzer: Breaks down complex requests into actionable subtasks using NVIDIA LLaMA 3.1-Nemotron-Nano-8B
- Retrieval Agent: Searches past incidents using NVIDIA NV-Embed-v2 embeddings for context
- Task Executor: Executes solutions autonomously with confidence scoring and risk assessment
Results: about 99.4% time reduction (2 hours → 45 seconds) with 85-95% confidence scores.
How we built it
- NVIDIA NIMs: Deployed LLaMA 3.1-Nemotron-Nano-8B and NV-Embed-v2 as SageMaker endpoints
- AWS Lambda: 3 serverless functions for agent logic with public HTTPS endpoints
- DynamoDB: Stores tasks, agent memory, and knowledge base
- RAG Implementation: Retrieval Agent searches 10 pre-seeded past incidents to provide AI with historical context
- Web Interface: Static site on GitHub Pages calling Lambda functions directly
- Multi-Agent Communication: Agents collaborate via HTTP, sharing context through DynamoDB
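The retrieval step can be illustrated with a minimal sketch, assuming incidents are stored alongside precomputed embedding vectors and ranked by cosine similarity. The function names, the incident fields, and the toy 3-dimensional vectors are hypothetical stand-ins; real NV-Embed-v2 embeddings would come from the SageMaker endpoint, and the project's actual ranking logic may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def top_k_incidents(query_embedding, incidents, k=3):
    """Return the k incidents whose embeddings are closest to the query."""
    scored = [
        (cosine_similarity(query_embedding, incident["embedding"]), incident)
        for incident in incidents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [incident for _, incident in scored[:k]]


# Hypothetical pre-seeded incidents; toy vectors replace real embeddings.
incidents = [
    {"id": "inc-1", "summary": "High CPU on web tier", "embedding": [0.9, 0.1, 0.0]},
    {"id": "inc-2", "summary": "Connection pool exhausted", "embedding": [0.0, 0.9, 0.1]},
    {"id": "inc-3", "summary": "Disk full on log volume", "embedding": [0.1, 0.0, 0.9]},
]

query = [0.8, 0.2, 0.0]  # stand-in for the embedded user request
matches = top_k_incidents(query, incidents, k=2)
print([m["id"] for m in matches])
```

With only 10 incidents, a brute-force scan like this is likely sufficient; a vector index would matter only at a much larger scale.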
Challenges we ran into
- SageMaker IAM permissions: Fixed S3 access for model artifacts
- DynamoDB float errors: Implemented Decimal conversion for all numeric data
- CORS conflicts: Resolved duplicate headers between Lambda Function URL and code
- Lambda handler mismatch: Corrected import paths causing 502 errors
- Missing dependencies: Packaged requests library with Lambda deployment
- Model compliance: Switched from generic LLaMA to required Nemotron Nano model
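The DynamoDB float issue arises because the boto3 DynamoDB resource rejects Python `float` values and expects `Decimal` instead. One way to handle it is a recursive converter applied before every write. The sketch below is an illustration rather than the project's actual code; the helper name `to_dynamodb_safe` and the item fields are hypothetical.

```python
from decimal import Decimal


def to_dynamodb_safe(value):
    """Recursively replace floats with Decimal so boto3 can serialize the item."""
    if isinstance(value, float):
        # Convert via str() so 0.92 becomes Decimal("0.92"),
        # not the long binary expansion of the float.
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamodb_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamodb_safe(item) for item in value]
    return value  # str, int, bool, None, Decimal pass through unchanged


# Hypothetical task record with nested float fields.
task = {
    "task_id": "task-001",
    "confidence": 0.92,
    "retries": 2,
    "steps": [{"action": "restart service", "risk": 0.15}],
}

safe_task = to_dynamodb_safe(task)
print(safe_task)
```

The converted dictionary could then be passed as the `Item` argument of a table's `put_item` call.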
Accomplishments that we're proud of
- Built a true multi-agent system where agents actually collaborate, not just a chatbot
- Successfully implemented Retrieval-Augmented Generation (RAG) with vector search
- Deployed both NVIDIA NIMs on SageMaker, meeting all hackathon requirements
- Created a knowledge base with 10 past incidents that agents use for context
- Achieved real-time AI responses with unique analysis for every prompt
- Everything is production-ready and publicly accessible
What we learned
- How to deploy NVIDIA NIMs as SageMaker endpoints with custom inference scripts
- Building multi-agent systems with HTTP-based agent communication
- Implementing RAG using NVIDIA embeddings for semantic similarity search
- AWS serverless architecture with Lambda Function URLs and DynamoDB
- Debugging complex distributed systems with CloudWatch logs
- Rapid iteration and prioritization under tight time constraints
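As a rough illustration of the Lambda Function URL pattern, here is a minimal handler sketch. It assumes CORS is configured on the Function URL itself, so the code deliberately sets no `Access-Control-*` headers, which is one way to avoid the duplicate-header conflict. The `prompt` field and the placeholder response are hypothetical; the real agents would call the SageMaker endpoints and DynamoDB at the marked step.

```python
import base64
import json

JSON_HEADERS = {"Content-Type": "application/json"}


def _response(status_code, payload):
    """Build a Function URL response; CORS headers are left to the URL config."""
    return {
        "statusCode": status_code,
        "headers": dict(JSON_HEADERS),
        "body": json.dumps(payload),
    }


def lambda_handler(event, context):
    """Parse a Function URL request body and return a JSON response."""
    raw_body = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        raw_body = base64.b64decode(raw_body).decode("utf-8")

    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return _response(400, {"error": "request body is not valid JSON"})

    if not isinstance(payload, dict) or not payload.get("prompt"):
        return _response(400, {"error": "missing 'prompt' field"})

    # Placeholder: agent logic (model calls, DynamoDB reads/writes) goes here.
    result = {"received_prompt": payload["prompt"], "status": "accepted"}
    return _response(200, result)
```

The handler can be exercised locally by passing a dictionary shaped like a Function URL event, for example `lambda_handler({"body": '{"prompt": "High CPU on web tier"}'}, None)`.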
What's next for AgentOps - Multi-Agent DevOps Platform
Short-term:
- Integrate Rerank NIM for better retrieval accuracy
- Connect to real CloudWatch logs for live incident detection
- Implement safe execution of AWS API calls (auto-scaling, service restarts)
Long-term:
- Self-learning system that stores successful resolutions automatically
- Predictive maintenance to prevent issues before they cause outages
- Enterprise features: multi-tenancy, RBAC, Slack/Teams integration
- Cross-service orchestration for complex distributed system issues
Built With
- amazon-dynamodb
- aws-lambda
- boto3
- github
- javascript
- lambda-function-urls
- nvidia
- nvidia-llama
- nvidia-nim
- python
- sagemaker