Inspiration
As AI-powered threats accelerate and real-world breaches continue to rise, organizations of all sizes face a growing challenge: security mechanisms are deployed once, but rarely tested continuously. Traditional penetration testing is expensive, slow, and difficult to scale, while skilled offensive security talent remains scarce due to the highly specialized nature of the field.
At the same time, modern development cycles move faster than periodic audits can keep up with. AutoRed was inspired by this open problem: bridging the gap between the need for continuous, intelligent security testing and the realities of compliance, cost, scale, and speed.
What it does
AutoRed is an autonomous, multi-agent penetration testing framework that continuously assesses applications and infrastructure using LLM-driven reasoning combined with real-world security tools. Instead of relying on signature-based scanners or infrequent pentests, AutoRed orchestrates specialized agents to perform reconnaissance and information gathering, enumeration, vulnerability analysis, and controlled exploitation in a policy-driven, sandboxed environment.
AutoRed can test live web applications and codebases for both known and previously unknown vulnerabilities, including SQL injection, XSS, LFI/RFI, missing security headers, authentication flaws, weak TLS Cipher suites and more. It produces audit-ready reports (HTML, PDF, JSON) with actionable remediation guidance and supports scheduled or CI/CD-style execution to continuously improve an organization’s security posture.
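One of the checks listed above, detecting missing security headers, can be sketched as a small pure function. This is an illustrative example, not AutoRed's actual code; the header list and function name are assumptions:

```python
# Hypothetical sketch of a missing-security-headers check, the kind of
# deterministic test an AutoRed agent might run against a live response.
# The header list and function name are illustrative, not AutoRed's API.

EXPECTED_HEADERS = [
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
]

def missing_security_headers(headers: dict) -> list:
    """Return the expected security headers absent from a response header map."""
    present = {k.lower() for k in headers}  # header names are case-insensitive
    return [h for h in EXPECTED_HEADERS if h.lower() not in present]

# Example: a response that only sets HSTS and nosniff
finding = missing_security_headers({
    "Strict-Transport-Security": "max-age=63072000",
    "X-Content-Type-Options": "nosniff",
})
```

A finding like this would feed into the structured JSON report alongside remediation guidance (e.g. which headers to add and why).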
How we built it
AutoRed was built around a LangGraph-based orchestration layer that coordinates multiple specialized agents, each responsible for a specific phase of the penetration testing lifecycle. Large language models are used for reasoning, prioritization, and adaptive strategy selection, while all actual testing is performed by established security tools inside a hardened Docker sandbox, guided intelligently and adaptively by the orchestration agent. This enables fast, accurate, and dynamic testing at a cadence that periodic manual pentests cannot match.
Key architectural components include:
- A Supervisor agent that routes tasks and manages workflow state
- A ThreatModeler agent using STRIDE/DREAD to prioritize attack paths
- Reconnaissance, fuzzing, vulnerability scanning, and OSINT agents
- A policy engine enforcing scope validation, rate limits, and safety guardrails
- Multi-model support (Claude, GPT, Gemini) routed by task type
- Automated report generation with structured JSON outputs for easy integration
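The Supervisor's routing and the multi-model support described above can be sketched as a simple dispatch table. This is a minimal illustration of the idea, not AutoRed's actual LangGraph graph; the agent names, phase keys, and model routes are assumptions:

```python
# Illustrative sketch (assumed names, not AutoRed's implementation) of a
# Supervisor-style router: map each pentest phase to a specialist agent,
# and each task type to an LLM backend (Claude, GPT, Gemini).

PHASE_AGENTS = {
    "recon": "ReconAgent",
    "threat_model": "ThreatModelerAgent",
    "fuzzing": "FuzzingAgent",
    "vuln_scan": "VulnScanAgent",
    "osint": "OSINTAgent",
}

MODEL_ROUTES = {
    "reasoning": "claude",        # strategy selection, hypothesis chaining
    "summarization": "gpt",       # report drafting
    "classification": "gemini",   # finding triage
}

def route(phase: str, task_type: str = "reasoning") -> tuple:
    """Pick the specialist agent for a phase and the LLM for the task type."""
    agent = PHASE_AGENTS.get(phase)
    if agent is None:
        raise ValueError("phase out of scope: " + phase)
    return agent, MODEL_ROUTES.get(task_type, "claude")
```

In a real LangGraph setup, this dispatch logic would live in a supervisor node whose conditional edges send workflow state to the chosen agent node.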
Challenges we ran into
- We initially struggled to get the agents to improve their own strategies over time. We solved this with an "evolutionary" technique: prompt-injection techniques combined with reinforcement loops that let the system refine its attack methodology as it accumulates more gathered information.
- Getting the frontend and backend to communicate properly, so that scans could be launched and configuration edited from the dashboard while the command-line tool runs in the background.
Accomplishments that we're proud of
- This is a topic Evan Pardon (supasuge) has researched for quite some time. As the only member with offensive security knowledge and experience, the project's complexity was initially difficult for the group to overcome: the prerequisite knowledge is too deep to explain and absorb quickly. Even so, we delegated tasks to match our individual strengths and worked cohesively.
What we learned
- Combining LLM-driven reasoning with deterministic security tooling produces better results than either alone: the model can triage, add context, and chain hypotheses, while traditional tools deliver repeatable, verifiable signal.
- The LLM is most valuable as an orchestrator: it reduces noise, ranks findings by likelihood and impact, and connects "small" indicators into actionable attack paths, without replacing ground-truth scanners and system telemetry. Paired with well-crafted prompting strategies, these agents can rival, and on some tasks exceed, a skilled and experienced penetration tester.
- For autonomous or semi-autonomous security workflows, policy, safety controls, and auditability aren’t optional. Every action needs to be scoped, logged, explainable, and reversible to be usable in real environments and provide value.
- Threat modeling is essential for an effective orchestration workflow and planning/attack methodology.
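The lesson above about scoping, logging, and auditability can be sketched as a minimal policy gate that sits in front of every tool invocation. The names and structure here are assumptions for illustration, not AutoRed's policy engine:

```python
from urllib.parse import urlparse

# Minimal policy-gate sketch (assumed names, not AutoRed's implementation):
# every proposed action is checked against the engagement scope and logged
# before any security tool is allowed to run.

ALLOWED_HOSTS = {"app.example.com"}   # explicit, pre-agreed engagement scope
AUDIT_LOG = []                        # append-only trail of every decision

def authorize(action: str, target_url: str) -> bool:
    """Allow an action only if the target host is in scope; log the decision."""
    host = urlparse(target_url).hostname or ""
    allowed = host in ALLOWED_HOSTS
    AUDIT_LOG.append({"action": action, "target": host, "allowed": allowed})
    return allowed
```

A production version would add rate limiting and reversibility metadata per action, but even this skeleton makes every agent step scoped, logged, and explainable.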
What's next for AutoRed
- Active Directory agents: add environment-aware modules for AD discovery and validation (identity, GPO, delegation, ACLs, Kerberos misconfigurations), producing evidence-backed findings with clear remediation steps.
- Linux agents: expand host-level assessment capabilities (service exposure, auth weaknesses, privilege boundary issues, configuration drift), again emphasizing reproducibility and operator review.
- More capabilities + better reporting: continuously grow the toolbelt where it adds unique value, and output structured, reproducible reports (steps to verify, impact, affected assets, and prioritized remediation) to support analyst triage and patching.