GitHub Deep-Audit

A GitHub repository auditor that scans an entire project, written in any programming language, and generates a structured audit report using Google's Gemini 3 Pro model.

Overview

GitHub Deep-Audit accepts a public GitHub repository URL and an optional project description, then:

1. Clones the repository (shallow clone, depth=1)
2. Reads all eligible source files, filtering out binaries, lock files, and build artifacts
3. Validates whether the project has a sufficient description (via AI)
4. Audits the entire codebase for security vulnerabilities, API key leaks, backdoors, architecture issues, code quality problems, and more (via AI)
5. Validates the audit report structure for correctness
6. Returns a structured JSON audit report with an overall score, categorized issues, architecture review, and prevention guide

The entire pipeline runs asynchronously. The client submits an audit request, receives a job ID, and polls for results.

Tech Stack

| Layer | Technology |
|---|---|
| Backend | Spring Boot 4.0.2, Java 25 |
| AI | Google Gemini 3 Pro via LangChain4j 1.11.0 |
| Git | Eclipse JGit 7.1.0 |
| Build | Maven (wrapper included) |
| Frontend | Angular 20, Tailwind V4 (separate repository) |

Architecture

System Flow

Frontend (Angular) ⇔ Backend (Spring Boot) ⇔ Gemini 3 Pro API

AI Agent Orchestration

The backend orchestrates three AI agents that run sequentially:

| Agent | Input | Purpose |
|---|---|---|
| DescriptionValidatorAgent | README + project description | Checks if the project is well-described enough for a meaningful audit |
| AuditAgent | Full project files + README + description | Performs the deep code audit and generates the structured report |
| StructureValidatorAgent | Raw JSON from AuditAgent | Validates JSON structure; if invalid, asks Gemini to fix it |

Key design: Only the AuditAgent receives the full source code. The DescriptionValidatorAgent only sees the README + description. The StructureValidatorAgent only sees the JSON output.
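
The input boundaries described above can be sketched in plain Java. This is a minimal illustration, not the project's code: `AgentPipelineSketch`, `ProjectSnapshot`, and the three interfaces are hypothetical stand-ins for the real agents, which call Gemini through LangChain4j.

```java
import java.util.List;

// Hypothetical stand-ins for the three agents; names and signatures are assumptions.
final class AgentPipelineSketch {

    /** What the file-reading stage hands to the agents (illustrative shape). */
    record ProjectSnapshot(String readme, String description, List<String> sourceFiles) {}

    interface DescriptionValidator {
        boolean isWellDescribed(String readme, String description);
    }

    interface Auditor {
        String audit(ProjectSnapshot snapshot); // returns raw JSON
    }

    interface StructureValidator {
        String validate(String rawJson); // returns structurally valid JSON
    }

    static String run(ProjectSnapshot snapshot,
                      DescriptionValidator agent1,
                      Auditor agent2,
                      StructureValidator agent3) {
        // Agent 1 sees only the README and description, never the source files.
        if (!agent1.isWellDescribed(snapshot.readme(), snapshot.description())) {
            throw new IllegalStateException("Insufficient description");
        }
        // Agent 2 is the only one that receives the full snapshot.
        String rawJson = agent2.audit(snapshot);
        // Agent 3 sees only the JSON produced by agent 2.
        return agent3.validate(rawJson);
    }
}
```

Passing each agent only the arguments it needs makes the boundary visible in the method signatures rather than relying on convention.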

Async Job Pipeline

POST /api/v1/audit
    │
    ▼
AuditJobManager.create()              → QUEUED
    │
    ▼  (@Async on thread pool)
GitService.validateRepo()             → CLONING
GitService.cloneRepo()
    │
    ▼
FileReaderService.readProject()        → READING
    │
    ▼
DescriptionValidatorAgent.validate()   → VALIDATING_DESCRIPTION
    │
    ▼
AuditAgent.audit()                     → AUDITING
    │
    ▼
StructureValidatorAgent.validateAndParse() → VALIDATING_STRUCTURE
    │
    ▼
AuditJobManager.complete(report)       → COMPLETED

The client polls GET /api/v1/audit/{jobId} to track progress through each stage.
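
A rough sketch of how an in-memory job store might track these stages, assuming immutable job records in a `ConcurrentHashMap`. `JobStoreSketch` and its methods are hypothetical; the project's `AuditJobManager` and `AuditJob` may be structured differently.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not the project's AuditJobManager.
final class JobStoreSketch {

    enum Stage {
        QUEUED, CLONING, READING, VALIDATING_DESCRIPTION, AUDITING,
        VALIDATING_STRUCTURE, COMPLETED, FAILED;

        boolean isTerminal() {
            return this == COMPLETED || this == FAILED;
        }
    }

    record Job(UUID id, Stage stage, String message) {}

    private final Map<UUID, Job> jobs = new ConcurrentHashMap<>();

    UUID create() {
        UUID id = UUID.randomUUID();
        jobs.put(id, new Job(id, Stage.QUEUED, "Audit job is queued for processing"));
        return id;
    }

    void advance(UUID id, Stage next, String message) {
        // computeIfPresent swaps the immutable Job atomically; finished jobs are left untouched.
        jobs.computeIfPresent(id, (key, job) ->
                job.stage().isTerminal() ? job : new Job(key, next, message));
    }

    Job get(UUID id) {
        Job job = jobs.get(id);
        if (job == null) {
            throw new IllegalArgumentException("Audit job not found: " + id);
        }
        return job;
    }
}
```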

Project Structure

src/main/java/com/junaidsultan/github_deep_audit/
├── GitHubDeepAuditApplication.java       # Entry point (@EnableAsync, @EnableScheduling)
│
├── config/
│   ├── GeminiConfig.java                 # ChatModel bean (Gemini 3 Pro via LangChain4j)
│   ├── GeminiProperties.java             # Gemini config properties (api-key, model, temp)
│   ├── AuditProperties.java              # Audit config properties (max size, temp dir, TTL)
│   ├── AsyncConfig.java                  # Thread pool executor for async audit jobs
│   └── WebConfig.java                    # CORS configuration + GitHub RestClient bean
│
├── controller/
│   └── AuditController.java              # REST API: POST /audit, GET /audit/{jobId}
│
├── service/
│   ├── AuditOrchestratorService.java     # @Async orchestrator — runs the full pipeline
│   ├── GitService.java                   # Validates repo (GitHub API) + shallow clones (JGit)
│   ├── FileReaderService.java            # Walks repo tree, reads files, finds README
│   ├── AuditJobManager.java              # In-memory job store (ConcurrentHashMap)
│   └── agent/
│       ├── DescriptionValidatorAgent.java # Agent 1: validates project description
│       ├── AuditAgent.java                # Agent 2: deep audit of full codebase
│       └── StructureValidatorAgent.java   # Agent 3: validates/fixes JSON structure
│
├── model/
│   ├── request/
│   │   └── AuditRequest.java             # { repoUrl, projectDescription }
│   ├── response/
│   │   ├── AuditReport.java              # Full audit report
│   │   ├── IssueSummary.java             # Issue counts by severity
│   │   ├── AuditIssue.java               # Individual issue details
│   │   ├── ArchitectureReview.java        # Architecture strengths/weaknesses
│   │   ├── PreventionGuide.java           # Prevention recommendations
│   │   ├── AuditJobResponse.java          # Polling response (status + report)
│   │   └── ErrorResponse.java             # Error response format
│   ├── job/
│   │   └── AuditJob.java                 # Job state object (thread-safe)
│   ├── agent/
│   │   └── DescriptionValidation.java    # Agent 1 response { wellDescribed, reason }
│   └── enums/
│       ├── AuditStage.java               # QUEUED → CLONING → ... → COMPLETED/FAILED
│       ├── IssueSeverity.java            # CRITICAL, HIGH, MEDIUM, LOW
│       └── IssueCategory.java            # SECURITY, VULNERABILITY, API_LEAK, etc.
│
├── exception/
│   ├── GlobalExceptionHandler.java       # @RestControllerAdvice — maps exceptions to HTTP
│   ├── RepositoryTooLargeException.java
│   ├── RepositoryCloneException.java
│   ├── InsufficientDescriptionException.java
│   ├── AuditJobNotFoundException.java
│   └── AuditException.java
│
├── cleanup/
│   └── CleanupScheduler.java            # @Scheduled — purges expired jobs + temp dirs
│
└── util/
    ├── FileFilterUtil.java               # Filters out binaries, lock files, build dirs
    └── GitHubUrlParser.java              # Parses owner/repo from GitHub URLs

Prerequisites

Java 25 (with preview features)
Gemini API Key from Google AI Studio
Internet access (to hit GitHub API and clone repos)

Maven does not need to be installed globally; the project includes a Maven wrapper (mvnw.cmd on Windows, ./mvnw on Linux/macOS).

Setup & Installation

Clone the project

git clone https://github.com/your-username/github-deep-audit.git
cd github-deep-audit

Set the Gemini API key as an environment variable

Windows (CMD):

set GEMINI_API_KEY=your-gemini-api-key-here

Windows (PowerShell):

$env:GEMINI_API_KEY = "your-gemini-api-key-here"

Linux/macOS:

export GEMINI_API_KEY=your-gemini-api-key-here

Build the project (on Linux/macOS, use ./mvnw in place of mvnw.cmd)

```
mvnw.cmd compile
```

Configuration

All configuration is in src/main/resources/application.yaml:

gemini:
  api-key: ${GEMINI_API_KEY}        # Required — set via environment variable
  model: gemini-3-pro               # Gemini model to use
  temperature: 0.2                  # Lower = more deterministic
  max-output-tokens: 65536          # Max tokens in Gemini response

audit:
  max-repo-size-mb: 50              # Reject repos larger than this
  temp-dir: ${java.io.tmpdir}/github-deep-audit  # Where repos are cloned
  job-ttl-minutes: 30               # Jobs older than this are cleaned up
  cleanup-interval-minutes: 5       # How often cleanup runs

server:
  port: 8080

Running the Application

mvnw.cmd spring-boot:run

On Linux/macOS, run ./mvnw spring-boot:run instead. The server starts on http://localhost:8080 and logs:

Started GitHubDeepAuditApplication in X.XXX seconds

API Documentation

Base URL: http://localhost:8080

1. Start an Audit

POST /api/v1/audit

Submits a GitHub repository for auditing. The audit runs asynchronously — this endpoint returns immediately with a job ID.

Request

| Field | Type | Required | Description |
|---|---|---|---|
| `repoUrl` | `string` | Yes | Public GitHub repository URL (e.g., `https://github.com/owner/repo`) |
| `projectDescription` | `string` | No | Optional description of the project to help the AI understand its purpose |

{
  "repoUrl": "https://github.com/owner/repo",
  "projectDescription": "A REST API for managing user authentication using JWT tokens"
}

Response — `202 Accepted`

{
  "jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "QUEUED",
  "message": "Audit job created and queued for processing",
  "report": null
}

Validation Errors — `400 Bad Request`

{
  "status": 400,
  "error": "Validation Error",
  "message": "repoUrl: Must be a valid GitHub repository URL (e.g., https://github.com/owner/repo)",
  "timestamp": "2026-02-08T18:00:00Z"
}
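
The kind of URL check behind this validation error might look like the sketch below. The accepted pattern is an assumption (HTTPS only, optional `.git` suffix or trailing slash), and `GitHubUrlSketch` is a hypothetical name, not the project's `GitHubUrlParser`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real parser may accept a different set of URL forms.
final class GitHubUrlSketch {

    record RepoRef(String owner, String repo) {}

    // Matches https://github.com/owner/repo with an optional ".git" suffix or trailing slash.
    private static final Pattern REPO_URL = Pattern.compile(
            "^https://github\\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+?)(?:\\.git)?/?$");

    static Optional<RepoRef> parse(String url) {
        if (url == null) {
            return Optional.empty();
        }
        Matcher m = REPO_URL.matcher(url.trim());
        return m.matches()
                ? Optional.of(new RepoRef(m.group(1), m.group(2)))
                : Optional.empty();
    }
}
```

An empty result would correspond to the `400` response shown above; a present result gives the owner and repository name needed for the GitHub API lookup.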

cURL Example

curl -X POST http://localhost:8080/api/v1/audit \
  -H "Content-Type: application/json" \
  -d '{
    "repoUrl": "https://github.com/kelseyhightower/nocode",
    "projectDescription": "No code application"
  }'

PowerShell Example

Invoke-RestMethod -Method POST `
  -Uri "http://localhost:8080/api/v1/audit" `
  -ContentType "application/json" `
  -Body '{"repoUrl": "https://github.com/kelseyhightower/nocode", "projectDescription": "No code application"}'

2. Poll Audit Status

GET /api/v1/audit/{jobId}

Polls the current status of an audit job. Returns the full audit report when the job completes.

Path Parameters

| Parameter | Type | Description |
|---|---|---|
| `jobId` | `UUID` | The job ID returned by the POST endpoint |

Response — `200 OK` (In Progress)

{
  "jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "AUDITING",
  "message": "Performing deep audit analysis...",
  "report": null
}

The status field progresses through these stages:

| Status | Message | Meaning |
|---|---|---|
| `QUEUED` | Audit job is queued for processing | Waiting for a thread |
| `CLONING` | Cloning repository... | Validating repo metadata + cloning |
| `READING` | Reading project files... | Walking file tree + reading source files |
| `VALIDATING_DESCRIPTION` | Validating project description... | AI checking if description is sufficient |
| `AUDITING` | Performing deep audit analysis... | AI performing the deep code audit |
| `VALIDATING_STRUCTURE` | Validating audit report structure... | Verifying JSON structure of the report |
| `COMPLETED` | Audit completed successfully | Done; report is included |
| `FAILED` | (dynamic error message) | Something went wrong |

Response — `200 OK` (Completed)

When status is COMPLETED, the report field contains the full audit report:

{
  "jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "COMPLETED",
  "message": "Audit completed successfully",
  "report": {
    "repositoryUrl": "https://github.com/owner/repo",
    "auditTimestamp": "2026-02-08T18:05:30Z",
    "projectSummary": "A REST API for user authentication...",
    "overallScore": 72,
    "issueSummary": {
      "totalIssues": 8,
      "critical": 1,
      "high": 2,
      "medium": 3,
      "low": 2
    },
    "issues": [
      {
        "id": "ISSUE-001",
        "category": "API_LEAK",
        "severity": "CRITICAL",
        "title": "Hardcoded API Key in Configuration",
        "description": "The file contains a hardcoded API key that is committed to version control...",
        "filePath": "src/config/api.js",
        "lineNumbers": "12-14",
        "codeSnippet": "const API_KEY = 'sk-abc123...'",
        "recommendation": "Move the API key to an environment variable and use a .env file...",
        "impact": "Anyone with access to the repository can extract the API key and abuse the service..."
      }
    ],
    "architectureReview": {
      "summary": "The project follows a basic MVC pattern with some deviations...",
      "strengths": [
        "Clear separation of routes and controllers",
        "Consistent error handling middleware"
      ],
      "weaknesses": [
        "No dependency injection — tightly coupled modules",
        "Business logic mixed into controller layer"
      ]
    },
    "preventionGuide": {
      "summary": "Most issues stem from inadequate secret management and missing input validation...",
      "recommendations": [
        "Use environment variables for all secrets and API keys",
        "Implement input validation on all user-facing endpoints",
        "Add a pre-commit hook to scan for leaked secrets"
      ]
    }
  }
}

Response — `422 Unprocessable Content` (Failed)

{
  "jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "FAILED",
  "message": "Repository size (120 MB) exceeds the maximum allowed size (50 MB)",
  "report": null
}
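
A size check that produces the failure message above could be sketched as follows. It assumes the size arrives in kilobytes (the unit the GitHub REST API is understood to use for a repository's `size` field) and that the limit comes from `audit.max-repo-size-mb`. The class and method names are hypothetical.

```java
// Hypothetical sketch of the size guard; not the project's GitService.
final class RepoSizeCheckSketch {

    /** Thrown when the repository is larger than the configured limit. */
    static final class TooLargeException extends RuntimeException {
        TooLargeException(String message) {
            super(message);
        }
    }

    // sizeKb: repository size in kilobytes (assumed unit of the API's size field).
    static void requireWithinLimit(long sizeKb, int maxRepoSizeMb) {
        long sizeMb = sizeKb / 1024; // integer division rounds down to whole megabytes
        if (sizeMb > maxRepoSizeMb) {
            throw new TooLargeException("Repository size (" + sizeMb
                    + " MB) exceeds the maximum allowed size (" + maxRepoSizeMb + " MB)");
        }
    }
}
```

Checking the reported size before cloning avoids downloading a repository that would be rejected anyway.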

Response — `404 Not Found`

{
  "status": 404,
  "error": "Job Not Found",
  "message": "Audit job not found: a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "timestamp": "2026-02-08T18:00:00Z"
}

cURL Example

curl http://localhost:8080/api/v1/audit/a1b2c3d4-e5f6-7890-abcd-ef1234567890
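
For callers that prefer Java over cURL, a polling loop might be sketched with the JDK's built-in `java.net.http.HttpClient`. The class name, the regex-based status extraction, and the interval and attempt parameters are illustrative assumptions; a real client would use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical client sketch; endpoint path and status names are taken from this README.
final class AuditPollingClientSketch {

    private static final Pattern STATUS = Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    // Pulls the string-valued "status" field out of a job response body.
    static String extractStatus(String body) {
        Matcher m = STATUS.matcher(body);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    static boolean isTerminal(String status) {
        return status.equals("COMPLETED") || status.equals("FAILED");
    }

    // Polls until the job finishes, the job is not found, or maxAttempts is reached.
    // Returns the last response body.
    static String pollUntilDone(String baseUrl, String jobId, Duration interval, int maxAttempts)
            throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/audit/" + jobId))
                .GET()
                .build();
        String body = "";
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            body = response.body();
            if (response.statusCode() == 404 || isTerminal(extractStatus(body))) {
                break;
            }
            Thread.sleep(interval.toMillis());
        }
        return body;
    }
}
```

A call such as `pollUntilDone("http://localhost:8080", jobId, Duration.ofSeconds(3), 100)` would check every three seconds for up to five minutes; both values are arbitrary.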

Audit Report Schema

The complete audit report JSON schema used across frontend and backend:

{
  "repositoryUrl": "string",
  "auditTimestamp": "string (ISO-8601)",
  "projectSummary": "string",
  "overallScore": "integer (0-100)",
  "issueSummary": {
    "totalIssues": "integer",
    "critical": "integer",
    "high": "integer",
    "medium": "integer",
    "low": "integer"
  },
  "issues": [
    {
      "id": "string (ISSUE-001 format)",
      "category": "SECURITY | VULNERABILITY | API_LEAK | BACKDOOR | ERROR | WARNING | ARCHITECTURE | PERFORMANCE | CODE_QUALITY | BEST_PRACTICE",
      "severity": "CRITICAL | HIGH | MEDIUM | LOW",
      "title": "string",
      "description": "string",
      "filePath": "string",
      "lineNumbers": "string (e.g., '12-14')",
      "codeSnippet": "string",
      "recommendation": "string",
      "impact": "string"
    }
  ],
  "architectureReview": {
    "summary": "string",
    "strengths": ["string"],
    "weaknesses": ["string"]
  },
  "preventionGuide": {
    "summary": "string",
    "recommendations": ["string"]
  }
}

Issue Categories

| Category | Description |
|---|---|
| `SECURITY` | General security issues (XSS, CSRF, SQL injection) |
| `VULNERABILITY` | Known vulnerability patterns |
| `API_LEAK` | Hardcoded API keys, secrets, credentials |
| `BACKDOOR` | Suspicious code that could be a backdoor |
| `ERROR` | Bugs and runtime errors |
| `WARNING` | Code smells and minor issues |
| `ARCHITECTURE` | Structural/design pattern issues |
| `PERFORMANCE` | Performance bottlenecks |
| `CODE_QUALITY` | Readability, maintainability issues |
| `BEST_PRACTICE` | Deviations from best practices |

Issue Severities

| Severity | Description |
|---|---|
| `CRITICAL` | Must fix immediately: security breach, data loss risk |
| `HIGH` | Should fix soon: significant bugs or vulnerabilities |
| `MEDIUM` | Fix when possible: code quality, minor security concerns |
| `LOW` | Nice to have: style, best practice suggestions |
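
Because `issueSummary` repeats information already present in `issues`, a consumer may want to recompute the counts and compare. The sketch below shows one way to do that; the record shapes are trimmed-down, hypothetical versions of the schema, and the consistency check itself is an assumption rather than something the backend is documented to perform.

```java
import java.util.List;

// Hypothetical, trimmed-down model of the report schema for illustration only.
final class IssueSummarySketch {

    enum Severity { CRITICAL, HIGH, MEDIUM, LOW }

    record Issue(String id, Severity severity, String title) {}

    record Summary(int totalIssues, int critical, int high, int medium, int low) {}

    // Derives the summary counts from the issues list.
    static Summary summarize(List<Issue> issues) {
        int[] counts = new int[Severity.values().length];
        for (Issue issue : issues) {
            counts[issue.severity().ordinal()]++;
        }
        return new Summary(issues.size(),
                counts[Severity.CRITICAL.ordinal()],
                counts[Severity.HIGH.ordinal()],
                counts[Severity.MEDIUM.ordinal()],
                counts[Severity.LOW.ordinal()]);
    }

    // Returns true when a reported summary matches the issues it describes.
    static boolean isConsistent(Summary reported, List<Issue> issues) {
        return reported.equals(summarize(issues));
    }
}
```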

Error Handling

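Errors are returned in a consistent format (a failed job polled via GET /api/v1/audit/{jobId} instead uses the job response shape with status FAILED):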

{
  "status": 400,
  "error": "Error Type",
  "message": "Human-readable error message",
  "timestamp": "2026-02-08T18:00:00Z"
}

| HTTP Status | Error | Cause |
|---|---|---|
| `400` | Validation Error | Invalid request body (bad URL format, missing fields) |
| `400` | Repository Too Large | Repository exceeds the 50 MB size limit |
| `400` | Clone Failed | Repository doesn't exist, is private, or can't be cloned |
| `400` | Insufficient Description | Project lacks adequate description for a meaningful audit |
| `404` | Job Not Found | The polled job ID doesn't exist |
| `422` | (varies) | Audit job failed during execution (returned via polling) |
| `500` | Audit Error | Gemini API failure or internal processing error |
| `500` | Internal Server Error | Unexpected server error |
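
The mapping in this table can be illustrated without Spring. In the sketch below the exception classes are hypothetical stand-ins for the project's own, only three rows of the table are covered, and the project itself does this in a `@RestControllerAdvice` class.

```java
import java.time.Instant;

// Hypothetical sketch of exception-to-response mapping; not the project's GlobalExceptionHandler.
final class ErrorMappingSketch {

    record ErrorResponse(int status, String error, String message, String timestamp) {}

    // Stand-ins for the project's exception types.
    static final class RepositoryTooLarge extends RuntimeException {
        RepositoryTooLarge(String message) {
            super(message);
        }
    }

    static final class JobNotFound extends RuntimeException {
        JobNotFound(String message) {
            super(message);
        }
    }

    // The clock value is passed in so the result is deterministic in tests.
    static ErrorResponse toResponse(Throwable error, Instant now) {
        int status;
        String label;
        if (error instanceof RepositoryTooLarge) {
            status = 400;
            label = "Repository Too Large";
        } else if (error instanceof JobNotFound) {
            status = 404;
            label = "Job Not Found";
        } else {
            status = 500;
            label = "Internal Server Error";
        }
        return new ErrorResponse(status, label, error.getMessage(), now.toString());
    }
}
```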

Audit Pipeline Stages

Each audit job progresses through these stages:

QUEUED → CLONING → READING → VALIDATING_DESCRIPTION → AUDITING → VALIDATING_STRUCTURE → COMPLETED
                                                                                          ↓
                                              (any stage can fail) ─────────────────→ FAILED

| Stage | What Happens |
|---|---|
| QUEUED | Job created, waiting for an available thread in the pool |
| CLONING | Calls GitHub API to verify repo is public and under 50 MB, then shallow-clones via JGit |
| READING | Walks the cloned repo's file tree, filters out non-source files, reads content into memory |
| VALIDATING_DESCRIPTION | Sends README + project description to Gemini to check if it's sufficient for an audit |
| AUDITING | Sends all project files + README + description to Gemini for deep analysis |
| VALIDATING_STRUCTURE | Attempts to parse the audit JSON; if parsing fails, asks Gemini to fix the structure |
| COMPLETED | Audit report is ready and available via the polling endpoint |
| FAILED | Something went wrong; the error message describes the cause |
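
The parse-then-repair behaviour of the VALIDATING_STRUCTURE stage can be sketched generically. The parser and repairer are passed in as functions, so the sketch makes no claim about which JSON library or prompt the project uses. Allowing exactly one repair attempt is an assumption.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical sketch; not the project's StructureValidatorAgent.
final class StructureValidationSketch {

    // parser: returns the parsed report, or empty when the JSON is malformed.
    // repairer: stands in for the call that asks the model to fix the JSON.
    static <T> T validateAndParse(String rawJson,
                                  Function<String, Optional<T>> parser,
                                  UnaryOperator<String> repairer) {
        Optional<T> first = parser.apply(rawJson);
        if (first.isPresent()) {
            return first.get();
        }
        // One repair attempt only (an assumption); a second failure fails the job.
        return parser.apply(repairer.apply(rawJson))
                .orElseThrow(() -> new IllegalStateException(
                        "Audit report JSON is still invalid after repair"));
    }
}
```

Trying the local parse first means the extra model call is only paid for when the first response is actually malformed.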

File Filtering

When reading repository files, the following are excluded:

Directories: .git, node_modules, vendor, venv, target, build, dist, out, .idea, .vscode, .next, coverage, and more.

File extensions: .class, .jar, .exe, .png, .jpg, .mp3, .mp4, .zip, .pdf, .woff, .min.js, .min.css, .map, and more.

Specific files: package-lock.json, yarn.lock, pnpm-lock.yaml, .DS_Store

Per-file size limit: Files larger than 512KB are skipped.
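
These rules might be expressed as a single predicate, as in the sketch below. It includes only the entries named above (the real lists are longer), and `FileFilterSketch` is a hypothetical name rather than the project's `FileFilterUtil`.

```java
import java.nio.file.Path;
import java.util.Set;

// Hypothetical sketch covering only the exclusions this README names.
final class FileFilterSketch {

    static final long MAX_FILE_BYTES = 512 * 1024;

    static final Set<String> EXCLUDED_DIRS = Set.of(".git", "node_modules", "vendor", "venv",
            "target", "build", "dist", "out", ".idea", ".vscode", ".next", "coverage");
    static final Set<String> EXCLUDED_SUFFIXES = Set.of(".class", ".jar", ".exe", ".png", ".jpg",
            ".mp3", ".mp4", ".zip", ".pdf", ".woff", ".min.js", ".min.css", ".map");
    static final Set<String> EXCLUDED_NAMES = Set.of("package-lock.json", "yarn.lock",
            "pnpm-lock.yaml", ".DS_Store");

    // relativePath is relative to the clone root; sizeBytes is the file size on disk.
    static boolean isEligible(Path relativePath, long sizeBytes) {
        if (sizeBytes > MAX_FILE_BYTES) {
            return false;
        }
        // Reject the file if any parent directory segment is excluded.
        Path parent = relativePath.getParent();
        if (parent != null) {
            for (Path segment : parent) {
                if (EXCLUDED_DIRS.contains(segment.toString())) {
                    return false;
                }
            }
        }
        String name = relativePath.getFileName().toString();
        if (EXCLUDED_NAMES.contains(name)) {
            return false;
        }
        // Suffix matching handles compound extensions such as ".min.js".
        String lower = name.toLowerCase();
        return EXCLUDED_SUFFIXES.stream().noneMatch(lower::endsWith);
    }
}
```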

Cleanup & Job Lifecycle

Job TTL: Audit jobs are kept in memory for 30 minutes (configurable).
Cleanup scheduler: Runs every 5 minutes, removes expired jobs and deletes their cloned repository directories from the temp folder.
Thread pool: 5 core threads, max 10, queue capacity of 20. Uses CallerRunsPolicy when saturated.
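
In plain JDK terms, the pool settings and TTL rule above correspond roughly to the sketch below. The 60-second keep-alive and the `ArrayBlockingQueue` are assumptions, and the project configures its executor through Spring in `AsyncConfig.java` rather than constructing a `ThreadPoolExecutor` directly.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the lifecycle settings described above.
final class LifecycleSketch {

    // 5 core threads, 10 max, 20 queued tasks; beyond that the submitting thread runs the task.
    static ThreadPoolExecutor newAuditExecutor() {
        return new ThreadPoolExecutor(5, 10, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(20),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // A job is expired once it is older than the TTL.
    static boolean isExpired(Instant createdAt, Instant now, Duration ttl) {
        return createdAt.plus(ttl).isBefore(now);
    }
}
```

Two consequences of these settings are worth keeping in mind. A `ThreadPoolExecutor` only grows beyond its core size once the queue is full, so the sixth through tenth threads start only after 20 jobs are waiting. With `CallerRunsPolicy`, a fully saturated pool makes the submitting thread run the audit itself, so in that situation the POST request would likely block instead of returning immediately.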

Deployment

Backend — Google Cloud

The backend is designed for deployment on Google Cloud (Cloud Run or GKE).

Set the GEMINI_API_KEY environment variable in your deployment configuration.

Frontend — Vercel

The Angular frontend is deployed separately on Vercel. Ensure CORS origins in WebConfig.java include your production frontend URL.

License

TBD
