junaidsultanxyz/github-deep-audit

GitHub Deep-Audit

A GitHub repository auditor that scans an entire project, written in any programming language, and generates a structured audit report using Google's Gemini 3 Pro AI model.


Overview

GitHub Deep-Audit accepts a public GitHub repository URL and an optional project description, then:

  1. Clones the repository (shallow clone, depth=1)
  2. Reads all eligible source files, filtering out binaries, lock files, and build artifacts
  3. Validates whether the project has a sufficient description (via AI)
  4. Audits the entire codebase for security vulnerabilities, API key leaks, backdoors, architecture issues, code quality problems, and more (via AI)
  5. Validates the audit report structure for correctness
  6. Returns a structured JSON audit report with an overall score, categorized issues, architecture review, and prevention guide

The entire pipeline runs asynchronously. The client submits an audit request, receives a job ID, and polls for results.


Tech Stack

Layer     Technology
Backend   Spring Boot 4.0.2, Java 25
AI        Google Gemini 3 Pro via LangChain4j 1.11.0
Git       Eclipse JGit 7.1.0
Build     Maven (wrapper included)
Frontend  Angular 20, Tailwind V4 (separate repository)

Architecture

System Flow

Frontend (Angular) ⇔ Backend (Spring Boot) ⇔ Gemini 3 Pro API

AI Agent Orchestration

The backend uses multi-agent AI orchestration with three agents that run sequentially:

Agent                      Input                                      Purpose
DescriptionValidatorAgent  README + project description               Checks if the project is well-described enough for a meaningful audit
AuditAgent                 Full project files + README + description  Performs the deep code audit and generates the structured report
StructureValidatorAgent    Raw JSON from AuditAgent                   Validates JSON structure; if invalid, asks Gemini to fix it

Key design: Only the AuditAgent receives the full source code. The DescriptionValidatorAgent only sees the README + description. The StructureValidatorAgent only sees the JSON output.
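
The data-visibility rule above can be sketched in plain Java. This is a minimal illustration, not the project's code: the class and method names are hypothetical, the model call is a stand-in `Function<String, String>` rather than the LangChain4j API, the "yes" reply convention is invented for the example, and a crude brace check stands in for real JSON parsing.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of three sequential agents; the model is a plain function, not LangChain4j. */
final class AgentPipelineSketch {

    /** Everything gathered from the repository before any agent runs. */
    record ProjectInput(String readme, String description, Map<String, String> files) {}

    /** Agent 1 sees only the README and the description (assumed "yes"/"no" reply). */
    static boolean descriptionIsSufficient(ProjectInput in, Function<String, String> model) {
        String prompt = "README:\n" + in.readme() + "\nDescription:\n" + in.description();
        return model.apply(prompt).trim().equalsIgnoreCase("yes");
    }

    /** Agent 2 is the only step whose prompt contains the source files. */
    static String audit(ProjectInput in, Function<String, String> model) {
        StringBuilder prompt = new StringBuilder("Audit this project.\n");
        in.files().forEach((path, content) ->
                prompt.append("--- ").append(path).append(" ---\n").append(content).append('\n'));
        return model.apply(prompt.toString());
    }

    /** Agent 3 sees only the raw JSON; a brace check stands in for real parsing. */
    static String validateStructure(String rawJson, Function<String, String> model) {
        String trimmed = rawJson.trim();
        boolean looksLikeObject = trimmed.startsWith("{") && trimmed.endsWith("}");
        return looksLikeObject ? trimmed : model.apply("Fix this JSON:\n" + trimmed).trim();
    }

    /** Runs the three agents in order and returns the validated JSON text. */
    static String run(ProjectInput in, Function<String, String> model) {
        if (!descriptionIsSufficient(in, model)) {
            throw new IllegalStateException("Insufficient description");
        }
        return validateStructure(audit(in, model), model);
    }
}
```

Passing the model as a function keeps each agent testable with a fake, which is one reason a design like this might be preferred over calling the API directly inside each step.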

Async Job Pipeline

POST /api/v1/audit
    │
    ▼
AuditJobManager.create()              → QUEUED
    │
    ▼  (@Async on thread pool)
GitService.validateRepo()             → CLONING
GitService.cloneRepo()
    │
    ▼
FileReaderService.readProject()        → READING
    │
    ▼
DescriptionValidatorAgent.validate()   → VALIDATING_DESCRIPTION
    │
    ▼
AuditAgent.audit()                     → AUDITING
    │
    ▼
StructureValidatorAgent.validateAndParse() → VALIDATING_STRUCTURE
    │
    ▼
AuditJobManager.complete(report)       → COMPLETED

The client polls GET /api/v1/audit/{jobId} to track progress through each stage.
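
The submit-then-poll pattern can be reduced to a few lines of standard Java. The sketch below is an assumption-laden illustration: `AsyncJobSketch` and its methods are hypothetical names, it collapses the intermediate stages into a single AUDITING step, and it uses a plain `Executor` in place of Spring's `@Async`.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

/** Hypothetical sketch: submit returns a job ID at once, the work runs on the executor. */
final class AsyncJobSketch {
    private final Map<String, String> statusByJobId = new ConcurrentHashMap<>();
    private final Executor executor;

    AsyncJobSketch(Executor executor) {
        this.executor = executor;
    }

    /** Registers the job as QUEUED and hands the work to the executor without waiting for it. */
    String submit(Runnable work) {
        String jobId = UUID.randomUUID().toString();
        statusByJobId.put(jobId, "QUEUED");
        executor.execute(() -> {
            try {
                statusByJobId.put(jobId, "AUDITING"); // intermediate stages omitted for brevity
                work.run();
                statusByJobId.put(jobId, "COMPLETED");
            } catch (RuntimeException e) {
                statusByJobId.put(jobId, "FAILED");
            }
        });
        return jobId;
    }

    /** What a polling endpoint would read; null means the job ID is unknown. */
    String status(String jobId) {
        return statusByJobId.get(jobId);
    }
}
```

With a real thread pool as the executor, `submit` returns while the job is still QUEUED; with `Runnable::run` as the executor the work runs inline, which is convenient in tests.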


Project Structure

src/main/java/com/junaidsultan/github_deep_audit/
├── GitHubDeepAuditApplication.java       # Entry point (@EnableAsync, @EnableScheduling)
│
├── config/
│   ├── GeminiConfig.java                 # ChatModel bean (Gemini 3 Pro via LangChain4j)
│   ├── GeminiProperties.java             # Gemini config properties (api-key, model, temp)
│   ├── AuditProperties.java              # Audit config properties (max size, temp dir, TTL)
│   ├── AsyncConfig.java                  # Thread pool executor for async audit jobs
│   └── WebConfig.java                    # CORS configuration + GitHub RestClient bean
│
├── controller/
│   └── AuditController.java              # REST API: POST /audit, GET /audit/{jobId}
│
├── service/
│   ├── AuditOrchestratorService.java     # @Async orchestrator — runs the full pipeline
│   ├── GitService.java                   # Validates repo (GitHub API) + shallow clones (JGit)
│   ├── FileReaderService.java            # Walks repo tree, reads files, finds README
│   ├── AuditJobManager.java              # In-memory job store (ConcurrentHashMap)
│   └── agent/
│       ├── DescriptionValidatorAgent.java # Agent 1: validates project description
│       ├── AuditAgent.java                # Agent 2: deep audit of full codebase
│       └── StructureValidatorAgent.java   # Agent 3: validates/fixes JSON structure
│
├── model/
│   ├── request/
│   │   └── AuditRequest.java             # { repoUrl, projectDescription }
│   ├── response/
│   │   ├── AuditReport.java              # Full audit report
│   │   ├── IssueSummary.java             # Issue counts by severity
│   │   ├── AuditIssue.java               # Individual issue details
│   │   ├── ArchitectureReview.java        # Architecture strengths/weaknesses
│   │   ├── PreventionGuide.java           # Prevention recommendations
│   │   ├── AuditJobResponse.java          # Polling response (status + report)
│   │   └── ErrorResponse.java             # Error response format
│   ├── job/
│   │   └── AuditJob.java                 # Job state object (thread-safe)
│   ├── agent/
│   │   └── DescriptionValidation.java    # Agent 1 response { wellDescribed, reason }
│   └── enums/
│       ├── AuditStage.java               # QUEUED → CLONING → ... → COMPLETED/FAILED
│       ├── IssueSeverity.java            # CRITICAL, HIGH, MEDIUM, LOW
│       └── IssueCategory.java            # SECURITY, VULNERABILITY, API_LEAK, etc.
│
├── exception/
│   ├── GlobalExceptionHandler.java       # @RestControllerAdvice — maps exceptions to HTTP
│   ├── RepositoryTooLargeException.java
│   ├── RepositoryCloneException.java
│   ├── InsufficientDescriptionException.java
│   ├── AuditJobNotFoundException.java
│   └── AuditException.java
│
├── cleanup/
│   └── CleanupScheduler.java            # @Scheduled — purges expired jobs + temp dirs
│
└── util/
    ├── FileFilterUtil.java               # Filters out binaries, lock files, build dirs
    └── GitHubUrlParser.java              # Parses owner/repo from GitHub URLs

Prerequisites

  • Java 25 (with preview features)
  • Gemini API Key from Google AI Studio
  • Internet access (to hit GitHub API and clone repos)

Maven is not required globally — the project includes a Maven wrapper (mvnw.cmd / mvnw).


Setup & Installation

  1. Clone the project

    git clone https://github.com/your-username/github-deep-audit.git
    cd github-deep-audit

  2. Set the Gemini API key as an environment variable

    Windows (CMD):

    set GEMINI_API_KEY=your-gemini-api-key-here

    Windows (PowerShell):

    $env:GEMINI_API_KEY = "your-gemini-api-key-here"

    Linux/macOS:

    export GEMINI_API_KEY=your-gemini-api-key-here

  3. Build the project

    Windows:

    mvnw.cmd compile

    Linux/macOS:

    ./mvnw compile

Configuration

All configuration is in src/main/resources/application.yaml:

gemini:
  api-key: ${GEMINI_API_KEY}        # Required — set via environment variable
  model: gemini-3-pro               # Gemini model to use
  temperature: 0.2                  # Lower = more deterministic
  max-output-tokens: 65536          # Max tokens in Gemini response

audit:
  max-repo-size-mb: 50              # Reject repos larger than this
  temp-dir: ${java.io.tmpdir}/github-deep-audit  # Where repos are cloned
  job-ttl-minutes: 30               # Jobs older than this are cleaned up
  cleanup-interval-minutes: 5       # How often cleanup runs

server:
  port: 8080

Running the Application

Windows:

mvnw.cmd spring-boot:run

Linux/macOS:

./mvnw spring-boot:run

The server starts on http://localhost:8080. You'll see:

Started GitHubDeepAuditApplication in X.XXX seconds

API Documentation

Base URL: http://localhost:8080

1. Start an Audit

POST /api/v1/audit

Submits a GitHub repository for auditing. The audit runs asynchronously — this endpoint returns immediately with a job ID.

Request

Field               Type    Required  Description
repoUrl             string  Yes       Public GitHub repository URL (e.g., https://github.com/owner/repo)
projectDescription  string  No        Optional description of the project to help the AI understand its purpose

{
  "repoUrl": "https://github.com/owner/repo",
  "projectDescription": "A REST API for managing user authentication using JWT tokens"
}

Response — 202 Accepted

{
  "jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "QUEUED",
  "message": "Audit job created and queued for processing",
  "report": null
}

Validation Errors — 400 Bad Request

{
  "status": 400,
  "error": "Validation Error",
  "message": "repoUrl: Must be a valid GitHub repository URL (e.g., https://github.com/owner/repo)",
  "timestamp": "2026-02-08T18:00:00Z"
}
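
The kind of URL check behind this validation error might look like the sketch below. The accepted URL shape (HTTPS only, optional `.git` suffix or trailing slash) and the names `GitHubUrlSketch` and `RepoRef` are assumptions for illustration; the project's actual GitHubUrlParser may accept a different set of URLs.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of extracting owner and repo from a GitHub repository URL. */
final class GitHubUrlSketch {

    record RepoRef(String owner, String repo) {}

    // Assumed shape: https://github.com/<owner>/<repo>, optionally ending in ".git" or "/".
    private static final Pattern REPO_URL = Pattern.compile(
            "^https://github\\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+?)(?:\\.git)?/?$");

    /** Returns the owner/repo pair, or empty when the URL is not a repository root. */
    static Optional<RepoRef> parse(String url) {
        if (url == null) {
            return Optional.empty();
        }
        Matcher m = REPO_URL.matcher(url.trim());
        return m.matches()
                ? Optional.of(new RepoRef(m.group(1), m.group(2)))
                : Optional.empty();
    }
}
```

Returning an empty `Optional` rather than throwing lets the caller decide how to turn a bad URL into a 400 response.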

cURL Example

curl -X POST http://localhost:8080/api/v1/audit \
  -H "Content-Type: application/json" \
  -d '{
    "repoUrl": "https://github.com/kelseyhightower/nocode",
    "projectDescription": "No code application"
  }'

PowerShell Example

Invoke-RestMethod -Method POST `
  -Uri "http://localhost:8080/api/v1/audit" `
  -ContentType "application/json" `
  -Body '{"repoUrl": "https://github.com/kelseyhightower/nocode", "projectDescription": "No code application"}'

2. Poll Audit Status

GET /api/v1/audit/{jobId}

Polls the current status of an audit job. Returns the full audit report when the job completes.

Path Parameters

Parameter  Type  Description
jobId      UUID  The job ID returned by the POST endpoint

Response — 200 OK (In Progress)

{
  "jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "AUDITING",
  "message": "Performing deep audit analysis...",
  "report": null
}

The status field progresses through these stages:

Status                  Message                               Meaning
QUEUED                  Audit job is queued for processing    Waiting for a thread
CLONING                 Cloning repository...                 Validating repo metadata + cloning
READING                 Reading project files...              Walking file tree + reading source files
VALIDATING_DESCRIPTION  Validating project description...     AI checking if description is sufficient
AUDITING                Performing deep audit analysis...     AI performing the deep code audit
VALIDATING_STRUCTURE    Validating audit report structure...  Verifying JSON structure of the report
COMPLETED               Audit completed successfully          Done; report is included
FAILED                  (dynamic error message)               Something went wrong

Response — 200 OK (Completed)

When status is COMPLETED, the report field contains the full audit report:

{
  "jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "COMPLETED",
  "message": "Audit completed successfully",
  "report": {
    "repositoryUrl": "https://github.com/owner/repo",
    "auditTimestamp": "2026-02-08T18:05:30Z",
    "projectSummary": "A REST API for user authentication...",
    "overallScore": 72,
    "issueSummary": {
      "totalIssues": 8,
      "critical": 1,
      "high": 2,
      "medium": 3,
      "low": 2
    },
    "issues": [
      {
        "id": "ISSUE-001",
        "category": "API_LEAK",
        "severity": "CRITICAL",
        "title": "Hardcoded API Key in Configuration",
        "description": "The file contains a hardcoded API key that is committed to version control...",
        "filePath": "src/config/api.js",
        "lineNumbers": "12-14",
        "codeSnippet": "const API_KEY = 'sk-abc123...'",
        "recommendation": "Move the API key to an environment variable and use a .env file...",
        "impact": "Anyone with access to the repository can extract the API key and abuse the service..."
      }
    ],
    "architectureReview": {
      "summary": "The project follows a basic MVC pattern with some deviations...",
      "strengths": [
        "Clear separation of routes and controllers",
        "Consistent error handling middleware"
      ],
      "weaknesses": [
        "No dependency injection — tightly coupled modules",
        "Business logic mixed into controller layer"
      ]
    },
    "preventionGuide": {
      "summary": "Most issues stem from inadequate secret management and missing input validation...",
      "recommendations": [
        "Use environment variables for all secrets and API keys",
        "Implement input validation on all user-facing endpoints",
        "Add a pre-commit hook to scan for leaked secrets"
      ]
    }
  }
}

Response — 422 Unprocessable Content (Failed)

{
  "jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "FAILED",
  "message": "Repository size (120 MB) exceeds the maximum allowed size (50 MB)",
  "report": null
}

Response — 404 Not Found

{
  "status": 404,
  "error": "Job Not Found",
  "message": "Audit job not found: a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "timestamp": "2026-02-08T18:00:00Z"
}

cURL Example

curl http://localhost:8080/api/v1/audit/a1b2c3d4-e5f6-7890-abcd-ef1234567890
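
A Java client could poll the same endpoint with the JDK's built-in `java.net.http.HttpClient`. The sketch below is illustrative only: `AuditPollerSketch` is a hypothetical name, the polling delay is chosen by the caller, and the status is pulled out with a regular expression to keep the example dependency-free (a real client would more likely use a JSON library).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical polling client for GET /api/v1/audit/{jobId}. */
final class AuditPollerSketch {
    // Matches the first string-valued "status" field; numeric statuses (error bodies) do not match.
    private static final Pattern STATUS = Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    static String extractStatus(String body) {
        Matcher m = STATUS.matcher(body);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    static boolean isTerminal(String status) {
        return status.equals("COMPLETED") || status.equals("FAILED");
    }

    /** Polls until the job reaches COMPLETED or FAILED and returns the final response body. */
    static String pollUntilDone(String baseUrl, String jobId, long delayMillis) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(baseUrl + "/api/v1/audit/" + jobId))
                .GET()
                .build();
        while (true) {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 404) {
                throw new IllegalStateException("Unknown job: " + jobId);
            }
            if (isTerminal(extractStatus(response.body()))) {
                return response.body();
            }
            Thread.sleep(delayMillis);
        }
    }
}
```

A call such as `AuditPollerSketch.pollUntilDone("http://localhost:8080", jobId, 3000)` would block until the job finishes; a production client would probably also cap the total wait time.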

Audit Report Schema

The complete audit report JSON schema used across frontend and backend:

{
  "repositoryUrl": "string",
  "auditTimestamp": "string (ISO-8601)",
  "projectSummary": "string",
  "overallScore": "integer (0-100)",
  "issueSummary": {
    "totalIssues": "integer",
    "critical": "integer",
    "high": "integer",
    "medium": "integer",
    "low": "integer"
  },
  "issues": [
    {
      "id": "string (ISSUE-001 format)",
      "category": "SECURITY | VULNERABILITY | API_LEAK | BACKDOOR | ERROR | WARNING | ARCHITECTURE | PERFORMANCE | CODE_QUALITY | BEST_PRACTICE",
      "severity": "CRITICAL | HIGH | MEDIUM | LOW",
      "title": "string",
      "description": "string",
      "filePath": "string",
      "lineNumbers": "string (e.g., '12-14')",
      "codeSnippet": "string",
      "recommendation": "string",
      "impact": "string"
    }
  ],
  "architectureReview": {
    "summary": "string",
    "strengths": ["string"],
    "weaknesses": ["string"]
  },
  "preventionGuide": {
    "summary": "string",
    "recommendations": ["string"]
  }
}
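
Part of this schema maps naturally onto Java records. The sketch below is a trimmed, hypothetical model (only a few issue fields are kept, and category is left as a plain string) together with a consistency check a consumer might run, since the schema itself does not guarantee that totalIssues equals the sum of the per-severity counts.

```java
import java.util.List;

/** Hypothetical, trimmed model of the report's issue and summary shapes. */
final class ReportModelSketch {

    enum IssueSeverity { CRITICAL, HIGH, MEDIUM, LOW }

    /** Subset of the issue fields; the full schema has more. */
    record AuditIssue(String id, String category, IssueSeverity severity, String title) {}

    record IssueSummary(int totalIssues, int critical, int high, int medium, int low) {

        /** Derives the counts from the issue list so the summary cannot drift from it. */
        static IssueSummary from(List<AuditIssue> issues) {
            int[] counts = new int[IssueSeverity.values().length];
            for (AuditIssue issue : issues) {
                counts[issue.severity().ordinal()]++;
            }
            return new IssueSummary(issues.size(), counts[0], counts[1], counts[2], counts[3]);
        }

        /** True when the total matches the per-severity counts. */
        boolean isConsistent() {
            return totalIssues == critical + high + medium + low;
        }
    }
}
```

For the sample report shown earlier (8 total: 1 critical, 2 high, 3 medium, 2 low), `isConsistent()` would return true.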

Issue Categories

Category       Description
SECURITY       General security issues (XSS, CSRF, SQL injection)
VULNERABILITY  Known vulnerability patterns
API_LEAK       Hardcoded API keys, secrets, credentials
BACKDOOR       Suspicious code that could be a backdoor
ERROR          Bugs and runtime errors
WARNING        Code smells and minor issues
ARCHITECTURE   Structural/design pattern issues
PERFORMANCE    Performance bottlenecks
CODE_QUALITY   Readability, maintainability issues
BEST_PRACTICE  Deviations from best practices

Issue Severities

Severity  Description
CRITICAL  Must fix immediately: security breach, data loss risk
HIGH      Should fix soon: significant bugs or vulnerabilities
MEDIUM    Fix when possible: code quality, minor security concerns
LOW       Nice to have: style, best practice suggestions

Error Handling

All errors follow a consistent format:

{
  "status": 400,
  "error": "Error Type",
  "message": "Human-readable error message",
  "timestamp": "2026-02-08T18:00:00Z"
}

HTTP Status  Error                     Cause
400          Validation Error          Invalid request body (bad URL format, missing fields)
400          Repository Too Large      Repository exceeds the 50 MB size limit
400          Clone Failed              Repository doesn't exist, is private, or can't be cloned
400          Insufficient Description  Project lacks adequate description for a meaningful audit
404          Job Not Found             The polled job ID doesn't exist
422          (varies)                  Audit job failed during execution (returned via polling)
500          Audit Error               Gemini API failure or internal processing error
500          Internal Server Error     Unexpected server error

Audit Pipeline Stages

Each audit job progresses through these stages:

QUEUED → CLONING → READING → VALIDATING_DESCRIPTION → AUDITING → VALIDATING_STRUCTURE → COMPLETED
                                                                                          ↓
                                              (any stage can fail) ─────────────────→ FAILED

Stage                   What Happens
QUEUED                  Job created, waiting for an available thread in the pool
CLONING                 Calls GitHub API to verify repo is public and under 50 MB, then shallow-clones via JGit
READING                 Walks the cloned repo's file tree, filters out non-source files, reads content into memory
VALIDATING_DESCRIPTION  Sends README + project description to Gemini to check if it's sufficient for an audit
AUDITING                Sends all project files + README + description to Gemini for deep analysis
VALIDATING_STRUCTURE    Attempts to parse the audit JSON; if parsing fails, asks Gemini to fix the structure
COMPLETED               Audit report is ready and available via the polling endpoint
FAILED                  Something went wrong; the error message describes the cause
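
The stage progression can be expressed as a small state machine. The enum below is a sketch under assumptions: it is named `AuditStageSketch` to make clear it is not the project's AuditStage, and the transition rule (advance exactly one step, or fail from any non-terminal stage) is inferred from the diagram above rather than taken from the code.

```java
import java.util.Optional;

/** Hypothetical sketch of the stage order; declaration order is the happy path. */
enum AuditStageSketch {
    QUEUED, CLONING, READING, VALIDATING_DESCRIPTION, AUDITING, VALIDATING_STRUCTURE, COMPLETED, FAILED;

    boolean isTerminal() {
        return this == COMPLETED || this == FAILED;
    }

    /** Next stage on the happy path; empty for COMPLETED and FAILED. */
    Optional<AuditStageSketch> next() {
        return isTerminal() ? Optional.empty() : Optional.of(values()[ordinal() + 1]);
    }

    /** A job may advance exactly one step, or fail from any non-terminal stage. */
    boolean canMoveTo(AuditStageSketch target) {
        if (isTerminal()) {
            return false;
        }
        return target == FAILED || next().filter(target::equals).isPresent();
    }
}
```

A guard like `canMoveTo` could be used when updating job state to reject out-of-order updates, for example a late worker trying to move a FAILED job back to AUDITING.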

File Filtering

When reading repository files, the following are excluded:

Directories: .git, node_modules, vendor, venv, target, build, dist, out, .idea, .vscode, .next, coverage, and more.

File extensions: .class, .jar, .exe, .png, .jpg, .mp3, .mp4, .zip, .pdf, .woff, .min.js, .min.css, .map, and more.

Specific files: package-lock.json, yarn.lock, pnpm-lock.yaml, .DS_Store

Per-file size limit: Files larger than 512KB are skipped.
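
The exclusion rules above could be combined into a single predicate along these lines. This is a sketch, not the project's FileFilterUtil: the class name is hypothetical, and the sets contain only the entries listed here (the README's "and more" entries are unknown and left out).

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the file-eligibility check; lists are limited to the documented entries. */
final class FileFilterSketch {
    static final Set<String> EXCLUDED_DIRS = Set.of(
            ".git", "node_modules", "vendor", "venv", "target", "build",
            "dist", "out", ".idea", ".vscode", ".next", "coverage");
    static final Set<String> EXCLUDED_SUFFIXES = Set.of(
            ".class", ".jar", ".exe", ".png", ".jpg", ".mp3", ".mp4",
            ".zip", ".pdf", ".woff", ".min.js", ".min.css", ".map");
    static final Set<String> EXCLUDED_FILES = Set.of(
            "package-lock.json", "yarn.lock", "pnpm-lock.yaml", ".DS_Store");
    static final long MAX_FILE_BYTES = 512 * 1024;

    /** Takes a path relative to the repository root plus the file size in bytes. */
    static boolean isEligible(Path relativePath, long sizeBytes) {
        if (sizeBytes > MAX_FILE_BYTES) {
            return false;
        }
        Path parent = relativePath.getParent();
        if (parent != null) {
            for (Path segment : parent) { // each directory name on the way to the file
                if (EXCLUDED_DIRS.contains(segment.toString())) {
                    return false;
                }
            }
        }
        String name = relativePath.getFileName().toString();
        if (EXCLUDED_FILES.contains(name)) {
            return false;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        return EXCLUDED_SUFFIXES.stream().noneMatch(lower::endsWith);
    }
}
```

Checking suffixes with `endsWith` rather than a single extension lets compound suffixes such as `.min.js` be excluded while plain `.js` files stay eligible.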


Cleanup & Job Lifecycle

  • Job TTL: Audit jobs are kept in memory for 30 minutes (configurable).
  • Cleanup scheduler: Runs every 5 minutes, removes expired jobs and deletes their cloned repository directories from the temp folder.
  • Thread pool: 5 core threads, max 10, queue capacity of 20. Uses CallerRunsPolicy when saturated.
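
The pool sizing and TTL purge described in this list can be sketched with the JDK alone. Several details below are assumptions: the class and method names are hypothetical, the 60-second keep-alive and the `ArrayBlockingQueue` are choices made for the example, and the project itself configures its pool through Spring in AsyncConfig.java rather than constructing a `ThreadPoolExecutor` directly.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the documented pool limits and job expiry. */
final class LifecycleSketch {

    /** 5 core threads, 10 max, queue of 20; when saturated the submitting thread runs the task. */
    static ThreadPoolExecutor newAuditPool() {
        return new ThreadPoolExecutor(
                5, 10,
                60, TimeUnit.SECONDS, // keep-alive for idle non-core threads (assumed value)
                new ArrayBlockingQueue<>(20),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Removes jobs created more than ttl before now and returns how many were purged. */
    static int purgeExpired(Map<String, Instant> createdAtByJobId, Duration ttl, Instant now) {
        int before = createdAtByJobId.size();
        createdAtByJobId.values().removeIf(createdAt -> createdAt.plus(ttl).isBefore(now));
        return before - createdAtByJobId.size();
    }
}
```

With CallerRunsPolicy, a burst beyond 10 running and 20 queued jobs slows down the submitting thread instead of rejecting the request, which acts as simple back-pressure. Passing `now` into `purgeExpired` keeps the expiry logic deterministic in tests.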

Deployment

Backend — Google Cloud

The backend is designed for deployment on Google Cloud (Cloud Run or GKE).

Set the GEMINI_API_KEY environment variable in your deployment configuration.

Frontend — Vercel

The Angular frontend is deployed separately on Vercel. Ensure CORS origins in WebConfig.java include your production frontend URL.


License

TBD
