A complete GitHub repository auditor that scans entire projects — written in any programming language — and generates a comprehensive, structured audit report powered by Google's Gemini 3 Pro AI model.
- Overview
- Tech Stack
- Architecture
- Project Structure
- Prerequisites
- Setup & Installation
- Configuration
- Running the Application
- API Documentation
- Audit Report Schema
- Error Handling
- Audit Pipeline Stages
- File Filtering
- Cleanup & Job Lifecycle
- Deployment
GitHub Deep-Audit accepts a public GitHub repository URL and an optional project description, then:
- Clones the repository (shallow clone, depth=1)
- Reads all eligible source files, filtering out binaries, lock files, and build artifacts
- Validates whether the project has a sufficient description (via AI)
- Audits the entire codebase for security vulnerabilities, API key leaks, backdoors, architecture issues, code quality problems, and more (via AI)
- Validates the audit report structure for correctness
- Returns a structured JSON audit report with an overall score, categorized issues, architecture review, and prevention guide
The entire pipeline runs asynchronously. The client submits an audit request, receives a job ID, and polls for results.
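The submit-then-poll flow can be sketched from the client side with the JDK's built-in `java.net.http.HttpClient`. This is only an illustration: the class name `AuditPollingSketch`, the regex-based field extraction (used here to avoid a JSON dependency), and the 3-second polling interval are assumptions, not part of the project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AuditPollingSketch {
    // Matches string-valued fields only; error bodies carry a numeric "status" and will not match.
    static final Pattern STATUS = Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");
    static final Pattern JOB_ID = Pattern.compile("\"jobId\"\\s*:\\s*\"([^\"]+)\"");

    /** Returns the first captured group of the pattern, or fails if the field is absent. */
    static String extract(Pattern pattern, String json) {
        Matcher m = pattern.matcher(json);
        if (!m.find()) {
            throw new IllegalStateException("Field not found in response: " + json);
        }
        return m.group(1);
    }

    /** COMPLETED and FAILED are the two stages after which polling can stop. */
    static boolean isTerminal(String status) {
        return status.equals("COMPLETED") || status.equals("FAILED");
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: AuditPollingSketch <public-github-repo-url>");
            return;
        }
        String base = "http://localhost:8080"; // default port from application.yaml
        HttpClient http = HttpClient.newHttpClient();

        // Submit the audit; the URL is assumed to contain no characters that need JSON escaping.
        String body = "{\"repoUrl\":\"" + args[0] + "\"}";
        HttpRequest submit = HttpRequest.newBuilder(URI.create(base + "/api/v1/audit"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        String jobId = extract(JOB_ID, http.send(submit, HttpResponse.BodyHandlers.ofString()).body());

        // Poll until the job reaches a terminal stage.
        while (true) {
            HttpRequest poll = HttpRequest.newBuilder(URI.create(base + "/api/v1/audit/" + jobId))
                    .GET()
                    .build();
            String json = http.send(poll, HttpResponse.BodyHandlers.ofString()).body();
            String status = extract(STATUS, json);
            System.out.println("status = " + status);
            if (isTerminal(status)) {
                break;
            }
            Thread.sleep(3000); // arbitrary interval chosen for this sketch
        }
    }
}
```

A real client would likely use a JSON library and cap the number of polls; both are left out to keep the sketch dependency-free.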
| Layer | Technology |
|---|---|
| Backend | Spring Boot 4.0.2, Java 25 |
| AI | Google Gemini 3 Pro via LangChain4j 1.11.0 |
| Git | Eclipse JGit 7.1.0 |
| Build | Maven (wrapper included) |
| Frontend | Angular 20, Tailwind V4 (separate repository) |
Frontend (Angular) ⇔ Backend (Spring Boot) ⇔ Gemini 3 Pro API
The backend uses multi-agent AI orchestration with three sequential agents:
| Agent | Input | Purpose |
|---|---|---|
| DescriptionValidatorAgent | README + project description | Checks if the project is well-described enough for a meaningful audit |
| AuditAgent | Full project files + README + description | Performs the deep code audit and generates the structured report |
| StructureValidatorAgent | Raw JSON from AuditAgent | Validates JSON structure; if invalid, asks Gemini to fix it |
Key design: Only the AuditAgent receives the full source code. The DescriptionValidatorAgent only sees the README + description. The StructureValidatorAgent only sees the JSON output.
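The input scoping described above can be illustrated with plain functional interfaces. This is a hypothetical sketch, not the project's code: `AgentPipelineSketch`, `ProjectSnapshot`, `DescriptionInput`, and the stub agents in `main` are invented for illustration, and the real agents call Gemini through LangChain4j rather than local lambdas.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

final class AgentPipelineSketch {
    /** Everything read from the cloned repository, plus the user's description. */
    record ProjectSnapshot(String readme, String description, Map<String, String> files) {}

    /** The narrower view handed to agent 1: README and description only, no source files. */
    record DescriptionInput(String readme, String description) {}

    static String run(ProjectSnapshot snapshot,
                      Predicate<DescriptionInput> descriptionValidator, // agent 1
                      Function<ProjectSnapshot, String> auditor,        // agent 2
                      Predicate<String> isWellFormed,                   // agent 3: structure check
                      Function<String, String> repair) {                // agent 3: repair fallback
        DescriptionInput scoped = new DescriptionInput(snapshot.readme(), snapshot.description());
        if (!descriptionValidator.test(scoped)) {
            throw new IllegalStateException("Insufficient description");
        }
        // Only the auditor receives the full snapshot, including source files.
        String rawJson = auditor.apply(snapshot);
        // The structure validator sees nothing but the auditor's raw output.
        return isWellFormed.test(rawJson) ? rawJson : repair.apply(rawJson);
    }

    public static void main(String[] args) {
        ProjectSnapshot snapshot = new ProjectSnapshot(
                "# Demo", "A small demo service", Map.of("Main.java", "class Main {}"));
        String report = run(
                snapshot,
                input -> !input.description().isBlank(),   // stub for agent 1
                project -> "{\"overallScore\": 72",        // stub for agent 2, deliberately truncated
                json -> json.trim().endsWith("}"),         // stub structure check
                json -> json + "}");                       // stub repair step
        System.out.println(report);
    }
}
```

Passing each agent a dedicated input type makes the scoping visible in the signatures: the description validator cannot reach the source files because its input type does not contain them.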
POST /api/v1/audit
│
▼
AuditJobManager.create() → QUEUED
│
▼ (@Async on thread pool)
GitService.validateRepo() → CLONING
GitService.cloneRepo()
│
▼
FileReaderService.readProject() → READING
│
▼
DescriptionValidatorAgent.validate() → VALIDATING_DESCRIPTION
│
▼
AuditAgent.audit() → AUDITING
│
▼
StructureValidatorAgent.validateAndParse() → VALIDATING_STRUCTURE
│
▼
AuditJobManager.complete(report) → COMPLETED
The client polls GET /api/v1/audit/{jobId} to track progress through each stage.
src/main/java/com/junaidsultan/github_deep_audit/
├── GitHubDeepAuditApplication.java  # Entry point (@EnableAsync, @EnableScheduling)
│
├── config/
│   ├── GeminiConfig.java  # ChatModel bean (Gemini 3 Pro via LangChain4j)
│   ├── GeminiProperties.java  # Gemini config properties (api-key, model, temp)
│   ├── AuditProperties.java  # Audit config properties (max size, temp dir, TTL)
│   ├── AsyncConfig.java  # Thread pool executor for async audit jobs
│   └── WebConfig.java  # CORS configuration + GitHub RestClient bean
│
├── controller/
│   └── AuditController.java  # REST API: POST /audit, GET /audit/{jobId}
│
├── service/
│   ├── AuditOrchestratorService.java  # @Async orchestrator: runs the full pipeline
│   ├── GitService.java  # Validates repo (GitHub API) + shallow clones (JGit)
│   ├── FileReaderService.java  # Walks repo tree, reads files, finds README
│   ├── AuditJobManager.java  # In-memory job store (ConcurrentHashMap)
│   └── agent/
│       ├── DescriptionValidatorAgent.java  # Agent 1: validates project description
│       ├── AuditAgent.java  # Agent 2: deep audit of full codebase
│       └── StructureValidatorAgent.java  # Agent 3: validates/fixes JSON structure
│
├── model/
│   ├── request/
│   │   └── AuditRequest.java  # { repoUrl, projectDescription }
│   ├── response/
│   │   ├── AuditReport.java  # Full audit report
│   │   ├── IssueSummary.java  # Issue counts by severity
│   │   ├── AuditIssue.java  # Individual issue details
│   │   ├── ArchitectureReview.java  # Architecture strengths/weaknesses
│   │   ├── PreventionGuide.java  # Prevention recommendations
│   │   ├── AuditJobResponse.java  # Polling response (status + report)
│   │   └── ErrorResponse.java  # Error response format
│   ├── job/
│   │   └── AuditJob.java  # Job state object (thread-safe)
│   ├── agent/
│   │   └── DescriptionValidation.java  # Agent 1 response { wellDescribed, reason }
│   └── enums/
│       ├── AuditStage.java  # QUEUED → CLONING → ... → COMPLETED/FAILED
│       ├── IssueSeverity.java  # CRITICAL, HIGH, MEDIUM, LOW
│       └── IssueCategory.java  # SECURITY, VULNERABILITY, API_LEAK, etc.
│
├── exception/
│   ├── GlobalExceptionHandler.java  # @RestControllerAdvice: maps exceptions to HTTP
│   ├── RepositoryTooLargeException.java
│   ├── RepositoryCloneException.java
│   ├── InsufficientDescriptionException.java
│   ├── AuditJobNotFoundException.java
│   └── AuditException.java
│
├── cleanup/
│   └── CleanupScheduler.java  # @Scheduled: purges expired jobs + temp dirs
│
└── util/
    ├── FileFilterUtil.java  # Filters out binaries, lock files, build dirs
    └── GitHubUrlParser.java  # Parses owner/repo from GitHub URLs
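The tree lists `GitHubUrlParser.java` as the utility that extracts owner and repository from a GitHub URL. A minimal sketch of that idea follows; the class name `GitHubUrlSketch`, the `RepoRef` record, and the exact regular expression are assumptions made for illustration and may be stricter or looser than the project's real parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GitHubUrlSketch {
    /** Owner and repository name extracted from a URL. */
    record RepoRef(String owner, String repo) {}

    // https://github.com/<owner>/<repo>, optionally followed by ".git" and/or a trailing slash.
    // Deeper paths such as /tree/main are rejected because the repo group cannot contain "/".
    private static final Pattern REPO_URL = Pattern.compile(
            "^https://github\\.com/([A-Za-z0-9][A-Za-z0-9-]*)/([A-Za-z0-9._-]+?)(?:\\.git)?/?$");

    static Optional<RepoRef> parse(String url) {
        if (url == null) {
            return Optional.empty();
        }
        Matcher m = REPO_URL.matcher(url.trim());
        return m.matches()
                ? Optional.of(new RepoRef(m.group(1), m.group(2)))
                : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("https://github.com/kelseyhightower/nocode"));
        System.out.println(parse("https://example.com/not/github"));
    }
}
```

Returning `Optional` keeps the caller in charge of how to report an invalid URL, for example as the validation error shown in the API documentation.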
- Java 25 (with preview features)
- Gemini API Key from Google AI Studio
- Internet access (to hit GitHub API and clone repos)
Maven is not required globally — the project includes a Maven wrapper (mvnw.cmd / mvnw).
- Clone the project
  git clone https://github.com/your-username/github-deep-audit.git
  cd github-deep-audit
- Set the Gemini API key as an environment variable
  Windows (CMD):
  set GEMINI_API_KEY=your-gemini-api-key-here
  Windows (PowerShell):
  $env:GEMINI_API_KEY = "your-gemini-api-key-here"
  Linux/macOS:
  export GEMINI_API_KEY=your-gemini-api-key-here
- Build the project
  Windows:
  mvnw.cmd compile
  Linux/macOS:
  ./mvnw compile
All configuration is in src/main/resources/application.yaml:
gemini:
  api-key: ${GEMINI_API_KEY}  # Required: set via environment variable
  model: gemini-3-pro  # Gemini model to use
  temperature: 0.2  # Lower = more deterministic
  max-output-tokens: 65536  # Max tokens in Gemini response
audit:
  max-repo-size-mb: 50  # Reject repos larger than this
  temp-dir: ${java.io.tmpdir}/github-deep-audit  # Where repos are cloned
  job-ttl-minutes: 30  # Jobs older than this are cleaned up
  cleanup-interval-minutes: 5  # How often cleanup runs
server:
  port: 8080
Start the application with the Maven wrapper (on Linux/macOS, use ./mvnw instead of mvnw.cmd):
mvnw.cmd spring-boot:run
The server starts on http://localhost:8080. You'll see:
Started GitHubDeepAuditApplication in X.XXX seconds
Base URL: http://localhost:8080
POST /api/v1/audit
Submits a GitHub repository for auditing. The audit runs asynchronously — this endpoint returns immediately with a job ID.
| Field | Type | Required | Description |
|---|---|---|---|
| repoUrl | string | Yes | Public GitHub repository URL (e.g., https://github.com/owner/repo) |
| projectDescription | string | No | Optional description of the project to help the AI understand its purpose |
{
"repoUrl": "https://github.com/owner/repo",
"projectDescription": "A REST API for managing user authentication using JWT tokens"
}
Example response (job queued):
{
"jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"status": "QUEUED",
"message": "Audit job created and queued for processing",
"report": null
}
Example validation error response:
{
"status": 400,
"error": "Validation Error",
"message": "repoUrl: Must be a valid GitHub repository URL (e.g., https://github.com/owner/repo)",
"timestamp": "2026-02-08T18:00:00Z"
}
Example (curl):
curl -X POST http://localhost:8080/api/v1/audit \
  -H "Content-Type: application/json" \
  -d '{
    "repoUrl": "https://github.com/kelseyhightower/nocode",
    "projectDescription": "No code application"
  }'
Example (PowerShell):
Invoke-RestMethod -Method POST `
  -Uri "http://localhost:8080/api/v1/audit" `
  -ContentType "application/json" `
  -Body '{"repoUrl": "https://github.com/kelseyhightower/nocode", "projectDescription": "No code application"}'
GET /api/v1/audit/{jobId}
Polls the current status of an audit job. Returns the full audit report when the job completes.
| Parameter | Type | Description |
|---|---|---|
| jobId | UUID | The job ID returned by the POST endpoint |
{
"jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"status": "AUDITING",
"message": "Performing deep audit analysis...",
"report": null
}
The status field progresses through these stages:
| Status | Message | Meaning |
|---|---|---|
| QUEUED | Audit job is queued for processing | Waiting for a thread |
| CLONING | Cloning repository... | Validating repo metadata + cloning |
| READING | Reading project files... | Walking file tree + reading source files |
| VALIDATING_DESCRIPTION | Validating project description... | AI checking if description is sufficient |
| AUDITING | Performing deep audit analysis... | AI performing the deep code audit |
| VALIDATING_STRUCTURE | Validating audit report structure... | Verifying JSON structure of the report |
| COMPLETED | Audit completed successfully | Done; report is included |
| FAILED | (dynamic error message) | Something went wrong |
When status is COMPLETED, the report field contains the full audit report:
{
"jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"status": "COMPLETED",
"message": "Audit completed successfully",
"report": {
"repositoryUrl": "https://github.com/owner/repo",
"auditTimestamp": "2026-02-08T18:05:30Z",
"projectSummary": "A REST API for user authentication...",
"overallScore": 72,
"issueSummary": {
"totalIssues": 8,
"critical": 1,
"high": 2,
"medium": 3,
"low": 2
},
"issues": [
{
"id": "ISSUE-001",
"category": "API_LEAK",
"severity": "CRITICAL",
"title": "Hardcoded API Key in Configuration",
"description": "The file contains a hardcoded API key that is committed to version control...",
"filePath": "src/config/api.js",
"lineNumbers": "12-14",
"codeSnippet": "const API_KEY = 'sk-abc123...'",
"recommendation": "Move the API key to an environment variable and use a .env file...",
"impact": "Anyone with access to the repository can extract the API key and abuse the service..."
}
],
"architectureReview": {
"summary": "The project follows a basic MVC pattern with some deviations...",
"strengths": [
"Clear separation of routes and controllers",
"Consistent error handling middleware"
],
"weaknesses": [
"No dependency injection — tightly coupled modules",
"Business logic mixed into controller layer"
]
},
"preventionGuide": {
"summary": "Most issues stem from inadequate secret management and missing input validation...",
"recommendations": [
"Use environment variables for all secrets and API keys",
"Implement input validation on all user-facing endpoints",
"Add a pre-commit hook to scan for leaked secrets"
]
}
}
}
Example response for a failed job:
{
"jobId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"status": "FAILED",
"message": "Repository size (120 MB) exceeds the maximum allowed size (50 MB)",
"report": null
}
Example error response when the job ID is unknown:
{
"status": 404,
"error": "Job Not Found",
"message": "Audit job not found: a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"timestamp": "2026-02-08T18:00:00Z"
}
curl http://localhost:8080/api/v1/audit/a1b2c3d4-e5f6-7890-abcd-ef1234567890
The complete audit report JSON schema used across frontend and backend:
{
"repositoryUrl": "string",
"auditTimestamp": "string (ISO-8601)",
"projectSummary": "string",
"overallScore": "integer (0-100)",
"issueSummary": {
"totalIssues": "integer",
"critical": "integer",
"high": "integer",
"medium": "integer",
"low": "integer"
},
"issues": [
{
"id": "string (ISSUE-001 format)",
"category": "SECURITY | VULNERABILITY | API_LEAK | BACKDOOR | ERROR | WARNING | ARCHITECTURE | PERFORMANCE | CODE_QUALITY | BEST_PRACTICE",
"severity": "CRITICAL | HIGH | MEDIUM | LOW",
"title": "string",
"description": "string",
"filePath": "string",
"lineNumbers": "string (e.g., '12-14')",
"codeSnippet": "string",
"recommendation": "string",
"impact": "string"
}
],
"architectureReview": {
"summary": "string",
"strengths": ["string"],
"weaknesses": ["string"]
},
"preventionGuide": {
"summary": "string",
"recommendations": ["string"]
}
}
| Category | Description |
|---|---|
| SECURITY | General security issues (XSS, CSRF, SQL injection) |
| VULNERABILITY | Known vulnerability patterns |
| API_LEAK | Hardcoded API keys, secrets, credentials |
| BACKDOOR | Suspicious code that could be a backdoor |
| ERROR | Bugs and runtime errors |
| WARNING | Code smells and minor issues |
| ARCHITECTURE | Structural/design pattern issues |
| PERFORMANCE | Performance bottlenecks |
| CODE_QUALITY | Readability, maintainability issues |
| BEST_PRACTICE | Deviations from best practices |
| Severity | Description |
|---|---|
| CRITICAL | Must fix immediately: security breach, data loss risk |
| HIGH | Should fix soon: significant bugs or vulnerabilities |
| MEDIUM | Fix when possible: code quality, minor security concerns |
| LOW | Nice to have: style, best practice suggestions |
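Because the `issueSummary` counts and the `issues` array are both produced by the model, a consumer may want to recompute the counts locally and compare. The sketch below shows one way to derive the summary from the issue list; the trimmed-down `AuditIssue` record and the class name `IssueSummarySketch` are illustrative assumptions rather than the project's model classes.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class IssueSummarySketch {
    enum IssueSeverity { CRITICAL, HIGH, MEDIUM, LOW }

    /** Reduced to the fields this sketch needs; the documented schema has more. */
    record AuditIssue(String id, IssueSeverity severity, String title) {}

    /** Mirrors the issueSummary object of the report schema. */
    record IssueSummary(int totalIssues, int critical, int high, int medium, int low) {}

    static IssueSummary summarize(List<AuditIssue> issues) {
        Map<IssueSeverity, Integer> counts = new EnumMap<>(IssueSeverity.class);
        for (AuditIssue issue : issues) {
            counts.merge(issue.severity(), 1, Integer::sum);
        }
        return new IssueSummary(
                issues.size(),
                counts.getOrDefault(IssueSeverity.CRITICAL, 0),
                counts.getOrDefault(IssueSeverity.HIGH, 0),
                counts.getOrDefault(IssueSeverity.MEDIUM, 0),
                counts.getOrDefault(IssueSeverity.LOW, 0));
    }

    public static void main(String[] args) {
        List<AuditIssue> issues = List.of(
                new AuditIssue("ISSUE-001", IssueSeverity.CRITICAL, "Hardcoded API Key in Configuration"),
                new AuditIssue("ISSUE-002", IssueSeverity.HIGH, "Missing input validation"));
        System.out.println(summarize(issues));
    }
}
```

Comparing the recomputed summary with the one in the report is a cheap consistency check that does not require another model call.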
All errors follow a consistent format:
{
"status": 400,
"error": "Error Type",
"message": "Human-readable error message",
"timestamp": "2026-02-08T18:00:00Z"
}
| HTTP Status | Error | Cause |
|---|---|---|
| 400 | Validation Error | Invalid request body (bad URL format, missing fields) |
| 400 | Repository Too Large | Repository exceeds the 50MB size limit |
| 400 | Clone Failed | Repository doesn't exist, is private, or can't be cloned |
| 400 | Insufficient Description | Project lacks adequate description for a meaningful audit |
| 404 | Job Not Found | The polled job ID doesn't exist |
| 422 | (varies) | Audit job failed during execution (returned via polling) |
| 500 | Audit Error | Gemini API failure or internal processing error |
| 500 | Internal Server Error | Unexpected server error |
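The status and error pairs in the table, together with the common error format, can be modelled as an enum plus a small record. This is a dependency-free sketch, not the project's `GlobalExceptionHandler`: the names `ErrorSketch` and `ErrorKind` are hypothetical, the 422 row is omitted because its error label varies, and the hand-rolled JSON escaping only covers quotes and backslashes.

```java
import java.time.Instant;

final class ErrorSketch {
    /** Same four fields as the documented error format. */
    record ErrorResponse(int status, String error, String message, Instant timestamp) {
        String toJson() {
            return "{\"status\":" + status
                    + ",\"error\":\"" + escape(error) + "\""
                    + ",\"message\":\"" + escape(message) + "\""
                    + ",\"timestamp\":\"" + timestamp + "\"}";
        }

        // Minimal escaping for this sketch; control characters are not handled.
        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }

    /** Status and label pairs taken from the error table. */
    enum ErrorKind {
        VALIDATION(400, "Validation Error"),
        REPOSITORY_TOO_LARGE(400, "Repository Too Large"),
        CLONE_FAILED(400, "Clone Failed"),
        INSUFFICIENT_DESCRIPTION(400, "Insufficient Description"),
        JOB_NOT_FOUND(404, "Job Not Found"),
        AUDIT_ERROR(500, "Audit Error"),
        INTERNAL(500, "Internal Server Error");

        final int status;
        final String label;

        ErrorKind(int status, String label) {
            this.status = status;
            this.label = label;
        }

        ErrorResponse toResponse(String message, Instant timestamp) {
            return new ErrorResponse(status, label, message, timestamp);
        }
    }

    public static void main(String[] args) {
        ErrorResponse response = ErrorKind.JOB_NOT_FOUND.toResponse(
                "Audit job not found: a1b2c3d4-e5f6-7890-abcd-ef1234567890", Instant.now());
        System.out.println(response.toJson());
    }
}
```

In the actual Spring application a JSON serializer would handle the body; the string building here only makes the shape of the payload explicit.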
Each audit job progresses through these stages:
QUEUED → CLONING → READING → VALIDATING_DESCRIPTION → AUDITING → VALIDATING_STRUCTURE → COMPLETED
↓
(any stage can fail) ─────────────────→ FAILED
| Stage | What Happens |
|---|---|
| QUEUED | Job created, waiting for an available thread in the pool |
| CLONING | Calls GitHub API to verify repo is public and under 50MB, then shallow-clones via JGit |
| READING | Walks the cloned repo's file tree, filters out non-source files, reads content into memory |
| VALIDATING_DESCRIPTION | Sends README + project description to Gemini to check if it's sufficient for an audit |
| AUDITING | Sends all project files + README + description to Gemini for deep analysis |
| VALIDATING_STRUCTURE | Attempts to parse the audit JSON; if parsing fails, asks Gemini to fix the structure |
| COMPLETED | Audit report is ready and available via the polling endpoint |
| FAILED | Something went wrong — error message describes the cause |
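The stage progression can be expressed as a small state machine: each stage advances only to the next one in order, any non-terminal stage may fail, and terminal stages never change. The enum below is a sketch under that reading of the diagram; the method names `isTerminal`, `canTransitionTo`, and `advance` are assumptions and may not match the project's `AuditStage`.

```java
final class AuditStageSketch {
    enum AuditStage {
        QUEUED, CLONING, READING, VALIDATING_DESCRIPTION, AUDITING, VALIDATING_STRUCTURE, COMPLETED, FAILED;

        boolean isTerminal() {
            return this == COMPLETED || this == FAILED;
        }

        /** Forward by exactly one step, or to FAILED from any stage that is still running. */
        boolean canTransitionTo(AuditStage next) {
            if (isTerminal()) {
                return false;
            }
            if (next == FAILED) {
                return true;
            }
            return next.ordinal() == ordinal() + 1;
        }
    }

    /** Returns the next stage, or throws if the transition is not allowed. */
    static AuditStage advance(AuditStage current, AuditStage next) {
        if (!current.canTransitionTo(next)) {
            throw new IllegalStateException("Illegal transition: " + current + " -> " + next);
        }
        return next;
    }

    public static void main(String[] args) {
        AuditStage stage = AuditStage.QUEUED;
        stage = advance(stage, AuditStage.CLONING);
        stage = advance(stage, AuditStage.FAILED); // e.g. the repository exceeded the size limit
        System.out.println(stage + " terminal=" + stage.isTerminal());
    }
}
```

The ordinal-based check relies on the enum constants being declared in pipeline order, with COMPLETED directly after VALIDATING_STRUCTURE.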
When reading repository files, the following are excluded:
Directories: .git, node_modules, vendor, venv, target, build, dist, out, .idea, .vscode, .next, coverage, and more.
File extensions: .class, .jar, .exe, .png, .jpg, .mp3, .mp4, .zip, .pdf, .woff, .min.js, .min.css, .map, and more.
Specific files: package-lock.json, yarn.lock, pnpm-lock.yaml, .DS_Store
Per-file size limit: Files larger than 512KB are skipped.
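The exclusion rules above translate naturally into a single predicate over a file's relative path and size. The following is a rough sketch, assuming only the directories, extensions, and file names explicitly listed here (the "and more" entries are unknown and therefore omitted); `FileFilterSketch` and `isEligible` are hypothetical names, not the API of `FileFilterUtil`.

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

final class FileFilterSketch {
    static final long MAX_FILE_BYTES = 512 * 1024; // 512KB per-file limit

    static final Set<String> EXCLUDED_DIRS = Set.of(
            ".git", "node_modules", "vendor", "venv", "target", "build",
            "dist", "out", ".idea", ".vscode", ".next", "coverage");

    static final Set<String> EXCLUDED_SUFFIXES = Set.of(
            ".class", ".jar", ".exe", ".png", ".jpg", ".mp3", ".mp4",
            ".zip", ".pdf", ".woff", ".min.js", ".min.css", ".map");

    static final Set<String> EXCLUDED_NAMES = Set.of(
            "package-lock.json", "yarn.lock", "pnpm-lock.yaml", ".DS_Store");

    /** Returns true if a file at this repository-relative path should be read. */
    static boolean isEligible(Path relativePath, long sizeBytes) {
        if (sizeBytes > MAX_FILE_BYTES) {
            return false;
        }
        // Reject the file if any parent directory segment is excluded.
        Path parent = relativePath.getParent();
        if (parent != null) {
            for (Path segment : parent) {
                if (EXCLUDED_DIRS.contains(segment.toString())) {
                    return false;
                }
            }
        }
        String name = relativePath.getFileName().toString();
        if (EXCLUDED_NAMES.contains(name)) {
            return false;
        }
        // Suffix match rather than "last extension" so that .min.js is caught.
        String lower = name.toLowerCase(Locale.ROOT);
        return EXCLUDED_SUFFIXES.stream().noneMatch(lower::endsWith);
    }

    public static void main(String[] args) {
        System.out.println(isEligible(Path.of("src/main/App.java"), 2_048));
        System.out.println(isEligible(Path.of("node_modules/lib/index.js"), 2_048));
    }
}
```

Checking directory segments before reading anything lets a tree walker skip whole subtrees such as `node_modules` early.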
- Job TTL: Audit jobs are kept in memory for 30 minutes (configurable).
- Cleanup scheduler: Runs every 5 minutes, removes expired jobs and deletes their cloned repository directories from the temp folder.
- Thread pool: 5 core threads, max 10, queue capacity of 20. Uses CallerRunsPolicy when saturated.
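The lifecycle settings above can be approximated with plain `java.util.concurrent` classes. This sketch is an assumption-laden stand-in: the project configures its pool in `AsyncConfig.java` (presumably through Spring), whereas the code below builds a raw `ThreadPoolExecutor`; the 60-second keep-alive, the `ArrayBlockingQueue`, and the map of job IDs to creation times are choices made for illustration. Deleting the cloned directories is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class JobLifecycleSketch {
    /** 5 core threads, 10 max, 20 queued tasks; when saturated the submitting thread runs the task. */
    static ThreadPoolExecutor newAuditExecutor() {
        return new ThreadPoolExecutor(
                5, 10,
                60, TimeUnit.SECONDS, // keep-alive for idle non-core threads (assumed value)
                new ArrayBlockingQueue<>(20),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /**
     * Removes jobs created before (now - ttl) and returns how many were removed.
     * The count is only exact if no other thread modifies the map during the call.
     */
    static int purgeExpired(Map<String, Instant> createdAtByJobId, Duration ttl, Instant now) {
        Instant cutoff = now.minus(ttl);
        int before = createdAtByJobId.size();
        createdAtByJobId.values().removeIf(createdAt -> createdAt.isBefore(cutoff));
        return before - createdAtByJobId.size();
    }

    public static void main(String[] args) {
        Map<String, Instant> jobs = new ConcurrentHashMap<>();
        Instant now = Instant.now();
        jobs.put("old-job", now.minus(Duration.ofMinutes(31)));
        jobs.put("fresh-job", now.minus(Duration.ofMinutes(5)));
        System.out.println("purged " + purgeExpired(jobs, Duration.ofMinutes(30), now) + ", remaining " + jobs.keySet());

        ThreadPoolExecutor executor = newAuditExecutor();
        executor.submit(() -> System.out.println("audit job running on " + Thread.currentThread().getName()));
        executor.shutdown();
    }
}
```

With `CallerRunsPolicy`, a burst beyond 10 running and 20 queued jobs slows down the submitting thread instead of rejecting the request, which acts as simple back-pressure.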
The backend is designed for deployment on Google Cloud (Cloud Run or GKE).
Set the GEMINI_API_KEY environment variable in your deployment configuration.
The Angular frontend is deployed separately on Vercel. Ensure CORS origins in WebConfig.java include your production frontend URL.
TBD