Devpost Submission: Compatibillabuddy


Inspiration

Every ML engineer has lived this nightmare:

$ pip install torch numpy pandas scikit-learn
✅ Successfully installed ...

$ python -c "import torch; print(torch.cuda.is_available())"
❌ False — wrong CUDA version for your GPU driver

$ python -c "import sklearn"
❌ ImportError: numpy ABI incompatibility

Traditional dependency resolvers (pip, uv, poetry) solve version constraints — but ML environments fail for entirely different reasons: hardware mismatches, ABI breaks, and runtime incompatibilities that metadata alone can't capture. Your PyTorch might install fine but silently fall back to CPU because your CUDA toolkit doesn't match your GPU driver. Your scikit-learn might crash on import because it was compiled against a different NumPy ABI.

These failures waste millions of developer hours across the ML ecosystem. We've personally spent entire days debugging "why doesn't my GPU work" only to discover a single version mismatch buried in a stack of 500+ packages.

We asked: What if hardware were treated as a first-class dependency? And what if an autonomous AI agent could not only diagnose these issues but actually fix them — with verification, rollback, and self-correction?

That's Compatibillabuddy.


What it does

Compatibillabuddy is a hardware-aware dependency compatibility framework for Python ML/AI environments with an autonomous Gemini-powered repair agent.

Three Modes of Operation

1. Offline Diagnosis (No AI Required)

pip install compatibillabuddy
compatibuddy doctor

Probes your hardware (GPU, CUDA driver, CPU, OS), inspects all installed packages, and evaluates a curated knowledge base of known-bad combinations. Produces a Rich-formatted terminal report or machine-readable JSON. Zero network calls, zero API keys — works anywhere.
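To make the idea concrete, here is a minimal sketch of how a knowledge-base rule could be evaluated against hardware facts. The rule fields and helper names are illustrative, not Compatibillabuddy's actual schema, and the naive tuple compare stands in for the real PEP 440 matching done via the `packaging` library:

```python
# Illustrative rule in the spirit of a TOML rulepack entry (field names hypothetical).
RULE = {
    "package": "torch",
    "bad_below": "2.0",     # package versions below this are suspect...
    "needs_cuda": "12.1",   # ...unless the CUDA driver is at least this
    "severity": "WARNING",
    "message": "torch < 2.0 needs CUDA >= 12.1 to use this GPU",
}

def _ver(s):
    """Naive version parse; the real engine uses PEP 440 specifiers."""
    return tuple(int(p) for p in s.split("."))

def evaluate_rule(rule, installed, cuda):
    """Return (severity, message) if the package + hardware combo matches the rule."""
    if _ver(installed) >= _ver(rule["bad_below"]):
        return None  # package version is fine regardless of hardware
    if cuda is not None and _ver(cuda) >= _ver(rule["needs_cuda"]):
        return None  # hardware satisfies the constraint
    return (rule["severity"], rule["message"])
```

The key point is that the hardware probe's output participates in the match, not just package versions.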

2. Autonomous Repair (Gemini Agent)

pip install "compatibillabuddy[agent]"
export GEMINI_API_KEY="your-key-here"     # Linux/Mac
$env:GEMINI_API_KEY = "your-key-here"     # PowerShell

compatibuddy repair          # Dry-run: shows plan without executing
compatibuddy repair --live   # Executes fixes for real

The agent follows a strict protocol: Snapshot → Diagnose → Plan → Fix → Verify → Rollback if needed. It fixes one issue at a time, verifies each fix by re-running the doctor, and automatically rolls back if a fix makes things worse.
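The protocol above can be sketched as a loop with injected callables. These stand-ins mirror the prose, not Compatibillabuddy's real internals — in the actual tool, Gemini drives this cycle through tool calls rather than a hardcoded loop:

```python
def auto_repair(snapshot, diagnose, apply_fix, rollback, max_attempts=3):
    """Sketch of the Snapshot -> Diagnose -> Plan -> Fix -> Verify -> Rollback
    protocol. All four callables are hypothetical stand-ins for illustration."""
    log = []
    for issue in list(diagnose()):
        for attempt in range(max_attempts):
            point = snapshot()            # fresh rollback point before each fix
            before = diagnose()
            apply_fix(issue, attempt)     # one fix at a time
            after = diagnose()            # verify by re-running the doctor
            if len(after) < len(before):
                log.append(("fixed", issue))
                break
            rollback(point)               # fix didn't help: revert, try again
            log.append(("rolled_back", issue))
    return log
```

A fix only "counts" if re-diagnosis shows fewer issues than before; otherwise the environment is restored and an alternative attempt is made.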

3. Interactive Chat

compatibuddy agent

Multi-turn conversation with the agent. Ask it anything about your environment, tell it to investigate specific packages, or guide the repair process manually.

The Autonomous Repair Loop

flowchart TD
    A["🧑 User: compatibuddy repair"] --> B["📸 Snapshot Environment\n(pip freeze as rollback point)"]
    B --> C["🩺 Run Doctor\n(probe hardware + inspect packages + evaluate rules)"]
    C --> D{Issues Found?}
    D -- No --> E["✅ Report: Environment Healthy"]
    D -- Yes --> F["🧠 Gemini Plans Fix Order\n(critical issues first)"]
    F --> G["🔧 Execute Fix\n(pip install/uninstall with guardrails)"]
    G --> H["🔍 Verify Fix\n(re-run doctor, compare before/after)"]
    H --> I{Improved?}
    I -- Yes --> J{More Issues?}
    J -- Yes --> G
    J -- No --> K["✅ Report: All Fixed + Changelog"]
    I -- No --> L["⏪ Rollback to Snapshot"]
    L --> M["🔄 Try Alternative Fix"]
    M --> H

The agent doesn't guess — it runs real diagnostics via structured tools, executes real pip commands with safety guardrails, and verifies every fix before moving on. If a fix makes things worse, it rolls back automatically and tries an alternative approach.

Safety Guardrails

The repair agent operates under strict constraints to prevent damage:

  • Virtual environment isolation — refuses to modify system Python
  • Snapshot before every change — full rollback capability at any point
  • Dry-run by default — shows what it would do without executing
  • Protected package blocklist — never uninstalls pip, setuptools, wheel, or itself
  • Operation limit — stops after 10 pip commands per session
  • Only pip install/uninstall — no arbitrary shell commands, no rm -rf, no wget
  • Automatic rollback — if verification shows new problems, reverts immediately
  • Exponential backoff — graceful handling of API rate limits (1s → 2s → 4s → 8s → 16s)
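A guardrail layer like this can be sketched as a single pre-flight check. Function and constant names here are illustrative, not the project's real API:

```python
PROTECTED = {"pip", "setuptools", "wheel", "compatibillabuddy"}
MAX_OPS = 10

def check_pip_command(args, ops_so_far, in_venv=True, dry_run=True):
    """Hypothetical pre-flight check mirroring the guardrails listed above.

    Returns (allowed, reason). Dry-run is the default, so live execution
    requires an explicit opt-in."""
    if not in_venv:
        return (False, "refusing to modify system Python (no virtualenv active)")
    if ops_so_far >= MAX_OPS:
        return (False, "operation limit reached (%d pip commands per session)" % MAX_OPS)
    if args[0] not in {"install", "uninstall"}:
        return (False, "only pip install/uninstall are permitted")
    if args[0] == "uninstall" and any(pkg in PROTECTED for pkg in args[1:]):
        return (False, "protected package blocklist")
    if dry_run:
        return (False, "[dry run] would execute: pip " + " ".join(args))
    return (True, "pip " + " ".join(args))
```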

9 Structured Agent Tools

The Gemini agent has access to exactly 9 tools — not arbitrary shell access:

  • tool_probe_hardware: Detect OS, CPU, GPU, CUDA version via nvidia-smi
  • tool_inspect_environment: List all installed Python packages with versions
  • tool_run_doctor: Full compatibility diagnosis against knowledge base
  • tool_explain_issue: Detailed explanation of a specific diagnosed issue
  • tool_search_rules: Search knowledge base for rules about a package
  • tool_snapshot_environment: Capture pip freeze as timestamped rollback point
  • tool_run_pip: Execute pip install/uninstall with safety guardrails
  • tool_verify_fix: Re-diagnose and compare before/after issue counts
  • tool_rollback: Restore all packages to a previous snapshot

How we built it

Architecture

graph TB
    subgraph CLI["CLI Layer (Typer + Rich)"]
        DOC["compatibuddy doctor"]
        REP["compatibuddy repair"]
        AGT["compatibuddy agent"]
    end

    subgraph AGENT["Agent Layer (Gemini API)"]
        CORE["AgentSession\n(manual tool dispatch loop)"]
        RETRY["_send_with_retry()\n(exponential backoff)"]
        TOOLS["9 Structured Tools"]
        CONFIG["AgentConfig\n(API key, model, retry settings)"]
    end

    subgraph ENGINE["Engine Layer"]
        DOCTOR["diagnose()\n(orchestrator)"]
        RULES["Rule Engine\n(TOML rulepacks)"]
        MODELS["Pydantic v2 Models\n(GPU, packages, issues)"]
        REPORT["Report Formatter\n(Rich console + JSON)"]
    end

    subgraph HW["Hardware Layer"]
        PROBE["probe_hardware()\n(nvidia-smi, platform)"]
        INSPECT["inspect_environment()\n(pip inspect)"]
    end

    DOC --> DOCTOR
    REP --> CORE
    AGT --> CORE
    CORE --> RETRY
    CORE --> TOOLS
    TOOLS --> DOCTOR
    TOOLS --> PROBE
    TOOLS --> INSPECT
    DOCTOR --> PROBE
    DOCTOR --> INSPECT
    DOCTOR --> RULES
    DOCTOR --> REPORT

    style AGENT fill:#e8f4f8,stroke:#2196F3
    style ENGINE fill:#f3e8f4,stroke:#9C27B0
    style HW fill:#e8f4e8,stroke:#4CAF50
    style CLI fill:#fff3e0,stroke:#FF9800

Tech Stack

  • Python 3.10+ — tested on 3.10, 3.11, 3.12, 3.13
  • google-genai SDK — Gemini function calling with manual tool dispatch
  • Pydantic v2 — structured data models with JSON schema export
  • Typer + Rich — beautiful CLI with formatted terminal output
  • packaging — PEP 440 version specifier matching
  • pytest — 276+ unit tests + integration tests
  • ruff — linting and formatting
  • Hatchling — modern Python build backend
  • GitHub Actions — CI/CD on 3 OS × 4 Python versions

Development Methodology

We followed strict TDD (Test-Driven Development) throughout:

  1. Red — Write failing tests first
  2. Green — Implement minimum code to pass
  3. Refactor — Clean up while keeping tests green
  4. Lint — ruff check + format before every commit

Every feature was built tests-first. The repair tools had 13 tests written before a single line of implementation. The retry logic had 11 tests written before _send_with_retry() existed.

Build Timeline (Sprint Phases)

  • Foundation (~4 days): Hardware probing, environment inspection, knowledge base engine, doctor command, Pydantic models, TOML rulepacks, CLI, Rich reports, 230+ tests
  • Phase A, Repair Tools (1.5 days): Snapshot, pip execution with guardrails, verify, rollback (13 tests)
  • Phase B, Autonomous Loop (2 days): auto_repair() method, RepairResult dataclass, event callbacks (11 tests)
  • Phase C, CLI Commands (1 day): compatibuddy agent (interactive) + compatibuddy repair (autonomous) (17 tests)
  • Phase D, Integration Tests (1 day): 5 live Gemini API tests, verified end-to-end
  • Phase E, Hardening (2 days): Retry with exponential backoff, slim tool outputs, malformed response handling (11 tests)
  • Phase F, Demo & Polish (3 days): README rewrite, demo scripts, Devpost submission

Key Design Decisions

1. Manual tool dispatch, not automatic function calling The google-genai SDK supports automatic function calling, but we disabled it. Why? Because automatic mode bypasses our event callback system — the user sees nothing for minutes while tools run silently. Manual dispatch lets us emit progress events (🔧 tool_snapshot_environment()) so the CLI shows real-time feedback.
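A manual dispatch loop of this shape can be sketched with the model call injected as a callable. The `send_to_model` interface and message format here are deliberately simplified stand-ins — the real google-genai response objects are richer than these dicts:

```python
def run_tool_loop(send_to_model, tools, on_event, max_rounds=20):
    """Sketch of a manual function-calling dispatch loop.

    `send_to_model(messages)` stands in for one Gemini API call and returns
    either {"function_call": {"name": ..., "args": ...}} or {"text": ...}."""
    messages = []
    for _ in range(max_rounds):
        reply = send_to_model(messages)
        call = reply.get("function_call")
        if call is None:
            return reply["text"]             # model is done; final answer
        on_event("tool %s()" % call["name"])  # real-time progress for the CLI
        result = tools[call["name"]](**call["args"])
        messages.append({"role": "tool", "name": call["name"], "result": result})
    raise RuntimeError("tool loop exceeded max_rounds")
```

Because the loop owns each round trip, every tool call can be surfaced to the user the instant it happens.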

2. Gemini drives the repair loop, not programmatic orchestration We don't hardcode "run doctor, then fix issue #1, then verify." The agent plans its own repair strategy based on the diagnosis. It decides fix order, chooses between install/uninstall, picks version specifiers, and adapts when things go wrong. This is what makes it a true Marathon Agent — it demonstrates sustained, autonomous reasoning over many tool calls.

3. Hardware as a first-class dependency Traditional resolvers treat packages as nodes in a version graph. We model hardware (GPU vendor, CUDA version, driver version, VRAM, CPU architecture) as constraints that packages must satisfy. This is the fundamental insight: torch==2.6.0+cu124 isn't just a version — it's a statement about what hardware it needs.

4. Slim tool outputs to manage token budget With 575 packages installed, a full model_dump() produced 133K tokens — blowing Gemini's context window. We slim each tool's output to only what the agent needs: hardware summary + issues for doctor, name + version for package inspection. This cut token usage from ~133K to ~2K per tool call.
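The slimming step can be illustrated as a simple projection. The report keys below mimic a full model_dump() but are illustrative, not the project's actual field names:

```python
def slim_doctor_output(report):
    """Reduce a full diagnosis dump to what the agent actually needs:
    hardware summary + trimmed issues, with the package list replaced
    by a count instead of hundreds of entries."""
    return {
        "hardware": report["hardware"],
        "issues": [
            {k: issue[k] for k in ("severity", "package", "message")}
            for issue in report["issues"]
        ],
        "package_count": len(report["packages"]),
    }
```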

5. Dry-run by default everywhere Safety-first design. The compatibuddy repair command, the tool_run_pip() function, and the auto_repair() method all default to dry-run mode. You have to explicitly opt in to live execution with --live or dry_run=False.


How to Install and Try It Yourself

Prerequisites

  • Python 3.10 or higher
  • A Gemini API key (free tier works for the agent features)

Step 1: Install from PyPI

# Core framework only (diagnosis, no AI)
pip install compatibillabuddy

# With Gemini-powered AI agent
pip install "compatibillabuddy[agent]"

Step 2: Set Your API Key

The agent commands (compatibuddy repair and compatibuddy agent) require a Gemini API key. If no key is found, the CLI will display a clear error message telling you how to set one:

Error: No Gemini API key found. Set GEMINI_API_KEY environment variable or pass --api-key.

Set it via environment variable:

# Linux / macOS
export GEMINI_API_KEY="your-key-here"

# Windows PowerShell
$env:GEMINI_API_KEY = "your-key-here"

Or pass it directly:

compatibuddy repair --api-key "your-key-here"
compatibuddy agent --api-key "your-key-here"

Step 3: Run It

# Diagnose your environment (no AI, no API key needed)
compatibuddy doctor

# JSON output for automation
compatibuddy doctor --format json

# Autonomous repair — dry run (see what the agent would do)
compatibuddy repair

# Autonomous repair — live mode (actually execute fixes)
compatibuddy repair --live

# Interactive chat with the agent
compatibuddy agent

What You'll See

Doctor output (no API key needed):

╭──────────── Hardware ─────────────╮
│ OS:     Windows 10.0.26200        │
│ CPU:    Intel Core i9 (AMD64)     │
│ Python: 3.12.2                    │
│ GPU:    NVIDIA RTX 4090 Laptop    │
│         (CUDA 12.7, 16GB VRAM)    │
╰───────────────────────────────────╯
╭──── [WARNING] coinstall_conflict ─╮
│ PyTorch and TensorFlow are both   │
│ installed — they may conflict on  │
│ CUDA runtime libraries            │
╰───────────────────────────────────╯

Repair output (requires API key):

🔧 Compatibillabuddy Repair [DRY RUN]
   Model: gemini-3-flash-preview
   Max retries per issue: 3

   🔧 tool_snapshot_environment()
   🔧 tool_run_doctor()
   🔧 tool_search_rules()

Diagnosis Summary:
1. torchaudio (2.2.1+cu121) is incompatible with torch (2.6.0+cu124)
2. PyTorch and TensorFlow are both installed (CUDA conflict)
3. scikit-learn and pandas deprecation warnings

Action 1: Align Torchaudio with PyTorch...

Challenges we ran into

1. Token Budget Explosion (133K tokens per tool call)

Our biggest technical challenge. With 575 packages installed, tool_run_doctor() returned a full model_dump() with every package's name, version, dependencies, location, and installer. That's 133,000 tokens — well beyond what Gemini can reason about effectively. The model returned MALFORMED_FUNCTION_CALL with parts=None, crashing our tool loop.

Solution: We slimmed every tool output to only what the agent actually needs. Doctor returns hardware summary + issues (not the full package list). Environment inspection returns name + version pairs only. This cut token usage by ~98% while preserving all the information the agent needs for diagnosis and repair.

2. Automatic Function Calling Ate Our Progress Events

The google-genai SDK has automatic function calling enabled by default. It silently calls tools, feeds results back to Gemini, and returns only the final response. This meant our carefully designed event callback system never fired — users saw nothing for 5+ minutes while the agent cycled through snapshot → doctor → search → plan.

Solution: We explicitly disabled automatic function calling with AutomaticFunctionCallingConfig(disable=True) and built our own manual dispatch loop. This gives us full control over the tool-call cycle, letting us emit progress events (🔧 tool_snapshot_environment()) that the CLI displays in real-time.

3. Gemini MALFORMED_FUNCTION_CALL Responses

When tool outputs were too large or serialization produced non-JSON-safe values (like Python enum instances), Gemini sometimes returned responses with parts=None and finish_reason=MALFORMED_FUNCTION_CALL. Our code crashed on response.candidates[0].content.parts[0].

Solution: We added _extract_part() — a defensive helper that safely navigates the response structure and returns None instead of crashing. Both chat() and auto_repair() now handle None parts gracefully with informative messages.

4. Pydantic Enum Serialization

The Severity enum in our models serialized as integer values (e.g., 1 for ERROR) via model_dump(), but Gemini expected string-serializable JSON. The tool_search_rules function crashed with "isinstance() arg 2 must be a type".

Solution: Switched to model_dump(mode="json") for all tool outputs that go through the Gemini API, which serializes enums as their string names ("ERROR" instead of 1).
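The underlying pitfall is independent of Pydantic: a Python-mode dump keeps raw enum members, which a JSON serializer can't encode, while a JSON-mode dump flattens them to plain values. This stand-alone illustration uses a hypothetical Severity enum rather than the project's real model:

```python
import json
from enum import Enum

class Severity(Enum):              # illustrative stand-in for the real model
    ERROR = 1
    WARNING = 2

def to_json_safe(issue):
    """Flatten enum members to their names, analogous to what a
    json-mode dump achieves for tool outputs sent to the API."""
    return {k: v.name if isinstance(v, Enum) else v for k, v in issue.items()}

python_mode = {"package": "torch", "severity": Severity.ERROR}  # raw enum member
json_mode = to_json_safe(python_mode)                           # severity -> "ERROR"
```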

5. Rate Limiting on Free Tier

Gemini's free tier has strict rate limits. Our integration tests (which make 5 sequential API calls) would pass one test and then get 429 RESOURCE_EXHAUSTED on the rest.

Solution: We built exponential backoff retry logic (_send_with_retry()) that automatically retries on 429/500/502/503 errors with doubling delays (1s → 2s → 4s → 8s → 16s). We also excluded integration tests from the default pytest run so they don't block development.
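A retry helper of this shape can be sketched as follows. The `.status_code` attribute on the raised exception is a simplifying assumption for the sketch, and `send()` stands in for one Gemini API call:

```python
import time

RETRYABLE = {429, 500, 502, 503}

def send_with_retry(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Sketch of _send_with_retry()-style exponential backoff.

    Retries only on transient HTTP-style errors, doubling the delay
    each attempt: 1s -> 2s -> 4s -> 8s -> 16s."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in RETRYABLE or attempt == max_retries:
                raise                         # non-retryable, or out of retries
            sleep(base_delay * 2 ** attempt)  # doubling delay per attempt
```

Injecting `sleep` keeps the backoff schedule unit-testable without real waiting.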

6. Environment Inspection Performance (29 seconds)

With 575 packages, pip inspect takes nearly 30 seconds. In the autonomous repair loop, the agent might call tool_run_doctor() multiple times (diagnose → fix → verify → fix → verify), each taking 30 seconds.

Solution: We focused on making the agent efficient — fix one issue at a time, verify after each fix, and only re-run the full doctor when necessary. The slim output format also helps Gemini respond faster since it's processing less data.


Accomplishments that we're proud of

🏆 It Found a Real Bug We Didn't Know About

During our first live demo run, the agent discovered that torchaudio 2.2.1+cu121 was incompatible with torch 2.6.0+cu124. We had been running with this mismatch for weeks without realizing it. The agent didn't just flag it — it planned the exact fix (pip install torchaudio with the correct CUDA index URL).

🏆 276+ Tests, All TDD

Every single feature was built tests-first. The repair tools had 13 tests before implementation. The retry logic had 11 tests before _send_with_retry() existed. The CLI commands had 17 tests before we wrote a line of Typer code. This caught bugs early and gave us confidence to refactor aggressively.

🏆 Published on PyPI as a Real Package

This isn't a demo or a notebook. It's a real, installable Python package: pip install compatibillabuddy. It has proper packaging (Hatchling), entry points, optional dependencies ([agent]), CI/CD on GitHub Actions, and semantic versioning. Judges can install and run it in 30 seconds.

🏆 Self-Correcting Agent

The agent doesn't just apply fixes and hope. It verifies every fix by re-running the doctor and comparing issue counts. If a fix introduced new problems, it rolls back to the snapshot and tries an alternative approach. This verify-or-rollback loop is what makes it a true autonomous agent, not a chatbot that suggests commands.

🏆 Works Without AI Too

The compatibuddy doctor command works entirely offline with zero API calls. It probes real hardware via nvidia-smi, inspects real packages via pip, and evaluates real compatibility rules from TOML rulepacks. The AI agent enhances the experience but isn't required for basic diagnosis.

🏆 Novel Framing: Hardware as a Dependency

Existing tools treat dependencies as a version graph. We treat hardware as a first-class constraint: GPU vendor, CUDA version, driver version, VRAM, and CPU architecture all participate in compatibility evaluation. This is a genuinely new approach to the ML dependency problem.

🏆 Production-Grade Safety

Dry-run by default. Virtual environment detection. Protected package blocklist. Operation limits. Snapshot-before-modify. Automatic rollback. The agent can't accidentally destroy your system Python — we designed every guardrail to prevent it.


What we learned

1. Token Budget Management is Critical for Tool-Calling Agents

The biggest lesson: what you return from tools matters as much as what the model says. Returning full data dumps (133K tokens) doesn't just slow things down — it causes the model to produce malformed responses. Designing slim, purpose-specific tool outputs is an essential skill for building reliable agents.

2. Disable Automatic Function Calling for Observability

Automatic function calling is convenient for simple use cases, but for anything requiring progress feedback, error handling, or audit logging, manual dispatch is essential. We need to see what the agent is doing, when, and why — automatic mode makes the agent a black box.

3. Gemini's Function Calling is Remarkably Good at Planning

When given structured tools with clear descriptions, Gemini consistently follows our repair protocol (snapshot → diagnose → plan → fix → verify) without being explicitly prompted at each step. It even prioritizes critical issues first and adapts its strategy when fixes fail. The planning capability is the real power of the Marathon Agent approach.

4. TDD Saves Time, Even Under Hackathon Pressure

Writing tests first felt slow at the start, but it paid off massively during debugging. When the Gemini API returned unexpected responses, we knew exactly which layer was broken because every layer had isolated tests. We never had to do a "works on my machine" debugging session.

5. The ML Dependency Problem is Worse Than We Thought

Building the knowledge base rules forced us to catalog just how many ways ML environments can break. CUDA version mismatches, NumPy ABI boundaries, framework coinstallation conflicts, deprecated APIs, driver version requirements — it's a combinatorial explosion that no resolver currently handles.


What's next for Compatibillabuddy

Short-Term (Next Release)

  • PEP 751 pylock.toml export — Generate lockfiles that include hardware constraints
  • PEP 817 wheel variant awareness — Detect and recommend correct wheel variants for your hardware
  • More rulepacks — JAX, TensorRT, ONNX Runtime, Hugging Face Transformers
  • Community rulepack repository — Let users contribute and share compatibility rules
  • compatibuddy explain command — Deep-dive explanations of specific issues

Medium-Term

  • Multi-environment management — Compare environments, detect drift, sync configs
  • CI/CD integration — GitHub Action that runs compatibuddy doctor in your pipeline
  • VS Code extension — Inline warnings when you pip install something incompatible
  • Conda support — Extend beyond pip to conda environments
  • Windows-native GPU detection — Currently uses nvidia-smi; add WMI/DXGI fallback

Long-Term Vision

  • Predictive compatibility — "If you install package X, here's what will break"
  • Cross-framework migration — "Switch from TensorFlow to PyTorch" with automated dependency swaps
  • Hardware recommendation — "Your model needs X VRAM, but you only have Y — here are your options"
  • Industry standard — Make hardware-aware dependency resolution a first-class concept in the Python packaging ecosystem

Open Source Roadmap

Compatibillabuddy is MIT-licensed and designed for community contribution. The TOML rulepack system means anyone can add new compatibility rules without touching Python code. We plan to establish a community rulepack repository where ML practitioners can share and curate rules for their specific ecosystems.


Links


Built With

  • Python
  • Gemini API (google-genai SDK)
  • Pydantic v2
  • Typer
  • Rich
  • pytest
  • GitHub Actions