# Devpost Submission: Compatibillabuddy

## Inspiration
Every ML engineer has lived this nightmare:
```console
$ pip install torch numpy pandas scikit-learn
✅ Successfully installed ...
$ python -c "import torch; print(torch.cuda.is_available())"
❌ False — wrong CUDA version for your GPU driver
$ python -c "import sklearn"
❌ ImportError: numpy ABI incompatibility
```
Traditional dependency resolvers (pip, uv, poetry) solve version constraints — but ML environments fail for entirely different reasons: hardware mismatches, ABI breaks, and runtime incompatibilities that metadata alone can't capture. Your PyTorch might install fine but silently fall back to CPU because your CUDA toolkit doesn't match your GPU driver. Your scikit-learn might crash on import because it was compiled against a different NumPy ABI.
These failures waste millions of developer hours across the ML ecosystem. We've personally spent entire days debugging "why doesn't my GPU work" only to discover a single version mismatch buried in a stack of 500+ packages.
We asked: What if hardware was treated as a first-class dependency? And what if an autonomous AI agent could not only diagnose these issues but actually fix them — with verification, rollback, and self-correction?
That's Compatibillabuddy.
## What it does
Compatibillabuddy is a hardware-aware dependency compatibility framework for Python ML/AI environments with an autonomous Gemini-powered repair agent.
### Three Modes of Operation

#### 1. Offline Diagnosis (No AI Required)

```bash
pip install compatibillabuddy
compatibuddy doctor
```
Probes your hardware (GPU, CUDA driver, CPU, OS), inspects all installed packages, and evaluates a curated knowledge base of known-bad combinations. Produces a Rich-formatted terminal report or machine-readable JSON. Zero network calls, zero API keys — works anywhere.
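For a flavor of what the hardware layer does, here is a minimal sketch (the function name and return shape are illustrative, not the package's actual code) of pulling GPU facts out of nvidia-smi:

```python
import subprocess

def probe_gpu() -> dict | None:
    """Query GPU name, driver version, and VRAM via nvidia-smi (illustrative)."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (FileNotFoundError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None  # no NVIDIA GPU, or nvidia-smi is unavailable
    if not out.stdout.strip():
        return None
    name, driver, vram = (field.strip() for field in out.stdout.split(",", 2))
    return {"name": name, "driver_version": driver, "vram": vram}
```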
#### 2. Autonomous Repair (Gemini Agent)

```bash
pip install "compatibillabuddy[agent]"
export GEMINI_API_KEY="your-key-here"    # Linux/macOS
$env:GEMINI_API_KEY = "your-key-here"    # Windows PowerShell
compatibuddy repair          # Dry run: shows plan without executing
compatibuddy repair --live   # Executes fixes for real
```
The agent follows a strict protocol: Snapshot → Diagnose → Plan → Fix → Verify → Rollback if needed. It fixes one issue at a time, verifies each fix by re-running the doctor, and automatically rolls back if a fix makes things worse.
#### 3. Interactive Chat

```bash
compatibuddy agent
```
Multi-turn conversation with the agent. Ask it anything about your environment, tell it to investigate specific packages, or guide the repair process manually.
### The Autonomous Repair Loop

```mermaid
flowchart TD
    A["🧑 User: compatibuddy repair"] --> B["📸 Snapshot Environment\n(pip freeze as rollback point)"]
    B --> C["🩺 Run Doctor\n(probe hardware + inspect packages + evaluate rules)"]
    C --> D{Issues Found?}
    D -- No --> E["✅ Report: Environment Healthy"]
    D -- Yes --> F["🧠 Gemini Plans Fix Order\n(critical issues first)"]
    F --> G["🔧 Execute Fix\n(pip install/uninstall with guardrails)"]
    G --> H["🔍 Verify Fix\n(re-run doctor, compare before/after)"]
    H --> I{Improved?}
    I -- Yes --> J{More Issues?}
    J -- Yes --> G
    J -- No --> K["✅ Report: All Fixed + Changelog"]
    I -- No --> L["⏪ Rollback to Snapshot"]
    L --> M["🔄 Try Alternative Fix"]
    M --> H
```
The agent doesn't guess — it runs real diagnostics via structured tools, executes real pip commands with safety guardrails, and verifies every fix before moving on. If a fix makes things worse, it rolls back automatically and tries an alternative approach.
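In pseudocode, the verify-or-rollback loop looks roughly like this (method names are illustrative, not the shipped auto_repair() implementation):

```python
def repair_loop(agent, max_retries: int = 3) -> None:
    """Sketch of the snapshot → diagnose → fix → verify → rollback protocol."""
    snapshot = agent.snapshot_environment()      # pip freeze as the rollback point
    issues = agent.run_doctor()                  # initial diagnosis
    for issue in agent.plan_fix_order(issues):   # Gemini decides the order, critical first
        for _ in range(max_retries):
            agent.execute_fix(issue)             # guarded pip install/uninstall
            remaining = agent.run_doctor()       # re-diagnose to verify the fix
            if len(remaining) < len(issues):     # strictly fewer issues: fix verified
                issues = remaining
                break
            agent.rollback(snapshot)             # fix made things worse: revert, retry differently
```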
### Safety Guardrails

The repair agent operates under strict constraints to prevent damage; a minimal sketch of the command-validation guardrail follows the list:
- Virtual environment isolation — refuses to modify system Python
- Snapshot before every change — full rollback capability at any point
- Dry-run by default — shows what it would do without executing
- Protected package blocklist — never uninstalls pip, setuptools, wheel, or itself
- Operation limit — stops after 10 pip commands per session
- Only pip install/uninstall — no arbitrary shell commands, no rm -rf, no wget
- Automatic rollback — if verification shows new problems, reverts immediately
- Exponential backoff — graceful handling of API rate limits (1s → 2s → 4s → 8s → 16s)
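Here is how such a command gate could look (the names and exact checks are assumptions for illustration, not the shipped code):

```python
PROTECTED = {"pip", "setuptools", "wheel", "compatibillabuddy"}   # never uninstall these
ALLOWED_OPS = {"install", "uninstall"}                            # the only verbs permitted
MAX_OPERATIONS = 10                                               # per-session budget

def validate_pip_command(args: list[str], ops_so_far: int) -> None:
    """Raise if a requested pip invocation violates any guardrail."""
    if ops_so_far >= MAX_OPERATIONS:
        raise RuntimeError("Operation limit reached: 10 pip commands per session")
    if not args or args[0] not in ALLOWED_OPS:
        raise ValueError(f"Only pip {sorted(ALLOWED_OPS)} are permitted")
    if args[0] == "uninstall":
        targets = {a.lower() for a in args[1:] if not a.startswith("-")}
        blocked = targets & PROTECTED
        if blocked:
            raise ValueError(f"Refusing to uninstall protected packages: {blocked}")
```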
### 9 Structured Agent Tools
The Gemini agent has access to exactly 9 tools — not arbitrary shell access:
| Tool | Purpose |
|---|---|
| `tool_probe_hardware` | Detect OS, CPU, GPU, and CUDA version via nvidia-smi |
| `tool_inspect_environment` | List all installed Python packages with versions |
| `tool_run_doctor` | Full compatibility diagnosis against the knowledge base |
| `tool_explain_issue` | Detailed explanation of a specific diagnosed issue |
| `tool_search_rules` | Search the knowledge base for rules about a package |
| `tool_snapshot_environment` | Capture pip freeze as a timestamped rollback point |
| `tool_run_pip` | Execute pip install/uninstall with safety guardrails |
| `tool_verify_fix` | Re-diagnose and compare before/after issue counts |
| `tool_rollback` | Restore all packages to a previous snapshot |
## How we built it

### Architecture
```mermaid
graph TB
    subgraph CLI["CLI Layer (Typer + Rich)"]
        DOC["compatibuddy doctor"]
        REP["compatibuddy repair"]
        AGT["compatibuddy agent"]
    end
    subgraph AGENT["Agent Layer (Gemini API)"]
        CORE["AgentSession\n(manual tool dispatch loop)"]
        RETRY["_send_with_retry()\n(exponential backoff)"]
        TOOLS["9 Structured Tools"]
        CONFIG["AgentConfig\n(API key, model, retry settings)"]
    end
    subgraph ENGINE["Engine Layer"]
        DOCTOR["diagnose()\n(orchestrator)"]
        RULES["Rule Engine\n(TOML rulepacks)"]
        MODELS["Pydantic v2 Models\n(GPU, packages, issues)"]
        REPORT["Report Formatter\n(Rich console + JSON)"]
    end
    subgraph HW["Hardware Layer"]
        PROBE["probe_hardware()\n(nvidia-smi, platform)"]
        INSPECT["inspect_environment()\n(pip inspect)"]
    end
    DOC --> DOCTOR
    REP --> CORE
    AGT --> CORE
    CORE --> RETRY
    CORE --> TOOLS
    TOOLS --> DOCTOR
    TOOLS --> PROBE
    TOOLS --> INSPECT
    DOCTOR --> PROBE
    DOCTOR --> INSPECT
    DOCTOR --> RULES
    DOCTOR --> REPORT
    style AGENT fill:#e8f4f8,stroke:#2196F3
    style ENGINE fill:#f3e8f4,stroke:#9C27B0
    style HW fill:#e8f4e8,stroke:#4CAF50
    style CLI fill:#fff3e0,stroke:#FF9800
```
### Tech Stack
- Python 3.10+ — tested on 3.10, 3.11, 3.12, 3.13
- google-genai SDK — Gemini function calling with manual tool dispatch
- Pydantic v2 — structured data models with JSON schema export
- Typer + Rich — beautiful CLI with formatted terminal output
- packaging — PEP 440 version specifier matching
- pytest — 276+ unit tests + integration tests
- ruff — linting and formatting
- Hatchling — modern Python build backend
- GitHub Actions — CI/CD on 3 OS × 4 Python versions
### Development Methodology
We followed strict TDD (Test-Driven Development) throughout:
- Red — Write failing tests first
- Green — Implement minimum code to pass
- Refactor — Clean up while keeping tests green
- Lint — ruff check + format before every commit
Every feature was built tests-first. The repair tools had 13 tests written before a single line of implementation. The retry logic had 11 tests written before _send_with_retry() existed.
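As a flavor of the red phase, the first guardrail test might have looked like this (the module path and exact API shown are assumptions for illustration):

```python
import pytest

def test_run_pip_refuses_to_uninstall_protected_packages():
    # Red phase: written before tool_run_pip existed; module path is illustrative.
    from compatibillabuddy.agent import tools
    with pytest.raises(ValueError, match="protected"):
        tools.tool_run_pip(["uninstall", "pip"], dry_run=True)
```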
### Build Timeline (Sprint Phases)
| Phase | Time | What We Built |
|---|---|---|
| Foundation | ~4d | Hardware probing, environment inspection, knowledge base engine, doctor command, Pydantic models, TOML rulepacks, CLI, Rich reports, 230+ tests |
| A: Repair Tools | 1.5d | Snapshot, pip execution with guardrails, verify, rollback — 13 tests |
| B: Autonomous Loop | 2d | auto_repair() method, RepairResult dataclass, event callbacks — 11 tests |
| C: CLI Commands | 1d | compatibuddy agent (interactive) + compatibuddy repair (autonomous) — 17 tests |
| D: Integration Tests | 1d | 5 live Gemini API tests, verified end-to-end |
| E: Hardening | 2d | Retry with exponential backoff, slim tool outputs, malformed response handling — 11 tests |
| F: Demo & Polish | 3d | README rewrite, demo scripts, Devpost submission |
### Key Design Decisions

#### 1. Manual tool dispatch, not automatic function calling
The google-genai SDK supports automatic function calling, but we disabled it. Why? Because automatic mode bypasses our event callback system — the user sees nothing for minutes while tools run silently. Manual dispatch lets us emit progress events (🔧 tool_snapshot_environment()) so the CLI shows real-time feedback.
#### 2. Gemini drives the repair loop, not programmatic orchestration

We don't hardcode "run doctor, then fix issue #1, then verify." The agent plans its own repair strategy based on the diagnosis. It decides fix order, chooses between install/uninstall, picks version specifiers, and adapts when things go wrong. This is what makes it a true Marathon Agent — it demonstrates sustained, autonomous reasoning over many tool calls.
#### 3. Hardware as a first-class dependency
Traditional resolvers treat packages as nodes in a version graph. We model hardware (GPU vendor, CUDA version, driver version, VRAM, CPU architecture) as constraints that packages must satisfy. This is the fundamental insight: torch==2.6.0+cu124 isn't just a version — it's a statement about what hardware it needs.
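A toy evaluation shows the idea, reusing the same `packaging` specifier machinery the engine applies to PEP 440 versions (the rule shape here is illustrative, not the actual TOML rulepack schema):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

detected_cuda = Version("12.7")                 # from the hardware probe
installed_torch = Version("2.6.0")              # from environment inspection

rule = {
    "package": "torch",
    "applies_to": SpecifierSet("==2.6.0"),      # which torch builds the rule covers
    "requires_cuda": SpecifierSet(">=12.4"),    # a hardware constraint, not a package one
}

if installed_torch in rule["applies_to"]:
    ok = detected_cuda in rule["requires_cuda"]
    print(f"CUDA constraint {rule['requires_cuda']}: {'satisfied' if ok else 'VIOLATED'}")
```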
#### 4. Slim tool outputs to manage token budget
With 575 packages installed, a full model_dump() produced 133K tokens — blowing Gemini's context window. We slim each tool's output to only what the agent needs: hardware summary + issues for doctor, name + version for package inspection. This cut token usage from ~133K to ~2K per tool call.
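The slimming itself is conceptually simple (field names here are assumed):

```python
def slim_environment(packages) -> list[dict]:
    """Name + version pairs only: ~2K tokens for 575 packages vs ~133K for a full dump."""
    return [{"name": p.name, "version": p.version} for p in packages]

def slim_doctor_report(report) -> dict:
    """Hardware summary plus issues; the full package list never reaches the model."""
    return {
        "hardware": report.hardware_summary,
        "issues": [issue.model_dump(mode="json") for issue in report.issues],
    }
```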
#### 5. Dry-run by default everywhere
Safety-first design. The compatibuddy repair command, the tool_run_pip() function, and the auto_repair() method all default to dry-run mode. You have to explicitly opt in to live execution with --live or dry_run=False.
## How to Install and Try It Yourself

### Prerequisites
- Python 3.10 or higher
- A Gemini API key (free tier works for the agent features)
### Step 1: Install from PyPI

```bash
# Core framework only (diagnosis, no AI)
pip install compatibillabuddy

# With the Gemini-powered AI agent
pip install "compatibillabuddy[agent]"
```
### Step 2: Set Your API Key
The agent commands (compatibuddy repair and compatibuddy agent) require a Gemini API key. If no key is found, the CLI will display a clear error message telling you how to set one:
```text
Error: No Gemini API key found. Set GEMINI_API_KEY environment variable or pass --api-key.
```
Set it via environment variable:

```bash
# Linux / macOS
export GEMINI_API_KEY="your-key-here"

# Windows PowerShell
$env:GEMINI_API_KEY = "your-key-here"
```

Or pass it directly:

```bash
compatibuddy repair --api-key "your-key-here"
compatibuddy agent --api-key "your-key-here"
```
### Step 3: Run It

```bash
# Diagnose your environment (no AI, no API key needed)
compatibuddy doctor

# JSON output for automation
compatibuddy doctor --format json

# Autonomous repair — dry run (see what the agent would do)
compatibuddy repair

# Autonomous repair — live mode (actually execute fixes)
compatibuddy repair --live

# Interactive chat with the agent
compatibuddy agent
```
### What You'll See

**Doctor output** (no API key needed):

```text
╭──────────── Hardware ─────────────╮
│ OS: Windows 10.0.26200            │
│ CPU: Intel Core i9 (AMD64)        │
│ Python: 3.12.2                    │
│ GPU: NVIDIA RTX 4090 Laptop       │
│ (CUDA 12.7, 16GB VRAM)            │
╰───────────────────────────────────╯
╭──── [WARNING] coinstall_conflict ─╮
│ PyTorch and TensorFlow are both   │
│ installed — they may conflict on  │
│ CUDA runtime libraries            │
╰───────────────────────────────────╯
```
**Repair output** (requires an API key):

```text
🔧 Compatibillabuddy Repair [DRY RUN]
Model: gemini-3-flash-preview
Max retries per issue: 3

🔧 tool_snapshot_environment()
🔧 tool_run_doctor()
🔧 tool_search_rules()

Diagnosis Summary:
1. torchaudio (2.2.1+cu121) is incompatible with torch (2.6.0+cu124)
2. PyTorch and TensorFlow are both installed (CUDA conflict)
3. scikit-learn and pandas deprecation warnings

Action 1: Align Torchaudio with PyTorch...
```
## Challenges we ran into

### 1. Token Budget Explosion (133K tokens per tool call)
Our biggest technical challenge. With 575 packages installed, tool_run_doctor() returned a full model_dump() with every package's name, version, dependencies, location, and installer. That's 133,000 tokens — well beyond what Gemini can reason about effectively. The model returned MALFORMED_FUNCTION_CALL with parts=None, crashing our tool loop.
Solution: We slimmed every tool output to only what the agent actually needs. Doctor returns hardware summary + issues (not the full package list). Environment inspection returns name + version pairs only. This cut token usage by ~98% while preserving all the information the agent needs for diagnosis and repair.
### 2. Automatic Function Calling Ate Our Progress Events
The google-genai SDK has automatic function calling enabled by default. It silently calls tools, feeds results back to Gemini, and returns only the final response. This meant our carefully designed event callback system never fired — users saw nothing for 5+ minutes while the agent cycled through snapshot → doctor → search → plan.
Solution: We explicitly disabled automatic function calling with AutomaticFunctionCallingConfig(disable=True) and built our own manual dispatch loop. This gives us full control over the tool-call cycle, letting us emit progress events (🔧 tool_snapshot_environment()) that the CLI displays in real-time.
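Condensed to its essence, the manual loop looks like this (`DECLARATIONS` and `TOOLS` are assumed registries of function schemas and Python callables; this is a sketch, not the project's AgentSession):

```python
from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment
chat = client.chats.create(
    model="gemini-3-flash-preview",
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=DECLARATIONS)],
        automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True),
    ),
)

response = chat.send_message("Diagnose and repair this environment.")
part = response.candidates[0].content.parts[0]
while part is not None and part.function_call is not None:
    call = part.function_call
    print(f"🔧 {call.name}()")                     # the progress event automatic mode hid
    result = TOOLS[call.name](**dict(call.args))   # run the real tool locally
    response = chat.send_message(
        types.Part.from_function_response(name=call.name, response={"result": result})
    )
    part = response.candidates[0].content.parts[0]
```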
### 3. Gemini MALFORMED_FUNCTION_CALL Responses
When tool outputs were too large or serialization produced non-JSON-safe values (like Python enum instances), Gemini sometimes returned responses with parts=None and finish_reason=MALFORMED_FUNCTION_CALL. Our code crashed on response.candidates[0].content.parts[0].
Solution: We added _extract_part() — a defensive helper that safely navigates the response structure and returns None instead of crashing. Both chat() and auto_repair() now handle None parts gracefully with informative messages.
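In spirit, the helper does nothing more than refuse to assume any level of the response exists (a sketch; the real _extract_part() may differ):

```python
def _extract_part(response):
    """Return the first content part of a Gemini response, or None if malformed."""
    candidates = getattr(response, "candidates", None)
    if not candidates:
        return None
    content = getattr(candidates[0], "content", None)
    parts = getattr(content, "parts", None)
    return parts[0] if parts else None
```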
### 4. Pydantic Enum Serialization

The Severity enum in our models serialized as integer values (e.g., 1 for ERROR) via model_dump(), but Gemini expected string-serializable JSON. The tool_search_rules function crashed with "isinstance() arg 2 must be a type."
Solution: Switched to model_dump(mode="json") for all tool outputs that go through the Gemini API, which serializes enums as their string names ("ERROR" instead of 1).
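The difference is easy to demonstrate with a str-valued enum (an illustrative model, not the project's actual Severity):

```python
from enum import Enum
from pydantic import BaseModel

class Severity(str, Enum):
    ERROR = "ERROR"
    WARNING = "WARNING"

class Issue(BaseModel):
    severity: Severity
    summary: str

issue = Issue(severity=Severity.ERROR, summary="NumPy ABI mismatch")
print(issue.model_dump())              # {'severity': <Severity.ERROR: 'ERROR'>, ...} (enum member)
print(issue.model_dump(mode="json"))   # {'severity': 'ERROR', ...} (JSON-safe)
```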
### 5. Rate Limiting on the Free Tier
Gemini's free tier has strict rate limits. Our integration tests (which make 5 sequential API calls) would pass one test and then get 429 RESOURCE_EXHAUSTED on the rest.
Solution: We built exponential backoff retry logic (_send_with_retry()) that automatically retries on 429/500/502/503 errors with doubling delays (1s → 2s → 4s → 8s → 16s). We also excluded integration tests from the default pytest run so they don't block development.
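A sketch in the spirit of _send_with_retry() (assuming the SDK's errors.APIError exposes a `code` attribute; the real implementation may differ):

```python
import time

from google.genai import errors

RETRYABLE = {429, 500, 502, 503}

def send_with_retry(send, max_retries: int = 5):
    """Call send() and retry transient Gemini API failures with doubling delays."""
    delay = 1.0
    for attempt in range(max_retries + 1):
        try:
            return send()
        except errors.APIError as exc:
            if exc.code not in RETRYABLE or attempt == max_retries:
                raise                  # non-transient, or out of retries
            time.sleep(delay)          # 1s → 2s → 4s → 8s → 16s
            delay *= 2
```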
### 6. Environment Inspection Performance (29 seconds)
With 575 packages, pip inspect takes nearly 30 seconds. In the autonomous repair loop, the agent might call tool_run_doctor() multiple times (diagnose → fix → verify → fix → verify), each taking 30 seconds.
Solution: We focused on making the agent efficient — fix one issue at a time, verify after each fix, and only re-run the full doctor when necessary. The slim output format also helps Gemini respond faster since it's processing less data.
## Accomplishments that we're proud of

### 🏆 It Found a Real Bug We Didn't Know About
During our first live demo run, the agent discovered that torchaudio 2.2.1+cu121 was incompatible with torch 2.6.0+cu124. We had been running with this mismatch for weeks without realizing it. The agent didn't just flag it — it planned the exact fix (pip install torchaudio with the correct CUDA index URL).
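(For CUDA 12.4 builds, a fix of that shape would be something like `pip install torchaudio --index-url https://download.pytorch.org/whl/cu124`; the exact command the agent proposed depends on the environment.)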
### 🏆 276+ Tests, All TDD
Every single feature was built tests-first. The repair tools had 13 tests before implementation. The retry logic had 11 tests before _send_with_retry() existed. The CLI commands had 17 tests before we wrote a line of Typer code. This caught bugs early and gave us confidence to refactor aggressively.
### 🏆 Published on PyPI as a Real Package
This isn't a demo or a notebook. It's a real, installable Python package: pip install compatibillabuddy. It has proper packaging (Hatchling), entry points, optional dependencies ([agent]), CI/CD on GitHub Actions, and semantic versioning. Judges can install and run it in 30 seconds.
### 🏆 Self-Correcting Agent
The agent doesn't just apply fixes and hope. It verifies every fix by re-running the doctor and comparing issue counts. If a fix introduced new problems, it rolls back to the snapshot and tries an alternative approach. This verify-or-rollback loop is what makes it a true autonomous agent, not a chatbot that suggests commands.
### 🏆 Works Without AI Too
The compatibuddy doctor command works entirely offline with zero API calls. It probes real hardware via nvidia-smi, inspects real packages via pip, and evaluates real compatibility rules from TOML rulepacks. The AI agent enhances the experience but isn't required for basic diagnosis.
### 🏆 Novel Framing: Hardware as a Dependency
Existing tools treat dependencies as a version graph. We treat hardware as a first-class constraint: GPU vendor, CUDA version, driver version, VRAM, and CPU architecture all participate in compatibility evaluation. This is a genuinely new approach to the ML dependency problem.
### 🏆 Production-Grade Safety
Dry-run by default. Virtual environment detection. Protected package blocklist. Operation limits. Snapshot-before-modify. Automatic rollback. The agent can't accidentally destroy your system Python — we designed every guardrail to prevent it.
## What we learned

### 1. Token Budget Management is Critical for Tool-Calling Agents
The biggest lesson: what you return from tools matters as much as what the model says. Returning full data dumps (133K tokens) doesn't just slow things down — it causes the model to produce malformed responses. Designing slim, purpose-specific tool outputs is an essential skill for building reliable agents.
### 2. Disable Automatic Function Calling for Observability
Automatic function calling is convenient for simple use cases, but for anything requiring progress feedback, error handling, or audit logging, manual dispatch is essential. We need to see what the agent is doing, when, and why — automatic mode makes the agent a black box.
### 3. Gemini's Function Calling is Remarkably Good at Planning
When given structured tools with clear descriptions, Gemini consistently follows our repair protocol (snapshot → diagnose → plan → fix → verify) without being explicitly prompted at each step. It even prioritizes critical issues first and adapts its strategy when fixes fail. The planning capability is the real power of the Marathon Agent approach.
### 4. TDD Saves Time, Even Under Hackathon Pressure
Writing tests first felt slow at the start, but it paid off massively during debugging. When the Gemini API returned unexpected responses, we knew exactly which layer was broken because every layer had isolated tests. We never had to do a "works on my machine" debugging session.
### 5. The ML Dependency Problem is Worse Than We Thought
Building the knowledge base rules forced us to catalog just how many ways ML environments can break. CUDA version mismatches, NumPy ABI boundaries, framework coinstallation conflicts, deprecated APIs, driver version requirements — it's a combinatorial explosion that no resolver currently handles.
## What's next for Compatibillabuddy

### Short-Term (Next Release)
- PEP 751 pylock.toml export — Generate lockfiles that include hardware constraints
- PEP 817 wheel variant awareness — Detect and recommend correct wheel variants for your hardware
- More rulepacks — JAX, TensorRT, ONNX Runtime, Hugging Face Transformers
- Community rulepack repository — Let users contribute and share compatibility rules
- `compatibuddy explain` command — Deep-dive explanations of specific issues
### Medium-Term

- Multi-environment management — Compare environments, detect drift, sync configs
- CI/CD integration — A GitHub Action that runs `compatibuddy doctor` in your pipeline
- VS Code extension — Inline warnings when you `pip install` something incompatible
- Conda support — Extend beyond pip to conda environments
- Windows-native GPU detection — Currently uses nvidia-smi; add a WMI/DXGI fallback
### Long-Term Vision
- Predictive compatibility — "If you install package X, here's what will break"
- Cross-framework migration — "Switch from TensorFlow to PyTorch" with automated dependency swaps
- Hardware recommendation — "Your model needs X VRAM, but you only have Y — here are your options"
- Industry standard — Make hardware-aware dependency resolution a first-class concept in the Python packaging ecosystem
### Open Source Roadmap
Compatibillabuddy is MIT-licensed and designed for community contribution. The TOML rulepack system means anyone can add new compatibility rules without touching Python code. We plan to establish a community rulepack repository where ML practitioners can share and curate rules for their specific ecosystems.
## Links
- PyPI: https://pypi.org/project/compatibillabuddy/
- GitHub: https://github.com/jemsbhai/compatibillabuddy
- Install: `pip install "compatibillabuddy[agent]"`
## Built With
- Python
- Gemini API (google-genai SDK)
- Pydantic v2
- Typer
- Rich
- pytest
- GitHub Actions