Inspiration

We've all built at the cutting edge and seen how AI agents are deployed everywhere from fast-moving YC startups to large enterprises like Tesla & Amazon. The question we kept running into: can a cheaper, specialist model (an SLM) match or beat a large, general LLM on a team’s actual agentic workflow? Most teams don’t have the time or infrastructure to verify that.

So we set out to make verification and conversion practical for any agentic workflow: capture your agent’s real traces, prove parity (or wins), and automatically fine-tune a specialist SLM to replace the LLM.

What it does

Fledgling is an autonomous fine-tuning pipeline that converts expensive LLM-powered agents into fast, reliable, and low-cost SLM specialists.

Drop-in tracing (≈10 LOC): Wrap your Mastra/Langfuse-instrumented agents to capture prompts, tool calls (with schemas), intermediate reasoning, and observations.

Baseline vs Candidate: Compare your production LLM (e.g., GPT/Claude) to an SLM (e.g., Llama-3, Phi-4, Qwen-2.5) on your agent traces.

Decision & Training: If parity is achievable, kick off QLoRA fine-tuning (Unsloth) on the captured trajectories, supervising not just input→output pairs but the action/JSON/tool-call steps that matter for agents (see the sketch below).
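
Roughly, here is the idea in code: only the tokens of the tool-call/JSON span contribute to the loss, so the model is trained on the agent's actions rather than free-form prose. This is a hedged sketch, not our exact pipeline; the tokenizer/model id and field names are just for illustration.

```python
# Illustrative sketch: supervise only the tool-call span of a trajectory step.
# The model id and field names are placeholders, not our exact pipeline code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")  # illustrative base

def build_sample(prompt: str, tool_call_json: str, max_len: int = 4096) -> dict:
    """Return input_ids/labels where only the tool-call tokens are supervised."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(tool_call_json, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + target_ids)[:max_len]
    # -100 masks the prompt tokens out of the cross-entropy loss.
    labels = ([-100] * len(prompt_ids) + target_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```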

Deploy & Monitor: Swap in the SLM behind a safety gate; keep observability in Langfuse; re-eval over time to prevent drift.

Principle: If you’re already spending on frontier LLMs to stay ahead, turn your data moat into specialist SLMs and slash inference costs without losing reliability.

How we built it

We split the work across two repos (UI/tracing + ML pipeline):

1) Agent Observability + DX (Repo A)

Integrations: Mastra + Langfuse (+ OpenTelemetry under the hood) to capture agentic thought → action → observation with tool schemas.
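
Each forwarded trace step looks roughly like the record below. The field names are ours for the sketch, not the exact Mastra/Langfuse payload, but they show what we capture per step.

```python
# Illustrative shape of one captured agent step (field names are illustrative,
# not the exact Mastra/Langfuse payload).
trace_step = {
    "trace_id": "trace_abc123",
    "agent": "support-triage",
    "thought": "The user wants an order status; call the lookup tool.",
    "action": {
        "tool": "lookup_order",
        "arguments": {"order_id": "A-1042"},
        "schema": {  # JSON Schema captured alongside the call
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    "observation": {"status": "shipped", "eta_days": 2},
    "model": "gpt-4o",
    "latency_ms": 1830,
}
```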

Tracer: withMastraTracing() adapter (≈10 LOC integration) that auto-instruments agents and forwards full traces.

Web App: React 19 + Vite UI and an Express/TypeScript backend to:

List agents and traces,

Visualize tool calls & nested observations,

Compare LLM vs SLM runs,

Provide a playground.

Why this matters: Developers shouldn’t need to re-platform. Fledgling sits on your existing observability and makes “collect → compare → (auto) fine-tune” nearly zero-friction.

2) Fine-Tuning & Evaluation (Repo B)

Data prep: prepare_data.py converts structured datasets (e.g., Salesforce xLAM function-calling) and real traces into tool-calling/JSON supervision samples.
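
A hedged sketch of that conversion step follows. The query/tools/answers fields match how the public xLAM dataset is laid out, but the prompt format and output file are illustrative; the real prepare_data.py handles more cases (including our own traces).

```python
# Illustrative sketch of converting xLAM-style records into tool-calling
# supervision samples; prepare_data.py covers more formats than this.
import json

from datasets import load_dataset

SYSTEM = "You are a function-calling agent. Reply with a JSON list of tool calls."

def to_sample(record: dict) -> dict:
    tools = record["tools"]      # available tool schemas (JSON string in xLAM)
    query = record["query"]      # the user request
    answers = record["answers"]  # gold tool calls (JSON string in xLAM)
    prompt = f"{SYSTEM}\n\nTools:\n{tools}\n\nUser: {query}\nAssistant:"
    return {"prompt": prompt, "completion": answers}

# Note: the xLAM dataset is gated on Hugging Face; accept the license first.
ds = load_dataset("Salesforce/xlam-function-calling-60k", split="train")
with open("tool_calling_sft.jsonl", "w") as f:
    for record in ds:
        f.write(json.dumps(to_sample(record)) + "\n")
```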

Evaluation harness: eval.py computes JSON validity, tool-call accuracy, and exact-match metrics and logs everything to Langfuse.
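
A minimal sketch of the kinds of checks eval.py runs (illustrative, not the exact implementation; the real harness also logs per-sample scores to Langfuse):

```python
# Illustrative metric helpers in the spirit of eval.py.
import json

def json_validity(output: str) -> bool:
    """Does the model's raw output parse as JSON at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def tool_call_accuracy(predicted: str, expected: str) -> float:
    """1.0 when tool name and arguments both match the gold call, 0.5 for
    name-only. Assumes a single call encoded as a JSON object (illustrative)."""
    if not (json_validity(predicted) and json_validity(expected)):
        return 0.0
    pred, gold = json.loads(predicted), json.loads(expected)
    if not (isinstance(pred, dict) and isinstance(gold, dict)):
        return 0.0
    if pred.get("tool") != gold.get("tool"):
        return 0.0
    return 1.0 if pred.get("arguments") == gold.get("arguments") else 0.5

def exact_match(predicted: str, expected: str) -> bool:
    return predicted.strip() == expected.strip()
```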

Decision logic: compare.py checks performance gaps to decide if an SLM fine-tune is justified.
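
The decision rule, roughly (a hedged sketch in the spirit of compare.py; the metric keys and threshold values here are placeholders, not what we ship):

```python
# Illustrative decision rule: recommend an SLM fine-tune when the quality gap
# looks closable and the cost savings are large enough to matter.
def should_finetune(baseline: dict, candidate: dict,
                    closable_gap: float = 0.15,
                    min_cost_ratio: float = 5.0) -> bool:
    gap = baseline["tool_call_accuracy"] - candidate["tool_call_accuracy"]
    cost_ratio = baseline["cost_per_1k_calls"] / candidate["cost_per_1k_calls"]
    return gap <= closable_gap and cost_ratio >= min_cost_ratio
```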

Training: train_unsloth.py runs QLoRA (Unsloth) on Phi-4 / Llama-3 bases with sensible defaults (e.g., r=64, α=16, lr=2e-4). Designed to run on local rigs or Azure AI Foundry.
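
A compressed sketch of the recipe, in the spirit of the standard Unsloth QLoRA setup; the model name, dataset path, and text column are placeholders, and some arguments shift between Unsloth/TRL versions:

```python
# Illustrative Unsloth QLoRA recipe; exact arguments depend on your
# Unsloth/TRL versions, and the model/dataset names are placeholders.
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # or a Phi-4 base
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

dataset = load_dataset("json", data_files="tool_calling_sft.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="prompt",  # placeholder; point at your formatted column
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=torch.cuda.is_bf16_supported(),
        fp16=not torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```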

Ops niceties: CLI flags, dataset resolution checks, GPU-health preflight (fast fail if nvidia-smi/NVML is down), and runnable scripts to resume jobs.
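
The preflight idea, roughly (a hedged sketch; the real script also queries NVML directly, and the GPU count and timeout here are illustrative):

```python
# Illustrative GPU-health preflight: fail fast before training if nvidia-smi
# can't see the expected number of healthy GPUs.
import subprocess
import sys

def gpu_preflight(expected_gpus: int = 4) -> None:
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=15, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError,
            subprocess.TimeoutExpired) as err:
        sys.exit(f"GPU preflight failed: nvidia-smi unavailable or unresponsive ({err})")
    gpus = [line for line in out.stdout.strip().splitlines() if line]
    if len(gpus) < expected_gpus:
        sys.exit(f"GPU preflight failed: expected {expected_gpus} GPUs, found {len(gpus)}")
    print(f"GPU preflight OK: {len(gpus)} GPU(s) visible")

if __name__ == "__main__":
    gpu_preflight()
```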

Note: our fine-tuning repo on Git is currently ~14 hours behind our local copy due to the crash (details below). Code up to that point is pushed and runnable; the most recent training changes and checkpoints didn’t make it to Git in time.

Challenges we ran into

Observability is unforgiving. Getting OpenTelemetry→Langfuse traces with nested tool calls + schemas (without losing context) took real elbow grease.

Autonomous self-training is hard. Abstracting “any agent → distill into SLM” without bespoke glue is non-trivial; we leaned on adapters and strict JSON/tool schemas to generalize.

The 4 AM hardware catastrophe. We made major progress and ran evals showing fine-tuned SLMs could meet or beat LLMs on structured/tool-calling tasks. Then at ~4 AM, the local 4×3090 training rig one of our team members had recently built for hosting local LLMs hard-froze (NVML/Xid 79, “GPU has fallen off the bus”). It auto-rebooted, but not before 8 AM. We lost ~12 hours of unpushed fine-tuning changes and eval artifacts. (There’s a photo of the machine in the image gallery.)

Lesson learned: Commit early and often. Use remote checkpoints. Push logs/screenshots to Langfuse.

Three product questions we set out to solve—and how we addressed them:

How do we avoid a weeks-long setup that makes teams wish they’d built it from scratch? By enforcing a 10-line integration target and shipping a withMastraTracing() adapter that works out-of-the-box with Langfuse.

How do we make both developers and non-technical stakeholders happy? Devs get schema-validated tool calls, traces, and CLIs; non-technical users get a clean UI with side-by-side comparisons and cost/perf summaries.

How do we abstract a self-training system that distills into an SLM for any agentic app? We standardized on thought/action/observation traces, JSON schemas, and a decision engine (eval → compare → auto-train) that’s framework-agnostic.

Despite the initial learning curve, we took ownership, killed bad ideas fast, and iterated. One sleepless night later, we had adapters ingesting Mastra trace details, auto-trigger hooks that kick off training when thresholds are met, and an architecture that reduces configuration burden while staying extensible.

Accomplishments that we're proud of

10-line drop-in tracing for Mastra with full tool-call schema capture.

A clean UI for agents, traces, and side-by-side comparisons.

A complete evaluation harness for structured outputs + tool calling, logged to Langfuse.

A working QLoRA fine-tuning pipeline (Unsloth) with preflight GPU checks and resume scripts.

Pre-crash evals demonstrating SLM parity and small wins vs an LLM on tool-call accuracy and JSON validity.

An architecture that keeps logs in Langfuse (cost-aware) with clear hooks for richer analytics.

What we learned

ALWAYS COMMIT EARLY AND OFTEN.

Measure first. Full agent traces turn vibes into facts.

DX constraints help. “10 LOC” forced clean APIs and adapters.

Adapters > one-offs. Mastra first, but the pattern generalizes.

Small models are viable. 7B–8B SLMs compete on structured/tool-calling tasks at a tiny fraction of the cost.

Ops hygiene matters. Remote checkpoints + frequent commits, especially for GPU training.

What's next for Fledgling

If we want AI deployed for the average consumer, and not just the largest players in the room, we need to optimize at the cost frontier.
