Today's off-the-shelf LMs have incredible generalization, reasoning, and planning capabilities. Agentic harnesses in CaP-Agent0 unleash that potential in the physical world.
Click on each task to see the agent in action.
CaP-Bench provides the first comprehensive benchmark for evaluating how well large language model agents can write code to control robots. Spanning hundreds of manipulation tasks across multiple robot learning benchmarks (LIBERO-PRO, Robosuite, BEHAVIOR), CaP-Bench tests both LLMs and VLMs on their ability to generate executable robot control policies from natural language instructions.
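As a rough illustration of what "generating an executable control policy from language" means, a sketch is shown below. The API names (`get_object_position`, `pick`, `place`) and the stub implementations are hypothetical, not CaP-Bench's actual interface; the point is only the shape of a generated policy:

```python
# Hypothetical robot API stubs; CaP-Bench's real interface may differ.
actions = []  # log of executed primitives, for illustration only

def get_object_position(name):
    """Return a hard-coded (x, y, z) position for a named object."""
    positions = {"red block": (0.4, 0.1, 0.02), "bowl": (0.6, -0.2, 0.05)}
    return positions[name]

def pick(position):
    actions.append(("pick", position))

def place(position):
    actions.append(("place", position))

# What an LLM-generated policy for the instruction
# "put the red block in the bowl" might look like:
def policy():
    block = get_object_position("red block")
    bowl = get_object_position("bowl")
    pick(block)
    place(bowl)

policy()
print(actions)  # [('pick', (0.4, 0.1, 0.02)), ('place', (0.6, -0.2, 0.05))]
```

The benchmark then scores whether executing such generated code actually completes the task in simulation.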
Without any task-specific training, today's best frontier models can directly generate executable robot control code and achieve over 30% average success — a sharp contrast to the prior belief that only specially trained models (VLAs) can perform manipulation. Yet a 56-point gap to human performance remains, marking this as one of AI's most important open challenges.
On LIBERO-PRO — 30 manipulation tasks with position and instruction perturbations — state-of-the-art Vision-Language-Action models (OpenVLA, π0) score 0% across the board. Even the best VLA (π0.5) reaches only 13% average success. CaP-Agent0, a training-free coding agent, achieves 18% without any task-specific training, demonstrating that code-generation agents generalize where end-to-end learned policies break down.
Using CaP-RL, we apply reinforcement learning with environment rewards directly on the coding agent. A 7B model (Qwen 2.5 Coder) jumps from 20% to 72% average success in simulation after just 50 training iterations. The learned policies transfer to a real Franka Emika robot with minimal sim-to-real gap — reaching 84% on cube lifting and 76% on cube stacking, approaching human expert performance.
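The core training signal can be sketched in a drastically simplified form. In CaP-RL the policy is an LLM coding agent sampling whole programs and the environment is a robot simulator; below, both are toy stand-ins (the candidate programs, reward function, and value-update rule are our own illustrative simplifications, not the actual algorithm):

```python
# Toy stand-ins for the CaP-RL loop: sample a program, execute it in the
# environment, and learn from the binary task-success reward.
CANDIDATE_PROGRAMS = [
    "pick(cube); place(shelf)",    # wrong goal
    "pick(cube); place(target)",   # succeeds in this toy task
    "place(target); pick(cube)",   # wrong order
]

def run_in_env(program):
    """Binary environment reward: 1.0 if the task succeeds, else 0.0."""
    return 1.0 if program == "pick(cube); place(target)" else 0.0

def train(candidates, iterations=10, step=0.5):
    """Roll out each candidate and move its value estimate toward the reward."""
    value = {p: 0.0 for p in candidates}
    for _ in range(iterations):
        for program in candidates:
            reward = run_in_env(program)
            value[program] += step * (reward - value[program])
    # The trained "policy" prefers the highest-value program.
    return max(candidates, key=lambda p: value[p])

best = train(CANDIDATE_PROGRAMS)
print(best)  # the highest-value program after training
```

The real system replaces the fixed candidate list with programs sampled from the 7B model and the value table with gradient updates to the model's weights, but the reward plumbing is the same: environment success drives the update.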
Task success rises monotonically as API abstraction increases from raw primitives (S4) to high-level pick-and-place (S1). Yet code compilation stays high even at the lowest level — revealing that the bottleneck is physical reasoning, not code correctness. This motivates evaluating agents on primitive-level performance, where success requires genuine spatial and control reasoning rather than relying on human-designed abstractions.
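To make the abstraction axis concrete, here is an illustrative sketch (the level names map only roughly onto S1/S4, and all function names are hypothetical): at the high-abstraction end the agent emits one `pick_and_place` call, while at the primitive end it must sequence the approach, grasp, lift, and release itself, which is where spatial and control reasoning actually gets exercised:

```python
log = []  # record of primitive actions, for illustration

# --- Low-level primitives (roughly the S4 end of the spectrum) ---
def move_to(pos):
    log.append(("move_to", pos))

def close_gripper():
    log.append(("close_gripper",))

def open_gripper():
    log.append(("open_gripper",))

# --- High-level API (roughly the S1 end), built on the primitives ---
def pick_and_place(obj_pos, target_pos, lift=0.15):
    """One call hides the approach, grasp, lift, transport, and release."""
    x, y, z = obj_pos
    move_to((x, y, z + lift))   # approach from above
    move_to(obj_pos)
    close_gripper()
    move_to((x, y, z + lift))   # lift the object
    tx, ty, tz = target_pos
    move_to((tx, ty, tz + lift))
    move_to(target_pos)
    open_gripper()

# An S1-style policy is a single line; the API does the spatial reasoning.
# An S4-style policy would have to emit the seven primitive calls itself.
pick_and_place((0.4, 0.1, 0.02), (0.6, -0.2, 0.05))
print(len(log))  # number of primitive actions the one call expanded into
```

At the primitive level, every one of those intermediate waypoints is a decision the agent must reason out, which is why compilation stays easy while task success drops.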