Today's off-the-shelf LMs have incredible generalization, reasoning, and planning capabilities. Agentic harnesses in CaP-Agent0 unleash that potential in the physical world.
Click on each task to see the agent in action.
CaP-Bench provides the first comprehensive benchmark for evaluating how well large language model agents can write code to control robots. Spanning hundreds of manipulation tasks across multiple robot learning benchmarks (LIBERO-PRO, Robosuite, BEHAVIOR), CaP-Bench tests both LLMs and VLMs on their ability to generate executable robot control policies from natural language instructions.
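As a rough illustration of what "generating an executable control policy from language" means, a sketch is shown below. The API names (`get_object_position`, `pick`, `place`) and the stub implementations are hypothetical, not CaP-Bench's actual interface; the point is only the shape of a generated policy:

```python
# Hypothetical robot API stubs; CaP-Bench's real interface may differ.
actions = []  # log of executed primitives, for illustration only

def get_object_position(name):
    """Return a hard-coded (x, y, z) position for a named object."""
    positions = {"red block": (0.4, 0.1, 0.02), "bowl": (0.6, -0.2, 0.05)}
    return positions[name]

def pick(position):
    actions.append(("pick", position))

def place(position):
    actions.append(("place", position))

# What an LLM-generated policy for the instruction
# "put the red block in the bowl" might look like:
def policy():
    block = get_object_position("red block")
    bowl = get_object_position("bowl")
    pick(block)
    place(bowl)

policy()
print(actions)  # [('pick', (0.4, 0.1, 0.02)), ('place', (0.6, -0.2, 0.05))]
```

The benchmark then scores whether executing such generated code actually completes the task in simulation.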
Without any task-specific training, today's best frontier models can directly generate executable robot control code and achieve over 30% average success — a sharp contrast to the prior belief that only specially trained models (VLAs) can perform manipulation. Yet a 56-point gap to human performance remains, marking this as one of AI's most important open challenges.
On LIBERO-PRO — 30 manipulation tasks with position and instruction perturbations — state-of-the-art Vision-Language-Action models (OpenVLA, π0) score 0% across the board. Even the best VLA (π0.5) reaches only 13% average success. CaP-Agent0, a training-free coding agent, achieves 18% without any task-specific training, demonstrating that code-generation agents generalize where end-to-end learned policies break down.
Using CaP-RL, we apply reinforcement learning with environment rewards directly on the coding agent. A 7B model (Qwen 2.5 Coder) jumps from 20% to 72% average success in simulation after just 50 training iterations. The learned policies transfer to a real Franka Emika robot with minimal sim-to-real gap — reaching 84% on cube lifting and 76% on cube stacking, approaching human expert performance.
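The core training signal can be sketched in a drastically simplified form. In CaP-RL the policy is an LLM coding agent sampling whole programs and the environment is a robot simulator; below, both are toy stand-ins (the candidate programs, reward function, and value-update rule are our own illustrative simplifications, not the actual algorithm):

```python
# Toy stand-ins for the CaP-RL loop: sample a program, execute it in the
# environment, and learn from the binary task-success reward.
CANDIDATE_PROGRAMS = [
    "pick(cube); place(shelf)",    # wrong goal
    "pick(cube); place(target)",   # succeeds in this toy task
    "place(target); pick(cube)",   # wrong order
]

def run_in_env(program):
    """Binary environment reward: 1.0 if the task succeeds, else 0.0."""
    return 1.0 if program == "pick(cube); place(target)" else 0.0

def train(candidates, iterations=10, step=0.5):
    """Roll out each candidate and move its value estimate toward the reward."""
    value = {p: 0.0 for p in candidates}
    for _ in range(iterations):
        for program in candidates:
            reward = run_in_env(program)
            value[program] += step * (reward - value[program])
    # The trained "policy" prefers the highest-value program.
    return max(candidates, key=lambda p: value[p])

best = train(CANDIDATE_PROGRAMS)
print(best)  # the highest-value program after training
```

The real system replaces the fixed candidate list with programs sampled from the 7B model and the value table with gradient updates to the model's weights, but the reward plumbing is the same: environment success drives the update.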
Task success rises monotonically as API abstraction increases from raw primitives (S4) to high-level pick-and-place (S1). Yet code compilation stays high even at the lowest level — revealing that the bottleneck is physical reasoning, not code correctness. This motivates evaluating agents on primitive-level performance, where success requires genuine spatial and control reasoning rather than relying on human-designed abstractions.
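To make the abstraction axis concrete, here is an illustrative sketch (the level names map only roughly onto S1/S4, and all function names are hypothetical): at the high-abstraction end the agent emits one `pick_and_place` call, while at the primitive end it must sequence the approach, grasp, lift, and release itself, which is where spatial and control reasoning actually gets exercised:

```python
log = []  # record of primitive actions, for illustration

# --- Low-level primitives (roughly the S4 end of the spectrum) ---
def move_to(pos):
    log.append(("move_to", pos))

def close_gripper():
    log.append(("close_gripper",))

def open_gripper():
    log.append(("open_gripper",))

# --- High-level API (roughly the S1 end), built on the primitives ---
def pick_and_place(obj_pos, target_pos, lift=0.15):
    """One call hides the approach, grasp, lift, transport, and release."""
    x, y, z = obj_pos
    move_to((x, y, z + lift))   # approach from above
    move_to(obj_pos)
    close_gripper()
    move_to((x, y, z + lift))   # lift the object
    tx, ty, tz = target_pos
    move_to((tx, ty, tz + lift))
    move_to(target_pos)
    open_gripper()

# An S1-style policy is a single line; the API does the spatial reasoning.
# An S4-style policy would have to emit the seven primitive calls itself.
pick_and_place((0.4, 0.1, 0.02), (0.6, -0.2, 0.05))
print(len(log))  # number of primitive actions the one call expanded into
```

At the primitive level, every one of those intermediate waypoints is a decision the agent must reason out, which is why compilation stays easy while task success drops.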