Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents

Jilin University, Harvard University, Massachusetts Institute of Technology,
Huazhong University of Science and Technology, Southern University of Science and Technology,
Lehigh University, Shanghai Jiao Tong University

[Teaser figure]

Overview of the Agentic Robot framework governed by Standardized Action Procedures (SAP). (1) An LRM-based planner, guided by a skill library, decomposes the high-level task into structured subgoals. (2) A VLA-based executor carries out each subgoal from natural language instructions and real-time visual input. (3) A VLM-based verifier periodically inspects a sliding window of third-person and wrist-mounted views to decide whether to continue, retry, or recover. This SAP-driven agentic loop enables robust, interpretable, and feedback-driven manipulation.

[Comparison figure]

Comparison between OpenVLA and Agentic Robot on the task "Put the cream cheese in the bowl." Top: OpenVLA fails to grasp the object; the gripper collides with the table and the task fails. Bottom: Agentic Robot decomposes the task into subgoals, detects the failure via visual verification, issues a recovery action ("Lift the gripper"), and completes the task on retry. Green boxes show VLM verifier decisions; orange boxes show VLA executor instructions.
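
To make the verification-and-recovery behavior above concrete, here is a minimal sketch of a sliding-window verifier under our own reading of the figure. The names (SlidingWindowVerifier, Verdict, the vlm callable) are illustrative assumptions, not the released implementation; the VLM is treated as an opaque callable that returns a short text answer.

    from collections import deque
    from enum import Enum

    class Verdict(Enum):
        CONTINUE = "continue"   # subgoal appears to be progressing
        RETRY = "retry"         # subgoal failed; re-issue it
        RECOVER = "recover"     # run a recovery action first (e.g., lift the gripper)

    class SlidingWindowVerifier:
        """Buffers the last `window` (third-person, wrist) frame pairs and
        periodically asks a VLM whether the current subgoal is on track."""

        def __init__(self, vlm, window=8, check_every=20):
            self.vlm = vlm                    # hypothetical VLM client returning a string
            self.frames = deque(maxlen=window)
            self.check_every = check_every
            self.step = 0

        def observe(self, third_person_frame, wrist_frame):
            self.frames.append((third_person_frame, wrist_frame))
            self.step += 1

        def maybe_verify(self, subgoal):
            # Query the (expensive) VLM only every `check_every` control steps.
            if self.step % self.check_every != 0 or not self.frames:
                return Verdict.CONTINUE
            answer = self.vlm(
                frames=list(self.frames),
                question=f"Has the subgoal '{subgoal}' failed, stalled, or is it progressing?",
            )
            if "failed" in answer:
                return Verdict.RETRY
            if "stalled" in answer:
                return Verdict.RECOVER
            return Verdict.CONTINUE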

Abstract

Long-horizon robotic manipulation poses significant challenges for autonomous systems, requiring extended reasoning, precise execution, and robust error recovery across complex sequential tasks. Current approaches, whether based on static planning or end-to-end visuomotor policies, suffer from error accumulation and lack effective verification mechanisms during execution, limiting their reliability in real-world scenarios. We present Agentic Robot, a brain-inspired framework that addresses these limitations through Standardized Action Procedures (SAP)—a novel coordination protocol governing component interactions throughout manipulation tasks. Drawing inspiration from Standard Operating Procedures (SOPs) in human organizations, SAP establishes structured workflows for planning, execution, and verification phases. Our architecture comprises three specialized components: (1) a large reasoning model that decomposes high-level instructions into semantically coherent subgoals, (2) a vision-language-action executor that generates continuous control commands from real-time visual inputs, and (3) a temporal verifier that enables autonomous progression and error recovery through introspective assessment. This SAP-driven closed-loop design supports dynamic self-verification without external supervision. On the LIBERO benchmark, Agentic Robot achieves state-of-the-art performance with an average success rate of 79.6%, outperforming SpatialVLA by 6.1% and OpenVLA by 7.4% on long-horizon tasks. These results demonstrate that SAP-driven coordination between specialized components enhances both performance and interpretability in sequential manipulation, suggesting significant potential for reliable autonomous systems.
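
Read as a control loop, the three components in the abstract compose straightforwardly. The sketch below (reusing the Verdict enum and verifier interface from the earlier snippet) shows one plausible arrangement; planner, executor, and env are hypothetical interfaces standing in for the LRM planner, VLA executor, and environment, not the paper's actual code.

    def run_task(instruction, planner, executor, verifier, env, max_retries=3):
        """SAP-style closed loop: plan once, then execute each subgoal under
        periodic verification, retrying or recovering on failure."""
        subgoals = planner.decompose(instruction)          # LRM: task -> ordered subgoals
        for subgoal in subgoals:
            retries = 0
            while retries <= max_retries:
                obs = env.observe()
                action = executor.act(subgoal, obs)        # VLA: language + vision -> control
                obs = env.step(action)
                verifier.observe(obs.third_person, obs.wrist)
                verdict = verifier.maybe_verify(subgoal)   # Verdict from the sketch above
                if verdict is Verdict.RECOVER:
                    env.step(executor.recovery_action())   # e.g., "Lift the gripper"
                    retries += 1
                elif verdict is Verdict.RETRY:
                    retries += 1                           # re-attempt the same subgoal
                elif executor.subgoal_done(subgoal, obs):
                    break                                  # advance to the next subgoal
            else:
                raise RuntimeError(f"subgoal failed after {max_retries} retries: {subgoal}")

The design point this sketch tries to capture is that verification gates progression: execution never advances to the next subgoal until the verifier (or a completion check) signs off, which is what bounds error accumulation across the sequence.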

Experimental Results

Demonstration Videos

BibTeX

@misc{yang2025agenticrobotbraininspiredframework,
      title={Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents},
      author={Zhejian Yang and Yongchao Chen and Xueyang Zhou and Jiangyue Yan and Dingjie Song and Yinuo Liu and Yuting Li and Yu Zhang and Pan Zhou and Hechang Chen and Lichao Sun},
      year={2025},
      eprint={2505.23450},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.23450},
}