Gemini Robotics 1.5 brings AI agents into the physical world
September 25, 2025 · Models · Carolina Parada
We’re powering an era of physical agents, enabling robots to perceive, plan, think, use tools and act to better solve complex, multi-step tasks.
Earlier this year, we made incredible progress bringing Gemini’s multimodal understanding into the physical world, starting with the Gemini Robotics family of models.
Today, we’re taking another step towards advancing intelligent, truly general-purpose robots. We’re introducing two models that unlock agentic experiences with advanced thinking:
Gemini Robotics 1.5: Our most capable vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task. This model thinks before taking action and shows its process, helping robots assess and complete complex tasks more transparently. It also learns across embodiments, accelerating skill learning.
Gemini Robotics-ER 1.5: Our most capable vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission. This model now achieves state-of-the-art performance across spatial understanding benchmarks.
These advances will help developers build more capable and versatile robots that can actively understand their environment to complete complex, multi-step tasks in a general way.
Starting today, we’re making Gemini Robotics-ER 1.5 available to developers via the Gemini API in Google AI Studio. Gemini Robotics 1.5 is currently available to select partners. Read more about building with the next generation of physical agents on the Developer blog.
Gemini Robotics 1.5: Unlocking agentic experiences for physical tasks
Most daily tasks require contextual information and multiple steps to complete, making them notoriously challenging for robots today.
For example, if a robot were asked, “Based on my location, can you sort these objects into the correct compost, recycling and trash bins?” it would need to search the internet for the relevant local recycling guidelines, look at the objects in front of it, figure out how to sort them based on those rules, and then carry out every step needed to put them all away. So, to help robots complete these types of complex, multi-step tasks, we designed two models that work together in an agentic framework.
Our embodied reasoning model, Gemini Robotics-ER 1.5, orchestrates a robot’s activities, like a high-level brain. This model excels at planning and making logical decisions within physical environments. It has state-of-the-art spatial understanding, interacts in natural language, estimates its success and progress, and can natively call tools like Google Search to look for information or use any third-party user-defined functions.
Gemini Robotics-ER 1.5 then passes natural language instructions for each step to Gemini Robotics 1.5, which uses its vision and language understanding to directly perform the specific actions.
Gemini Robotics 1.5 also helps the robot think about its actions to better solve semantically complex tasks, and can even explain its thinking processes in natural language — making its decisions more transparent.
Diagram showing how our embodied reasoning model, Gemini Robotics-ER 1.5, and our vision-language-action model, Gemini Robotics 1.5, actively work together to perform complex tasks in the physical world.
Both of these models are built on the core Gemini family of models and have been fine-tuned with different datasets to specialize in their respective roles. When combined, they increase the robot’s ability to generalize to longer tasks and more diverse environments.
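The division of labor described above can be illustrated with a minimal sketch. It is not the Gemini API: the `Orchestrator` class, the `fake_search` tool and the `fake_executor` function are hypothetical stand-ins, and the "object: bin" guideline format and the fallback to the trash bin are assumptions made purely for illustration. The sketch only shows the control flow: a high-level planner consults a tool, emits one natural language instruction per step, and hands each instruction to a separate executor that reports success.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepResult:
    instruction: str
    success: bool


@dataclass
class Orchestrator:
    """Hypothetical stand-in for the high-level reasoning role."""
    tools: Dict[str, Callable[[str], str]]   # named tools the planner may call
    execute: Callable[[str], bool]           # stand-in for the action model
    log: List[StepResult] = field(default_factory=list)

    def plan(self, mission: str, objects: List[str]) -> List[str]:
        # Fetch contextual rules with a tool, then write one instruction per object.
        rules = self.tools["search"](mission)
        steps = []
        for obj in objects:
            bin_name = "trash"  # assumed fallback when no rule matches
            for line in rules.splitlines():
                name, _, target = line.partition(":")
                if name.strip() == obj:
                    bin_name = target.strip()
            steps.append(f"put the {obj} in the {bin_name} bin")
        return steps

    def run(self, mission: str, objects: List[str]) -> List[StepResult]:
        # Hand each instruction to the executor and stop at the first failure.
        for instruction in self.plan(mission, objects):
            ok = self.execute(instruction)
            self.log.append(StepResult(instruction, ok))
            if not ok:
                break
        return self.log


def fake_search(query: str) -> str:
    # Hypothetical tool returning canned guidelines as "object: bin" lines.
    return "banana peel: compost\nsoda can: recycling"


def fake_executor(instruction: str) -> bool:
    # Hypothetical executor that always reports success.
    return True


agent = Orchestrator(tools={"search": fake_search}, execute=fake_executor)
results = agent.run("sort these objects into the correct bins",
                    ["banana peel", "soda can", "chip bag"])
for r in results:
    print(r.instruction, "->", "done" if r.success else "failed")
```

Keeping the planner and the executor behind separate callables mirrors the split between the two models in the text: either side could be swapped out without changing the loop.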
Understands its environment
Gemini Robotics-ER 1.5 is the first thinking model optimized for embodied reasoning. It achieves state-of-the-art performance on both academic and internal benchmarks, inspired by real-world use cases from our trusted tester program.
We evaluated Gemini Robotics-ER 1.5 on 15 academic benchmarks including Embodied Reasoning Question Answering (ERQA) and Point-Bench, measuring the model’s performance on pointing, image question answering and video question answering. See details in our tech report.
Bar graph showing Gemini Robotics-ER 1.5’s state-of-the-art performance results compared to similar models. Our model achieves the highest aggregated performance on 15 academic embodied reasoning benchmarks, including Point-Bench, RefSpatial, RoboSpatial-Pointing, Where2Place, BLINK, CV-Bench, ERQA, EmbSpatial, MindCube, RoboSpatial-VQA, SAT, Cosmos-Reason1, Min Video Pairs, OpenEQA and VSI-Bench.
A collage of GIFs showing some of Gemini Robotics-ER 1.5’s capabilities, including object detection and state estimation, segmentation mask, pointing, trajectory prediction and task progress estimation and success detection.
Thinks before acting
Vision-language-action models traditionally translate instructions or linguistic plans directly into a robot’s movement. Beyond simply translating instructions or plans, Gemini Robotics 1.5 can now think before taking action. This means it can generate an internal sequence of reasoning and analysis in natural language to perform tasks that require multiple steps or a deeper semantic understanding.
For example, when completing a task like, “Sort my laundry by color,” the robot in the video below thinks at different levels. First, it understands that sorting by color means putting the white clothes in the white bin and other colors in the black bin. Then it thinks about steps to take, like picking up the red sweater and putting it in the black bin, and about the detailed motion involved, like moving a sweater closer to pick it up more easily.
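The three levels of thinking in the laundry example can be written out as a toy sketch. This is an illustration of the reasoning structure only, not how the model works internally: the `Garment` type, the `easy_to_grasp` flag and the wording of the trace lines are all hypothetical.

```python
from typing import List, NamedTuple


class Garment(NamedTuple):
    name: str
    color: str
    easy_to_grasp: bool  # hypothetical flag standing in for what the robot sees


def target_bin(color: str) -> str:
    # Level 1: interpret the task. White goes in the white bin, everything else in the black bin.
    return "white bin" if color.lower() == "white" else "black bin"


def think(garments: List[Garment]) -> List[str]:
    trace = []
    for g in garments:
        bin_name = target_bin(g.color)
        # Level 2: decide the next step for this garment.
        trace.append(f"plan: put the {g.color} {g.name} in the {bin_name}")
        # Level 3: decide on the detailed motion, adding a repositioning move if needed.
        if not g.easy_to_grasp:
            trace.append(f"motion: move the {g.name} closer before grasping")
        trace.append(f"motion: pick up the {g.name} and place it in the {bin_name}")
    return trace


trace = think([
    Garment("sweater", "red", easy_to_grasp=False),
    Garment("shirt", "white", easy_to_grasp=True),
])
for line in trace:
    print(line)
```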
During this multi-level thinking process, the vision-language-action model can decide to turn longer tasks into simpler, shorter segments that the robot can execute successfully. It also helps the model generalize to solve new tasks and be more robust to changes in its environment.
Learns across embodiments
Robots come in all shapes and sizes, and have different sensing capabilities and different degrees of freedom, making it difficult to transfer motions learned from one robot to another. Gemini Robotics 1.5 shows a…
Excerpt shown — open the source for the full document.