
Emerging Space Brief: Robotic Foundation Models 2025

Originally published July 17, 2025
pbinstitutionalresearch@pitchbook.com

EMERGING SPACE BRIEF
Robotic Foundation Models

Overview

Robotic foundation models (RFMs) are a novel class of AI models that serve as general-purpose “brains” for robots, integrating vision, language, and motion within a unified model. These models are pretrained on vast datasets, including internet-scale text, imagery, and robotic experience data, enabling them to learn rich representations and broad world knowledge. This approach allows RFMs to interpret complex commands and perform a wide range of robotic tasks with human-like generalization, shifting robotics toward more flexible, learning-driven systems.

Background

The evolution of robotics has progressed from early deterministic systems to data-driven machine learning, yet robots traditionally struggled to generalize beyond specific tasks. Concurrently, foundation models revolutionized AI in natural language processing and computer vision, demonstrating versatile capabilities through pretraining on massive datasets. This success inspired the development of RFMs in the early 2020s, aiming to create unified models that learn from extensive, multimodal data to enable robots to perform diverse tasks. Projects like DeepMind’s Gato and Google Robotics’ PaLM-E and Robotics Transformer 2 (RT-2) have showcased the potential for robots to leverage web-scale knowledge to dynamically take different actions in novel scenarios. This paradigm shift marks a convergence of vision, language, and action, moving robotics from specialized, pipeline architectures to more flexible and robust systems capable of broad real-world application.

Technologies and processes

Building RFMs necessitates innovation across AI algorithms, data pipelines, and specialized hardware. Key technologies and processes include:

Multimodal transformer architectures: RFMs predominantly leverage transformer-based neural networks capable of processing multiple input/output modalities. These architectures, often with billions of parameters, encode visual inputs, language commands, and sensor readings into a unified latent space, then decode actions or task plans. For instance, Google’s RT-2 uses a high-capacity vision-language transformer to directly translate visual observations and natural language into low-level control signals. Similarly, Physical Intelligence’s π0 model employs a novel transformer architecture with “flow matching” to learn a general policy for controlling different robot types. (A minimal sketch of this encode-decode pattern appears at the end of this section.)

Massive and diverse data collection: Acquiring “internet-scale” robotic data is crucial and involves innovative collection mechanisms, such as:

• Fleet learning: Companies like Covariant and Ambi Robotics continuously collect experience from their deployed robot fleets. Ambi Robotics, for example, pretrained its PRIME-1 foundation model on over 20 million real-world images from 150,000 hours of pick-and-place operations.

• Simulation-to-reality (Sim2Real) pipelines: High-fidelity simulators generate synthetic data at scale to complement real-world data. Projects like TartanAir create diverse simulated environments for navigation datasets, and NVIDIA’s Omniverse and Isaac platforms produce physics-based synthetic data for training, with NVIDIA reporting 20 million hours of synthetic autonomous driving and robot data.

• Self-supervised and weakly supervised learning: RFMs primarily utilize self-supervised objectives to learn from unlabeled data, predicting withheld information or aligning modalities. Ambi’s PRIME-1 leveraged self-supervised deep learning on its massive image dataset to achieve robust 3D understanding. Techniques like language embedding of play data, or using pretrained vision-language models like CLIP (contrastive language-image pretraining) in the DIAL (distributed instruction augmentation and learning) approach to auto-label sensor data, enable efficient utilization of vast, unannotated robot experience. (See the auto-labeling sketch at the end of this section.)

Hardware-software codesign: The computational demands of RFMs drive the codesign of hardware and software. Training RFMs requires powerful GPU/TPU clusters, with companies like Covariant investing in NVIDIA A100/H100 systems. For deployment, addressing latency and memory constraints is critical, leading to efforts in model compression and specialized edge AI hardware. (See the quantization sketch at the end of this section.) While many RFMs currently rely on cloud computing for heavy inference, the goal is to enable real-time inference on board robots using high-end embedded GPUs or new AI chips, as exemplified by Scout AI’s Fury defense RFM, which was designed for modularity and hardware-agnostic deployment on platforms like drones or ground vehicles.
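The encode-decode pattern described under “Multimodal transformer architectures” can be made concrete with a deliberately tiny model. The following is a minimal, hypothetical PyTorch sketch, assuming toy input sizes (96x96 images, a 14-dimensional robot state, a 7-DoF action) and a plain regression action head; it is not the RT-2 or π0 architecture (π0, for example, uses a flow-matching action decoder rather than direct regression), and every module name and size here is illustrative.

```python
import torch
import torch.nn as nn

class TinyRFMPolicy(nn.Module):
    """Toy multimodal policy: image + language + state -> one action."""

    def __init__(self, d_model=256, n_heads=4, n_layers=4, action_dim=7):
        super().__init__()
        # Patchify a 96x96 RGB frame into 36 visual tokens (16x16 patches).
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Toy language pathway: token IDs from any tokenizer -> embeddings.
        self.text_embed = nn.Embedding(32_000, d_model)
        # Proprioception (e.g., joint angles, gripper state) as one token.
        self.state_embed = nn.Linear(14, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        # Decode the fused latent into a continuous low-level action.
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image, text_ids, state):
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, 36, d)
        txt = self.text_embed(text_ids)                           # (B, T, d)
        st = self.state_embed(state).unsqueeze(1)                 # (B, 1, d)
        tokens = torch.cat([vis, txt, st], dim=1)  # unified latent sequence
        fused = self.fusion(tokens)
        return self.action_head(fused.mean(dim=1))  # pool, then predict

policy = TinyRFMPolicy()
action = policy(
    torch.randn(1, 3, 96, 96),           # camera frame
    torch.randint(0, 32_000, (1, 12)),   # tokenized command
    torch.randn(1, 14),                  # robot state
)
print(action.shape)  # torch.Size([1, 7])
```

Production RFMs differ mainly in scale (billions of parameters, pretrained vision-language backbones) and in the action decoder, but the fuse-then-decode structure is broadly similar.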
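The CLIP-based auto-labeling mentioned in the “Self-supervised and weakly supervised learning” bullet can be sketched as follows. This is a hedged illustration loosely in the spirit of DIAL, not the published method: unlabeled robot camera frames are scored against a bank of candidate instructions, and the best match above a confidence threshold is kept as a weak label. The checkpoint choice, instruction bank, and threshold are all assumptions made for the example.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly released CLIP checkpoint; any vision-language model exposing an
# image-text similarity score could play the same role.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical instruction bank for a pick-and-place workcell.
CANDIDATES = [
    "pick up the red block",
    "open the drawer",
    "place the cup on the shelf",
]

def weak_label(frame: Image.Image, threshold: float = 0.5):
    """Return the most likely instruction for a frame, or None if unsure."""
    inputs = processor(
        text=CANDIDATES, images=frame, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (1, len(CANDIDATES))
    probs = logits.softmax(dim=-1).squeeze(0)
    conf, idx = probs.max(dim=0)
    # Low-confidence frames stay unlabeled rather than getting a noisy label.
    return CANDIDATES[int(idx)] if float(conf) >= threshold else None

print(weak_label(Image.new("RGB", (224, 224))))  # stand-in for a real frame
```

Run over millions of logged frames, weak labeling of this kind turns raw fleet experience into (observation, instruction) pairs usable for pretraining.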
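On the deployment side, the model compression mentioned under “Hardware-software codesign” comes in many forms (distillation, pruning, compilation for specific accelerators). The sketch below shows only the simplest, assuming a stand-in policy network rather than any vendor’s actual pipeline: post-training dynamic quantization of linear layers to int8 with PyTorch, which shrinks weights roughly 4x and speeds up CPU inference.

```python
import torch
import torch.nn as nn

# Stand-in for a trained policy head; any nn.Module could go here.
policy = nn.Sequential(
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 7),  # 7-DoF action output
)

# Convert Linear weights to int8; activations are quantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8
)

obs = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(obs).shape)  # torch.Size([1, 7])
```

Shipping an RFM to an embedded GPU or AI accelerator typically adds vendor-specific steps, such as exporting and compiling for the target runtime, on top of compression like this.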
Applications

Robotic foundation models are being used in a wide range of industries by giving robots a better ability to understand their surroundings and perform tasks more flexibly.

Industrial automation and manufacturing: RFMs are fostering flexible automation and human-robot collaboration in manufacturing. For instance, a 2024 study demonstrated a human-robot collaboration (HRC) assembly