NVIDIA

Abstract

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via NVIDIA Cosmos.

1. Introduction

Physical AI is an AI system equipped with sensors and actuators: the sensors allow it to observe the world, and the actuators allow it to interact with and modify the world. It holds the promise of freeing human workers from physical tasks that are dangerous, laborious, or tedious. While several fields of AI have advanced significantly thanks to data and compute scaling in the recent decade, Physical AI has only inched forward. This is largely because scaling training data for Physical AI is much more challenging, as the desired data must contain sequences of interleaved observations and actions. These actions perturb the physical world and may cause severe damage to the system and the world. This is especially true while the AI is still in its infancy, when exploratory actions are essential. A World Foundation Model (WFM), a digital twin of the physical world that a Physical AI can safely interact with, has been a long-sought remedy to the data scaling problem.

In this paper, we introduce the Cosmos World Foundation Model (WFM) Platform for building Physical AI. We are mainly concerned with the visual world foundation model, where the observations are presented as videos and the perturbations can exist in various forms. As illustrated in Fig. 2, we present a pre-training-and-then-post-training paradigm, where we divide WFMs into pre-trained and post-trained WFMs. To build a pre-trained WFM, we leverage a large-scale video training dataset to expose the model to a diverse set of visual experiences so it can become a generalist. To build a post-trained WFM, we fine-tune the pre-trained WFM to arrive at a specialized WFM, using a dataset collected from a particular Physical AI environment for the targeted, specialized Physical AI setup. Fig. 1 shows example results from our pre-trained and post-trained WFMs.

Data determines the ceiling of an AI model. To build a high-ceiling pre-trained WFM, we develop a video data curation pipeline. We use it to locate portions of videos with rich dynamics and high visual quality that facilitate learning of the physics encoded in visual content. We use the pipeline to extract about 100M clips of 2 to 60 seconds each from a 20M-hour video collection. For each clip, we use a visual language model (VLM) to provide a video caption per 256 frames. Video processing is computationally intensive. We leverage hardware implementations of the H.264 video encoder and decoder available in modern GPUs for decoding and transcoding. Our video data curation pipeline leverages many pre-trained image/video understanding models with different throughputs. To maximize the overall throughput for generating trainable video data, we build a Ray-based orchestration pipeline (Moritz et al., 2017). The details are described in Sec. 3.
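To give a flavor of what such orchestration looks like, the sketch below runs a fast stage and a slow stage as pools of Ray actors, sized in inverse proportion to their throughput so that neither stage starves the other. The stage names, models, replica counts, and GPU fractions are hypothetical illustrations (and assume a GPU cluster), not the actual pipeline of Sec. 3.

```python
# A minimal sketch of heterogeneous-throughput orchestration with Ray.
# Stage names, models, and resource numbers are hypothetical, not the
# Cosmos curation pipeline described in Sec. 3; assumes GPUs are available.
import ray

ray.init()

@ray.remote(num_gpus=0.25)
class QualityFilter:
    """Fast, lightweight stage: a fractional GPU per replica suffices."""
    def process(self, clip):
        clip["quality"] = 0.9  # placeholder score from a small model
        return clip

@ray.remote(num_gpus=1)
class Captioner:
    """Slow stage (e.g., a VLM): scale out with more replicas."""
    def process(self, clip):
        clip["caption"] = f"caption for clip {clip['id']}"  # placeholder VLM output
        return clip

# Size each actor pool in inverse proportion to per-stage throughput so
# no single model becomes the pipeline bottleneck.
filters = [QualityFilter.remote() for _ in range(2)]
captioners = [Captioner.remote() for _ in range(8)]

clips = [{"id": i} for i in range(32)]
futures = []
for i, clip in enumerate(clips):
    filtered = filters[i % len(filters)].process.remote(clip)
    # Passing the ObjectRef chains the stages without gathering on the driver.
    futures.append(captioners[i % len(captioners)].process.remote(filtered))

print(len(ray.get(futures)), "clips processed")
```

Because Ray resolves the object reference returned by one actor before invoking the next, the two stages overlap without routing intermediate results through the driver.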
We explore two scalable approaches for building pre-trained WFMs, discussed in Sec. 5: transformer-based diffusion models and transformer-based autoregressive models. A diffusion model generates videos by gradually removing noise from a Gaussian noise video. An autoregressive model generates videos piece by piece, conditioned on the past generations, following a preset order. Both approaches decompose a difficult video generation problem into easier sub-problems, making it more tractable; toy sketches of both generation loops appear at the end of this section. We leverage state-of-the-art transformer architectures for their scalability. In Sec. 5.1, we present a transformer-based diffusion model design that exhibits strong world-generation capabilities. In Sec. 5.2, we present a transformer-based autoregressive model design for world generation.

Both the transformer-based diffusion model and the transformer-based autoregressive model use tokens as representations of videos: the former uses continuous tokens in the form of vectors, and the latter uses discrete tokens in the form of integers. We note that tokenization for videos, a process that transforms videos into a set of tokens, is highly nontrivial. Video contains rich information about the visual world. However, to facilitate learning of the WFMs, we need to compress videos into sequences of compact tokens while preserving the essential content of the original videos.
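To make the two generation loops concrete, the following is a toy sketch of diffusion-style sampling: it starts from a Gaussian noise "video" in a continuous latent space and repeatedly subtracts the noise predicted by a denoiser. The DDPM-style schedule, latent shape, and the dummy denoiser are hypothetical stand-ins, not the Cosmos diffusion WFM design of Sec. 5.1.

```python
# Toy DDPM-style sampler: iteratively remove predicted noise from a
# Gaussian noise video in continuous latent space. All shapes, the noise
# schedule, and the denoiser are illustrative stand-ins.
import torch

def sample(denoiser, steps=50, shape=(1, 16, 8, 8)):
    """shape = (batch, latent frames, height, width) of continuous tokens."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from a pure Gaussian noise "video"
    for t in reversed(range(steps)):
        eps = denoiser(x, t)  # predicted noise component at step t
        # DDPM posterior mean: strip the predicted noise, then rescale.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # scheduled noise re-injection
    return x

# Run with a dummy denoiser that predicts zero noise (illustration only);
# a real WFM would use a trained transformer here.
latents = sample(denoiser=lambda x, t: torch.zeros_like(x))
```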
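The matching autoregressive loop operates on discrete integer tokens, such as those produced by a discrete video tokenizer, and emits one token at a time conditioned on everything generated so far. The model stub, vocabulary size, and prompt below are hypothetical, not the Cosmos autoregressive WFM design of Sec. 5.2.

```python
# Toy next-token sampling loop over discrete video tokens. The model,
# vocabulary, and prompt are illustrative stand-ins.
import torch

@torch.no_grad()
def generate(model, prompt_tokens, num_new_tokens, temperature=1.0):
    tokens = prompt_tokens  # (batch, sequence length) of integer token ids
    for _ in range(num_new_tokens):
        logits = model(tokens)[:, -1, :]  # distribution over the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)  # condition on the past
    return tokens  # a discrete tokenizer's decoder would map these back to pixels

# Run with a dummy model that returns random logits (illustration only).
vocab_size = 1024
dummy_model = lambda toks: torch.randn(toks.shape[0], toks.shape[1], vocab_size)
prompt = torch.randint(0, vocab_size, (1, 16))
video_tokens = generate(dummy_model, prompt, num_new_tokens=64)
```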