NVIDIA

Abstract

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via NVIDIA Cosmos.

1. Introduction

Physical AI is an AI system equipped with sensors and actuators: the sensors allow it to observe the world, and the actuators allow it to interact with and modify the world. It holds the promise of freeing human workers from physical tasks that are dangerous, laborious, or tedious. While several fields of AI have advanced significantly thanks to data and compute scaling in the recent decade, Physical AI has only inched forward. This is largely because scaling training data for Physical AI is much more challenging, as the desired data must contain sequences of interleaved observations and actions. These actions perturb the physical world and may cause severe damage to the system and the world. This is especially true while the AI is still in its infancy, when exploratory actions are essential. A World Foundation Model (WFM), a digital twin of the physical world that a Physical AI can safely interact with, has been a long-sought remedy to the data scaling problem.

In this paper, we introduce the Cosmos World Foundation Model (WFM) Platform for building Physical AI. We are mainly concerned with the visual world foundation model, where the observations are presented as videos and the perturbations can exist in various forms. As illustrated in Fig. 2, we present a pre-training-and-then-post-training paradigm, where we divide WFMs into pre-trained and post-trained WFMs. To build a pre-trained WFM, we leverage a large-scale video training dataset to expose the model to a diverse set of visual experiences so it can become a generalist. To build a post-trained WFM, we fine-tune the pre-trained WFM to arrive at a specialized WFM, using a dataset collected from a particular Physical AI environment for the targeted, specialized Physical AI setup. Fig. 1 shows example results from our pre-trained and post-trained WFMs.

Data determines the ceiling of an AI model. To build a high-ceiling pre-trained WFM, we develop a video data curation pipeline. We use it to locate portions of videos with rich dynamics and high visual quality that facilitate learning of the physics encoded in visual content. We use the pipeline to extract about 100M clips of 2 to 60 seconds each from a 20M-hour video collection. For each clip, we use a visual language model (VLM) to provide a video caption per 256 frames. Video processing is computationally intensive. We leverage hardware implementations of the H.264 video encoder and decoder available in modern GPUs for decoding and transcoding. Our video data curation pipeline leverages many pre-trained image/video understanding models with different throughputs. To maximize the overall throughput for generating trainable video data, we build a Ray-based orchestration pipeline (Moritz et al., 2017). The details are described in Sec. 3.
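To give a flavor of what such orchestration looks like, the sketch below runs a fast stage and a slow stage as pools of Ray actors, sized in inverse proportion to their throughput so that neither stage starves the other. The stage names, models, replica counts, and GPU fractions are hypothetical illustrations (and assume a GPU cluster), not the actual pipeline of Sec. 3.

```python
# A minimal sketch of heterogeneous-throughput orchestration with Ray.
# Stage names, models, and resource numbers are hypothetical, not the
# Cosmos curation pipeline described in Sec. 3; assumes GPUs are available.
import ray

ray.init()

@ray.remote(num_gpus=0.25)
class QualityFilter:
    """Fast, lightweight stage: a fractional GPU per replica suffices."""
    def process(self, clip):
        clip["quality"] = 0.9  # placeholder score from a small model
        return clip

@ray.remote(num_gpus=1)
class Captioner:
    """Slow stage (e.g., a VLM): scale out with more replicas."""
    def process(self, clip):
        clip["caption"] = f"caption for clip {clip['id']}"  # placeholder VLM output
        return clip

# Size each actor pool in inverse proportion to per-stage throughput so
# no single model becomes the pipeline bottleneck.
filters = [QualityFilter.remote() for _ in range(2)]
captioners = [Captioner.remote() for _ in range(8)]

clips = [{"id": i} for i in range(32)]
futures = []
for i, clip in enumerate(clips):
    filtered = filters[i % len(filters)].process.remote(clip)
    # Passing the ObjectRef chains the stages without gathering on the driver.
    futures.append(captioners[i % len(captioners)].process.remote(filtered))

print(len(ray.get(futures)), "clips processed")
```

Because Ray resolves the object reference returned by one actor before invoking the next, the two stages overlap without routing intermediate results through the driver.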
We explore two scalable approaches for building pre-trained WFMs, discussed in Sec. 5: transformer-based diffusion models and transformer-based autoregressive models. A diffusion model generates videos by gradually removing noise from a Gaussian noise video. An autoregressive model generates videos piece by piece, conditioned on the past generations, following a preset order. Both approaches decompose a difficult video generation problem into easier sub-problems, making it more tractable; toy sketches of both generation loops appear at the end of this section. We leverage state-of-the-art transformer architectures for their scalability. In Sec. 5.1, we present a transformer-based diffusion model design that exhibits strong world-generation capabilities. In Sec. 5.2, we present a transformer-based autoregressive model design for world generation.

Both the transformer-based diffusion model and the transformer-based autoregressive model use tokens as representations of videos: the former uses continuous tokens in the form of vectors, and the latter uses discrete tokens in the form of integers. We note that tokenization for videos, a process that transforms videos into a set of tokens, is highly nontrivial. Video contains rich information about the visual world. However, to facilitate learning of the WFMs, we need to compress videos into sequences of compact tokens while preserving the essential content of the original videos.
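To make the two generation loops concrete, the following is a toy sketch of diffusion-style sampling: it starts from a Gaussian noise "video" in a continuous latent space and repeatedly subtracts the noise predicted by a denoiser. The DDPM-style schedule, latent shape, and the dummy denoiser are hypothetical stand-ins, not the Cosmos diffusion WFM design of Sec. 5.1.

```python
# Toy DDPM-style sampler: iteratively remove predicted noise from a
# Gaussian noise video in continuous latent space. All shapes, the noise
# schedule, and the denoiser are illustrative stand-ins.
import torch

def sample(denoiser, steps=50, shape=(1, 16, 8, 8)):
    """shape = (batch, latent frames, height, width) of continuous tokens."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from a pure Gaussian noise "video"
    for t in reversed(range(steps)):
        eps = denoiser(x, t)  # predicted noise component at step t
        # DDPM posterior mean: strip the predicted noise, then rescale.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # scheduled noise re-injection
    return x

# Run with a dummy denoiser that predicts zero noise (illustration only);
# a real WFM would use a trained transformer here.
latents = sample(denoiser=lambda x, t: torch.zeros_like(x))
```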
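The matching autoregressive loop operates on discrete integer tokens, such as those produced by a discrete video tokenizer, and emits one token at a time conditioned on everything generated so far. The model stub, vocabulary size, and prompt below are hypothetical, not the Cosmos autoregressive WFM design of Sec. 5.2.

```python
# Toy next-token sampling loop over discrete video tokens. The model,
# vocabulary, and prompt are illustrative stand-ins.
import torch

@torch.no_grad()
def generate(model, prompt_tokens, num_new_tokens, temperature=1.0):
    tokens = prompt_tokens  # (batch, sequence length) of integer token ids
    for _ in range(num_new_tokens):
        logits = model(tokens)[:, -1, :]  # distribution over the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)  # condition on the past
    return tokens  # a discrete tokenizer's decoder would map these back to pixels

# Run with a dummy model that returns random logits (illustration only).
vocab_size = 1024
dummy_model = lambda toks: torch.randn(toks.shape[0], toks.shape[1], vocab_size)
prompt = torch.randint(0, vocab_size, (1, 16))
video_tokens = generate(dummy_model, prompt, num_new_tokens=64)
```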