
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

2025-07-11 Google

helpfulness and general tone compared to their 2.0 and 1.5 counterparts. In practice, this means that the 2.5 models are substantially better at providing safe responses without interfering with important use cases or lecturing end users. We also evaluated Gemini 2.5 Pro's Critical Capabilities, including CBRN, cybersecurity, machine learning R&D, and deceptive alignment. While Gemini 2.5 Pro showed a significant increase in some capabilities compared to previous Gemini models, it did not reach any of the Critical Capability Levels in any area.

Our report is structured as follows: we begin by briefly describing advances we have made in model architecture, training and serving since the release of the Gemini 1.5 model. We then showcase the performance of the Gemini 2.5 models, including qualitative demonstrations of their abilities. We conclude by discussing the safety evaluations and implications of this model series.

2. Model Architecture, Training and Dataset

2.1. Model Architecture

The Gemini 2.5 models are sparse mixture-of-experts (MoE) (Clark et al., 2022; Du et al., 2021; Fedus et al., 2021; Jiang et al., 2024; Lepikhin et al., 2020; Riquelme et al., 2021; Roller et al., 2021; Shazeer et al., 2017) transformers (Vaswani et al., 2017) with native multimodal support for text, vision, and audio inputs. Sparse MoE models activate a subset of model parameters per input token by learning to dynamically route tokens to a subset of parameters (experts); this allows them to decouple total model capacity from computation and serving cost per token. Developments to the model architecture contribute to the significantly improved performance of Gemini 2.5 compared to Gemini 1.5 Pro (see Section 3). Despite their overwhelming success, large transformers and sparse MoE models are known to suffer from training instabilities (Chowdhery et al., 2022; Dehghani et al., 2023; Fedus et al., 2021; Lepikhin et al., 2020; Liu et al., 2020; Molybog et al., 2023; Wortsman et al., 2023; Zhai et al., 2023; Zhang et al., 2022). The Gemini 2.5 model series makes considerable progress in enhancing large-scale training stability, signal propagation and optimization dynamics, resulting in a considerable boost in performance straight out of pre-training compared to previous Gemini models.
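To make the routing idea concrete, here is a minimal, illustrative top-k MoE layer in JAX. It is not the Gemini implementation: the function and parameter names (moe_layer, router_w, expert_w), the single projection per expert, and all sizes are assumptions for the example, and expert weights are gathered densely per token for readability rather than dispatched as a production system would do.

```python
import jax
import jax.numpy as jnp

def moe_layer(tokens, router_w, expert_w, k=2):
    """Illustrative top-k mixture-of-experts layer (one projection per expert).

    tokens:   [num_tokens, d_model]
    router_w: [d_model, num_experts]       learned routing matrix
    expert_w: [num_experts, d_model, d_ff] one projection per expert
    """
    # The router scores every token against every expert and keeps the top k.
    logits = tokens @ router_w                         # [tokens, experts]
    gate_vals, expert_ids = jax.lax.top_k(logits, k)   # both [tokens, k]
    gates = jax.nn.softmax(gate_vals, axis=-1)         # normalise over chosen experts

    # Each token is processed only by its k selected experts, and their outputs
    # are mixed with the gate weights. Here we gather expert weights per token
    # for clarity; real systems dispatch tokens so unselected experts do no work.
    selected = expert_w[expert_ids]                    # [tokens, k, d_model, d_ff]
    expert_out = jax.nn.relu(jnp.einsum("td,tkdf->tkf", tokens, selected))
    return jnp.einsum("tk,tkf->tf", gates, expert_out) # [tokens, d_ff]

key = jax.random.PRNGKey(0)
toks = jax.random.normal(key, (4, 16))         # 4 tokens, d_model = 16
router = jax.random.normal(key, (16, 8))       # 8 experts
experts = jax.random.normal(key, (8, 16, 32))  # d_ff = 32
out = moe_layer(toks, router, experts)         # [4, 32]
```

The top-k selection is where capacity decouples from per-token cost: adding experts grows the expert stack and the router's output dimension, but each token still touches only k experts' weights.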
Gemini 2.5 models build on the success of Gemini 1.5 in processing long-context queries, and incorporate new modeling advances allowing Gemini 2.5 Pro to surpass the performance of Gemini 1.5 Pro in processing long-context input sequences of up to 1M tokens (see Table 3). Both Gemini 2.5 Pro and Gemini 2.5 Flash can process pieces of long-form text (such as the entirety of "Moby Dick" or "Don Quixote"), whole codebases, and long-form audio and video data (see Appendix 8.5). Together with advancements in long-context abilities, architectural changes to Gemini 2.5 vision processing lead to a considerable improvement in image and video understanding capabilities, including being able to process 3-hour-long videos and the ability to convert demonstrative videos into interactive coding applications (see our recent blog post by Baddepudi et al., 2025).

The smaller models in the Gemini 2.5 series (Flash size and below) use distillation (Anil et al., 2018; Hinton et al., 2015), as was done in the Gemini 1.5 series (Gemini Team, 2024). To reduce the cost associated with storing the teacher's next token prediction distribution, we approximate it using a k-sparse distribution over the vocabulary. While this still increases training data throughput and storage demands by a factor of k, we find this to be a worthwhile trade-off given the significant quality improvement distillation has on our smaller models, leading to high-quality models with a reduced serving cost (see Figure 2).

2.2. Dataset

Our pre-training dataset is a large-scale, diverse collection of data encompassing a wide range of domains and modalities, which includes publicly available web documents, code (various programming languages), images, audio (including speech and other audio types) and video, with a cutoff date of June 2024 for 2.0 and January 2025 for 2.5. Compared to the Gemini 1.5 pre-training dataset, we also utilized new methods for improved data quality for both filtering and deduplication. Our post-training dataset, like Gemini 1.5, consists of instruction tuning data that is carefully collected and vetted. It is a collection of multimodal data with paired instructions and responses, in addition to human preference and tool-use data.

2.3. Training Infrastructure

This model family is the first to be trained on TPUv5p architecture. We employed synchronous data-parallel training to parallelise over multiple 8960-chip pods of Google's TPUv5p accelerators, distributed across multiple datacenters.

The main advances in software pre-training infrastructure compared with Gemini 1.5 were related to elasticity and mitigation of SDC (Silent Data Corruption) errors:

1. Slice-Granularity Elasticity: Our system now automatically continues training w
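The synchronous data-parallel pattern described in Section 2.3 can be illustrated with a small JAX sketch: each replica computes gradients on its own shard of the global batch, the gradients are averaged with a collective all-reduce, and every replica applies the identical update. This is a generic sketch of the pattern, not Google's training stack; the toy loss, the names (train_step, loss_fn), and all sizes are invented, and jax.pmap stands in for whatever parallelism machinery is used at pod scale.

```python
import functools
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Toy regression loss standing in for the real training objective.
    x, y = batch
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@functools.partial(jax.pmap, axis_name="data")
def train_step(params, batch, lr):
    # Each replica computes gradients on its local shard of the global batch...
    grads = jax.grad(loss_fn)(params, batch)
    # ...then all replicas average them with a synchronous all-reduce, so the
    # update applied below is identical everywhere (pure data parallelism).
    grads = jax.lax.pmean(grads, axis_name="data")
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

n_dev = jax.local_device_count()
params = {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}
# Replicate parameters on every device and shard the batch along axis 0.
params = jax.device_put_replicated(params, jax.local_devices())
x = jnp.ones((n_dev, 16, 8))   # [devices, per-device batch, features]
y = jnp.ones((n_dev, 16, 1))
lr = jnp.full((n_dev,), 1e-2)  # learning rate, replicated per device
params = train_step(params, (x, y), lr)
```

Because the all-reduce is synchronous, every replica holds identical parameters at the end of each step.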