NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model


NVIDIA

Abstract. We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22 GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6× higher inference throughput in reasoning settings like 8k input and 16k output tokens (Figure 1). We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

1. Introduction

We introduce NVIDIA Nemotron Nano 2, a hybrid Mamba-Transformer reasoning model (Waleffe et al., 2024; Lieber et al., 2024; DeepMind, 2025; NVIDIA, 2025) that achieves on-par or better benchmark accuracies at 3×–6× higher throughput than Qwen3-8B (Yang et al., 2025) for generation-heavy scenarios like 1k input / 8k output or 8k input / 16k output tokens (Figure 1). Nemotron Nano 2 builds on the architecture of Nemotron-H (NVIDIA, 2025), but utilizes key new datasets and recipes for pre-training, alignment, pruning, and distillation. We share these recipes, the checkpoints, as well as the majority of the pre- and post-training datasets.

The initial base model, Nemotron-Nano-12B-v2-Base, was pre-trained using FP8 precision (§2.4) over 20 trillion tokens using a Warmup-Stable-Decay (Hu et al., 2024) learning rate schedule (§2.5). It then underwent a continuous pre-training long-context extension phase to become 128k-capable without degrading other benchmarks (§2.6). Overall, new and improved datasets led to significant accuracy improvements over Nemotron-H-8B on math, multilingual, MMLU-Pro, and other benchmarks (§2.2).

Nemotron Nano 2 was then post-trained through a combination of Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO) (Shao et al., 2024), Direct Preference Optimization (DPO) (Rafailov et al., 2023), and Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Christiano et al., 2017). We applied multiple SFT stages across various domains, followed by targeted SFT on key areas such as tool use, long-context performance, and truncated (budgeted) training. GRPO and RLHF sharpened instruction-following and conversational ability, while additional DPO stages further strengthened tool use. Overall, post-training was performed on roughly 90 billion tokens, the majority in single-turn prompt–response format with reasoning traces. About 5% of the data contained deliberately truncated reasoning traces, enabling fine-grained thinking budget control at inference time (§3.4).
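To make the idea of thinking budget control concrete, the sketch below shows one way such a cap might be enforced at inference time, assuming the model wraps its reasoning in <think>…</think> tags, that </think> is a single special token in the tokenizer, and that the model exposes a standard Hugging Face generate() interface. The tag names, the two-phase generation loop, and the helper function are illustrative assumptions, not the released control mechanism described in §3.4.

```python
import torch

def generate_with_thinking_budget(model, tokenizer, prompt, think_budget=512, answer_budget=512):
    """Cap the reasoning trace at `think_budget` tokens, close the thinking span,
    then let the model produce the final answer within `answer_budget` tokens."""
    messages = [{"role": "user", "content": prompt}]
    prompt_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Assumption: "</think>" maps to a single special token id in this tokenizer.
    end_think_id = tokenizer.convert_tokens_to_ids("</think>")

    # Phase 1: let the model think, stopping early if it closes the span itself.
    with torch.no_grad():
        thought = model.generate(
            prompt_ids, max_new_tokens=think_budget, eos_token_id=end_think_id
        )

    # Phase 2: if the budget ran out before </think>, append it manually so the
    # model switches to answering -- mirroring the deliberately truncated traces
    # present in roughly 5% of the post-training data.
    if thought[0, -1].item() != end_think_id:
        closing = torch.tensor([[end_think_id]], device=thought.device)
        thought = torch.cat([thought, closing], dim=-1)

    # Phase 3: generate the final answer with the remaining budget.
    with torch.no_grad():
        output = model.generate(thought, max_new_tokens=answer_budget)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

The two-phase loop is just one possible realization of a budget: the training-time truncation described above is what teaches the model to produce a useful answer even when its reasoning span is cut short.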
Finally, both the base model and the aligned model were compressed so as to enable inference over context lengths of 128k tokens on a single NVIDIA A10G GPU (22 GiB of memory, bfloat16 precision). This was done by extending a compression strategy based on Minitron (Muralidharan et al., 2024; Sreenivas et al., 2024; Taghibakhshi et al., 2025) to compress reasoning models subject to constraints.

We are releasing the following models on Hugging Face:

• NVIDIA-Nemotron-Nano-9B-v2: the aligned and pruned reasoning model,
• NVIDIA-Nemotron-Nano-9B-v2-Base: a pruned base model,
• NVIDIA-Nemotron-Nano-12B-v2-Base: the base model before alignment or pruning.

Additionally, we are releasing the majority of our pre-training dataset in the Nemotron-Pre-Training-Dataset-v1 collection of more than 6 trillion tokens:

• Nemotron-CC-v2: Follow-up to Nemotron-CC (Su et al., 2025) with eight additional Common Crawl snapshots (2024–2025), synthetic rephrasing, deduplication, and synthetic Q&A data translated into 15 languages.
• Nemotron-CC-Math-v1: A 133B-token math dataset from Common Crawl built with a Lynx + LLM pipeline (Karimi Mahabadi et al., 2025a). It preserves equations, standardizes them to LaTeX, and outperforms previous math datasets on benchmarks.
• Nemotron-Pretraining-Code-v1: Curated GitHub code references with multi-stage filtering, deduplication, and quality filters. Includes code Q&A data in 11 programming languages.
• Nemotron-Pretraining-SFT-v1: Synthetic SFT-style dataset covering STEM, multilingual, academic, and reasoning domains.

Finally, we are releasing an updated post-training dataset:

• Nemotron-Post-Training-Dataset-v2
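For readers who want to try the released checkpoints, the snippet below is a minimal loading sketch with the transformers library. The repository id (assumed to live under the nvidia organization with the names listed above), the trust_remote_code flag, and the example prompt are assumptions, not an official usage guide; bfloat16 matches the precision used for the A10G sizing target.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed org/name layout for the released reasoning model.
repo_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,   # precision assumed for the single-A10G setting
    trust_remote_code=True,       # the hybrid Mamba-Transformer stack may ship custom modeling code
    device_map="auto",
)

messages = [{"role": "user", "content": "Solve 12 * 17 step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=256)[0]))
```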