
The Smol Training Playbook: The Secrets to Building World-Class LLMs

Information Technology · 2025-10-30 · Unknown institution L***

AI Summary

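This article takes an in-depth look at how large language models (LLMs) are trained, from architecture design and data mixing to hyperparameter tuning, and at how to build an efficient, stable training system. It stresses the importance of an experimental framework, validates each design decision through a series of ablation experiments, and culminates in the successful training of SmolLM3.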

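Key points:

Importance of the experimental framework: systematic ablation experiments validate each design decision, so that every step of model development is tested and optimized.
Architecture choices: the article compares attention mechanisms, embedding strategies, and positional encoding schemes, and uses experiments to settle on the configuration best suited to SmolLM3.
Data mixing: the article discusses how to select and mix datasets, and uses experiments to determine the best data mixture ratios for SmolLM3.
Hyperparameter tuning: the article covers different optimizers and learning rate schedules, and uses experiments to determine the best hyperparameter settings for SmolLM3.
Training monitoring: the article stresses the importance of tracking key metrics during training, and describes the tools and techniques used to keep training stable.
Multi-stage training: the article introduces multi-stage training and shows experimentally that it improves model performance.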

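Key data and findings:

SmolLM3 has 3B parameters and was trained on 11T tokens, using GQA attention, the NoPE positional encoding scheme, and document masking, among other techniques.
SmolLM3 performs strongly on a range of downstream tasks, including math, code, multilingual understanding, and long-context processing.
The article also covers how preference optimization and reinforcement learning can further improve model performance.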
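
To make the document masking mentioned above concrete, here is a minimal sketch. It is not SmolLM3's actual implementation. It assumes that several documents are packed into one training sequence and that each token carries the index of the document it came from. The function name `document_causal_mask` and the `doc_ids` input are hypothetical names chosen for illustration.

```python
def document_causal_mask(doc_ids):
    """Build a boolean attention mask for a packed sequence.

    doc_ids[i] is the (assumed) index of the document that token i belongs to.
    mask[i][j] is True when query token i may attend to key token j, i.e. when
    j is not in the future (causal) and both tokens share the same document.
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two packed documents: tokens 0-2 belong to document 0, tokens 3-4 to document 1.
    for row in document_causal_mask([0, 0, 0, 1, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

In this sketch the first token of the second document attends only to itself, so no information leaks across the document boundary. A real training stack would typically build the same pattern as a tensor and pass it to the attention kernel.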

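Summary:

This article is a comprehensive guide to LLM training, covering architecture design, data mixing, hyperparameter tuning, and training monitoring. Systematic experimentation and optimization make it possible to build efficient, stable LLMs.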

AUTHORS Loubna Ben Allal, Lewis Tunstall, Nouamane Tazi, Elie Bakouch, Ed Beeching, Carlos Miguel Patiño, Clémentine Fourrier, Thibaud Frere, Anton Lozhkov, Colin Raffel, Leandro von Werra, Thomas Wolf

Introduction

What does it actually take to train a high-performance LLM today?

Published research makes it look straightforward: strategic architecture choices, carefully curated datasets, and sufficient compute. The results are polished, the ablations are structured and clean. Every decision seems obvious in hindsight. But those reports only show what worked and apply a bit of rosy retrospection – they don’t capture the 2am dataloader debugging sessions, the loss spikes, or the subtle tensor parallelism bug (see later!) that quietly sabotages your training. The reality is messier, more iterative, and full of decisions that don’t make it into the final paper.

Join us as we look behind the scenes of training SmolLM3, a 3B multilingual reasoning model trained on 11T tokens. This is not an ordinary blog post, but rather the untangling of a spiderweb of decisions, discoveries, and dead ends that led to deep insights into what it takes to build world-class language models.

It is also the final opus in our model-training long-form series: we’ve worked through building datasets at scale (FineWeb), orchestrating thousands of GPUs to sing in unison (Ultra Scale Playbook), and selecting the best evaluations at each step of the process (Evaluation Guidebook). Now we shape it all together to build a strong AI model. We’ll walk you through the complete journey – not just the final recipe that worked, but the failures, infrastructure breakdowns, and debugging processes that shaped every decision.

The story reads like a drama: you’ll see how promising small-scale ablations sometimes don’t translate at scale, why we restarted a training after 1T tokens, how we balanced the competing objectives of multilinguality, math, and code while maintaining strong English performance, and finally how we post-trained a hybrid reasoning model.

We also tried to avoid a cold list of all we did in favour of an organized story through our adventure. Think of this as a guide for anyone trying to go from “we have a great dataset and GPUs” to “we built a really strong model”. We hope being this open will help close the gap between research and production, and make your next training run a little less chaotic.

How to read this blog post

You don’t need to read this blog post from top to bottom, and at this point it’s too long to realistically read end-to-end in one sitting anyway. The blog post is structured in several distinct pieces that can be skipped or read individually:

Training compass: A high-level discussion about whether or not you should pretrain your own model. We walk you through fundamental questions to ask yourself before burning through all your VC money, and how to think systematically through the decision process. This is a high-level section, if you want to skip straight to the technical content, scroll quickly past this part.
Pretraining: The sections following the training compass cover everything you need to know to build a solid recipe for your own pretraining run: how to run ablations, select evaluations, mix data sources, make architecture choices, tune hyperparameters, and finally endure the training marathon. This section also applies if you’re not planning to pretrain from scratch but are interested in continued pretraining (aka mid-training).
Post-training: In this part of the blog you’ll learn all the tricks needed to get most out of your pretrained models. Learn the whole post-training alphabet starting with SFT, DPO and GRPO as well as the dark arts and alchemy of model merging. Most of the knowledge about making these algorithms work well is learned through painful lessons, and we’ll share our experience here to hopefully spare you some of them.
Infrastructure: If pretraining is the cake and post-training is the icing and cherry on top, then infrastructure is the industrial-grade oven. Without it, nothing happens, and if it’s broken, your happy Sunday baking session turns into a fire hazard. Knowledge about how to understand, analyse, and debug GPU clusters is scattered across the internet in various libraries, docs, and forums. This section walks through GPU layout, communication patterns between CPU/GPU/nodes/storage, and how to identify and overcome bottlenecks.

So where do we even start? Pick the section that you find most exciting and let’s go!

Training compass: why → what → how

The field of machine learning has an obsessive relationship with optimisation. We fixate on loss curves, model architectures, and throughput; after all, machine learning is fundamentally about optimising the loss function of a model. Yet before diving into these technical details, there’s a more fundamental question that often goes unasked: should we even be training this model?

As shown in the heatmap below, the open-source AI ecosystem releases world-class mo
