
[University of Georgia & University of Texas at Arlington] Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions


compact, high-impact datasets through optimization-based gradient matching, latent-space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities. We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.

Keywords: Large Language Models, Knowledge Distillation, Dataset Distillation, Efficiency, Model Compression, Survey

1 Introduction

The emergence of Large Language Models (LLMs) like GPT-4 (Brown et al., 2020), DeepSeek (Guo et al., 2025), and LLaMA (Touvron et al., 2023) has transformed natural language processing, enabling unprecedented capabilities in tasks like translation, reasoning, and text generation. Despite these landmark achievements, such advances come with significant challenges that hinder practical deployment. First, LLMs demand immense computational resources, often requiring thousands of GPU hours for training and inference, which translates to high energy consumption and environmental costs. Second, their reliance on massive training datasets raises concerns about data efficiency, quality, and sustainability, as public corpora become overutilized and maintaining diverse, high-quality data becomes increasingly difficult (Hadi et al., 2023). Additionally, LLMs exhibit emergent abilities, such as chain-of-thought reasoning (Wei et al., 2022), which are challenging to replicate in smaller models without sophisticated knowledge transfer techniques.

To surmount these challenges, distillation has emerged as a pivotal strategy, integrating Knowledge Distillation (KD) (Hinton et al., 2015) and Dataset Distillation (DD) (Wang et al., 2018) to tackle both model compression and data efficiency. Crucially, the success of KD in LLMs hinges on DD techniques, which enable the creation of compact, information-rich synthetic datasets that encapsulate the diverse and complex knowledge of the teacher LLMs. KD transfers knowledge from a large, pre-trained teacher model to a smaller, more efficient student model by aligning outputs or intermediate representations. While effective for moderate-scale teacher models, traditional KD struggles with LLMs due to their vast scale, where knowledge is distributed across billions of parameters and intricate attention patterns. Moreover, the knowledge is not limited to output distributions or intermediate representations but also includes higher-order capabilities such as reasoning ability and complex problem-solving skills (Wilkins and Rodriguez, 2024; Zhao et al., 2023; Latif et al., 2024). DD aims to condense large training datasets into compact synthetic datasets that retain the essential information required to train models efficiently.
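To make the output-alignment mechanism of KD just described more concrete, the following minimal PyTorch sketch blends a temperature-softened KL-divergence term between teacher and student logits with the standard cross-entropy on ground-truth labels, in the spirit of Hinton et al. (2015). The temperature and weighting values here are illustrative assumptions, not settings prescribed by the surveyed works.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hinton-style KD loss: soft-target KL blended with hard-label CE.

    temperature and alpha are illustrative hyperparameters, not values
    taken from the survey.
    """
    # Soften both output distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions, scaled by T^2 so
    # gradient magnitudes stay comparable to the cross-entropy term.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard cross-entropy on the ground-truth class indices.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```

In token-level LLM distillation the same loss is typically applied position by position, with the vocabulary as the class dimension; the single-step form above is meant only to convey the core idea.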
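A similarly minimal sketch conveys the optimization-based gradient-matching formulation of DD named in the abstract: the synthetic examples are learnable tensors, updated so that the gradients they induce in a small proxy model align with those induced by a real batch. The proxy model, the layer-wise cosine objective, and the hypothetical syn_optimizer holding the learnable syn_x are assumptions made for illustration, not the exact algorithm of any single cited method.

```python
import torch
import torch.nn.functional as F

def gradient_matching_step(model, criterion, real_x, real_y,
                           syn_x, syn_y, syn_optimizer):
    """One simplified dataset-distillation step via gradient matching.

    Assumes syn_x is a leaf tensor with requires_grad=True and that
    syn_optimizer was built over [syn_x]; the layer-wise cosine objective
    is an illustrative choice.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradients induced on the proxy model by a batch of real data.
    real_grads = torch.autograd.grad(criterion(model(real_x), real_y), params)

    # Gradients induced by the synthetic batch, kept differentiable so the
    # matching loss can be backpropagated into syn_x itself.
    syn_grads = torch.autograd.grad(criterion(model(syn_x), syn_y), params,
                                    create_graph=True)

    # Layer-wise (1 - cosine similarity) between the two sets of gradients.
    match_loss = sum(
        1.0 - F.cosine_similarity(sg.flatten(), rg.detach().flatten(), dim=0)
        for sg, rg in zip(syn_grads, real_grads)
    )

    # Update only the synthetic examples; gradients that land on the proxy
    # model's parameters are simply not applied here.
    syn_optimizer.zero_grad()
    match_loss.backward()
    syn_optimizer.step()
    return float(match_loss.detach())
```

Trajectory-matching variants such as Cazenavette et al. (2022), cited below, extend this single-step idea to matching longer training trajectories rather than individual gradients.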
Recent work has shown that DD can significantly reduce the computational burden of LLM training while maintaining performance. For example, DD can distill millions of training samples into a few hundred synthetic examples that preserve task-specific knowledge (Cazenavette et al., 2022; Maekawa et al., 2024). When applied to LLMs, DD acts as a critical enabler for KD: it identifies high-impact training examples that reflect the teacher's reasoning processes, thereby guiding the student to learn efficiently without overfitting to redundant data (Sorscher et al., 2022).

The scale of LLMs introduces dual challenges: reliance on unsustainable massive datasets (Hadi et al., 2023) and emergent abilities (e.g., chain-of-thought reasoning (Wei et al., 2022)) requiring precise transfer. These challenges necessitate a dual focus on KD and DD. While KD compresses LLMs by transferring knowledge to smaller models, traditional KD alone cannot address the data efficiency crisis: training newer LLMs on redundant or low-quality data yields diminishing returns (Albalak et al., 2024). DD complements KD by curating compact, high-fidelity datasets (e.g., rare reasoning patterns (Li et al., 2024)), as demonstrated in LIMA, where 1,000 examples achieved teacher-level performance (Zhou et al., 2023). This synergy leverages KD's ability to transfer learned representations and DD's capacity to generate task-specific synthetic data that mirrors the teacher's decision boundaries. Together, they address privacy concerns, computational overhead, and data scarcity, enab