DELIFT: DATA EFFICIENT LANGUAGE MODEL INSTRUCTION FINE-TUNING

Ishika Agarwal1, Krishnateja Killamsetty2, Lucian Popa2
1University of Illinois Urbana-Champaign, 2IBM Research
1ishikaa2@illinois.edu 2krishnateja.k@ibm.com, {lpopa, mdanile}@us.ibm.com

arXiv:2411.04425v3 [cs.CL] 20 Mar 2025

ABSTRACT

Fine-tuning large language models (LLMs) is crucial for task specialization but often becomes resource-intensive due to redundant or uninformative data. Existing data selection methods typically rely either on computationally expensive gradient-based metrics or on static embeddings that fail to adapt dynamically to the model's evolving state, limiting their practical effectiveness. To address this, we propose DELIFT (Data Efficient Language model Instruction Fine-Tuning), which leverages a novel, computationally efficient utility metric inspired by In-Context Learning (ICL). Our ICL-based metric measures the informational value of each data sample by quantifying its effectiveness as an in-context example in improving model predictions for other samples, reflecting its actual contribution relative to the model's current state. Integrated with tailored submodular optimization

1 INTRODUCTION

Large Language Models (LLMs) have become indispensable for solving a variety of natural language processing tasks, ranging from question answering and summarization to complex dialogue and reasoning (Brown et al., 2020; Touvron et al., 2023). Despite their remarkable adaptability, fine-tuning LLMs often requires enormous computational resources and time, especially when a significant portion of the training data is either redundant or uninformative (Gururangan et al., 2020; ...).

Existing data selection methods generally fall under two paradigms: (1) static embedding-based approaches that compute sample similarities without reflecting the model's evolving state (Bukharin & Zhao, 2024; Chen et al., 2024), and (2) gradient-based methods that offer more model-specific feedback but often entail prohibitive computational overhead, especially for large-scale models (Killamsetty et al., 2021b; Xia et al., 2024). Although both paradigms can yield initial benefits, they often fail to account for how a model's knowledge shifts over multiple fine-tuning phases: (1) Instruction Tuning (Mishra et al., 2022; Wei et al., 2022; Longpre et al., 2023), which enhances the model's ability to follow diverse instructions; (2) Task-Specific Fine-Tuning (Gururangan et al., 2020; Cobbe et al., 2021), which focuses on refining domain expertise; and (3) Continual Fine-Tuning.

Thus, a natural question arises: Can we develop a unified, computationally efficient data selection framework that adapts to all stages of fine-tuning and maximizes model performance while minimizing data redundancy?

Figure 1: DELIFT data selection across fine-tuning stages. (a) Instruction Tuning: diverse instructions are selected; redundant samples are pruned. (b) Task-Specific Fine-Tuning: samples that are diverse and mutually informative with benchmark data are prioritized for selection. (c) Continual Fine-Tuning: new samples that are novel are integrated; new samples with overlapping information are pruned.

In this paper, we introduce DELIFT (Data Efficient Language model Instruction Fine-Tuning), a single-stop solution designed to address data selection across all fine-tuning stages within a single framework. DELIFT is grounded in information theory yet uses the practical intuition of in-context examples to assess the 'information gain' of each data sample relative to the current state of a model. Specifically, we propose a new utility metric that captures how effectively one sample improves the model's prediction of another. By combining these pairwise utilities with submodular optimization,

We evaluated DELIFT on various tasks and model scales, consistently observing that it can prune up to 70% of the training data without hurting performance (and often improving it), outperforming existing methods by up to 26% in efficiency and effectiveness.
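To make the pairwise utility concrete, the following is a minimal illustrative sketch, not the paper's implementation: the names `pairwise_utility` and `toy_loss` and the toy data are our own stand-ins, and a real system would score samples with the LLM being fine-tuned (e.g., a length-normalized cross-entropy over the target tokens) rather than the word-overlap proxy used here. The sketch scores a candidate sample j by how much prepending it as an in-context example reduces the loss on another sample i.

```python
def pairwise_utility(loss_fn, sample_i, sample_j):
    """Utility of sample_j for sample_i: the loss reduction on i when j is
    prepended as an in-context example. Positive => j is informative for i."""
    base = loss_fn(prompt=sample_i["x"], target=sample_i["y"])
    with_icl = loss_fn(
        prompt=f"{sample_j['x']}\n{sample_j['y']}\n{sample_i['x']}",
        target=sample_i["y"],
    )
    return base - with_icl

# Toy stand-in for an LLM loss: lower when the prompt shares words with the
# target. A real implementation would use the model's per-token loss instead.
def toy_loss(prompt, target):
    tgt = set(target.split())
    overlap = len(tgt & set(prompt.split()))
    return 1.0 - overlap / max(len(tgt), 1)

data = [
    {"x": "Capital of France?", "y": "Paris"},
    {"x": "Name France's capital city.", "y": "Paris"},
    {"x": "Capital of Japan?", "y": "Tokyo"},
]

# Sample 1 supplies the answer "Paris" in context, so it helps sample 0;
# sample 2 carries no relevant information for it.
u_help = pairwise_utility(toy_loss, data[0], data[1])
u_none = pairwise_utility(toy_loss, data[0], data[2])
```

Under this view, a large positive utility marks sample j as carrying information the model has not yet internalized for sample i, while a utility near zero marks j as redundant for i relative to the model's current state.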
In doing so, we show that careful, utility-driven data selection can be far more effective than sheer data volume, opening the door to

Our primary contributions are as follows.

1. A unified information-theoretic data selection paradigm that leverages pairwise utilities grounded in conditional pointwise mutual information, making it adaptable to instruction tuning, task-specific adaptation, and continual fine-tuning.

2. A single-stop, submodular optimization framework that integrates these utilities to provide diverse, high-value subsets for each fine-tuning stage without incurring prohibitive computation.

3. Extensive empirical validation showing up to 70% data reduction with minimal (and sometimes zero) performance loss across multiple domains, demonstrating substantial gains in both efficacy and efficiency.

The remainder of this paper is organized as follows. Section 2 reviews prior work on data-efficient strategies for fine-tuning LLMs and situates our approach within the literature. Section 3 introduces our information-theoretic utility metric
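As a sketch of the kind of greedy selection a submodular optimization framework relies on (our own illustration; DELIFT's actual objectives, normalization, and tie-breaking may differ), the snippet below maximizes a facility-location objective f(S) = sum_i max_{j in S} U[i][j] built from a pairwise utility matrix, a standard monotone submodular choice for which greedy selection carries a (1 - 1/e) approximation guarantee.

```python
def greedy_facility_location(utility, k):
    """Greedily pick k columns of a pairwise utility matrix, maximizing the
    facility-location objective f(S) = sum_i max_{j in S} utility[i][j]."""
    n = len(utility)
    selected = []
    best_cover = [0.0] * n  # max utility each sample currently receives from S
    for _ in range(k):
        best_j, best_gain = None, float("-inf")
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j raises each sample's coverage.
            gain = sum(max(utility[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        best_cover = [max(best_cover[i], utility[i][best_j]) for i in range(n)]
    return selected

# Samples 0 and 1 are near-duplicates; sample 2 is distinct. After picking
# one of {0, 1}, the marginal gain of the other collapses, so the greedy
# step prefers the diverse sample 2.
U = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
subset = greedy_facility_location(U, 2)
```

The diminishing-returns property is what prunes redundancy here: once a sample is covered by the selected set, near-duplicates contribute almost no marginal gain and are skipped.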