DELIFT: DATA EFFICIENT LANGUAGE MODEL INSTRUCTION FINE-TUNING

Ishika Agarwal1, Krishnateja Killamsetty2, Lucian Popa2
1University of Illinois Urbana-Champaign, 2IBM Research
1ishikaa2@illinois.edu 2krishnateja.k@ibm.com, {lpopa, mdanile}@us.ibm.com

arXiv:2411.04425v3 [cs.CL] 20 Mar 2025

ABSTRACT

Fine-tuning large language models (LLMs) is crucial for task specialization but often becomes resource-intensive due to redundant or uninformative data. Existing data selection methods typically rely either on computationally expensive gradient-based metrics or on static embeddings that fail to adapt dynamically to the model's evolving state, limiting their practical effectiveness. To address this, we propose DELIFT (Data Efficient Language model Instruction Fine-Tuning), which leverages a novel, computationally efficient utility metric inspired by In-Context Learning (ICL). Our ICL-based metric measures the informational value of each data sample by quantifying its effectiveness as an in-context example in improving model predictions for other samples, reflecting its actual contribution relative to the model's current state. Integrated with tailored submodular optimization

1 INTRODUCTION

Large Language Models (LLMs) have become indispensable for solving a variety of natural language processing tasks, ranging from question answering and summarization to complex dialogue and reasoning (Brown et al., 2020; Touvron et al., 2023). Despite their remarkable adaptability, fine-tuning LLMs often requires enormous computational resources and time, especially when a significant portion of the training data is either redundant or uninformative (Gururangan et al., 2020; ...).

Existing data selection methods generally fall under two paradigms: (1) static embedding-based approaches that compute sample similarities without reflecting the model's evolving state (Bukharin & Zhao, 2024; Chen et al., 2024), and (2) gradient-based methods that offer more model-specific feedback but often entail prohibitive computational overhead, especially for large-scale models (Killamsetty et al., 2021b; Xia et al., 2024). Although both paradigms can yield initial benefits, they often fail to account for how a model's knowledge shifts over multiple fine-tuning phases: (1) Instruction Tuning (Mishra et al., 2022; Wei et al., 2022; Longpre et al., 2023), which enhances the model's ability to follow diverse instructions; (2) Task-Specific Fine-Tuning (Gururangan et al., 2020; Cobbe et al., 2021), which focuses on refining domain expertise; and (3) Continual Fine-Tuning.

Thus, a natural question arises: Can we develop a unified, computationally efficient data selection framework that adapts to all stages of fine-tuning and maximizes model performance while minimizing data redundancy?

Figure 1: DELIFT data selection across fine-tuning stages. (a) Instruction Tuning: diverse instructions are selected; redundant samples are pruned. (b) Task-Specific Fine-Tuning: samples that are diverse and mutually informative with benchmark data are prioritized for selection. (c) Continual Fine-Tuning: new samples that are novel are integrated; new samples with overlapping information are pruned.

In this paper, we introduce DELIFT (Data Efficient Language model Instruction Fine-Tuning), a single-stop solution designed to address data selection across all fine-tuning stages within a single framework. DELIFT is grounded in information theory yet uses the practical intuition of in-context examples to assess the 'information gain' of each data sample relative to the current state of a model. Specifically, we propose a new utility metric that captures how effectively one sample improves the model's prediction of another. By combining these pairwise utilities with submodular optimization,

We evaluated DELIFT on various tasks and model scales, consistently observing that it can prune up to 70% of the training data without hurting performance (and often improving it), outperforming existing methods by up to 26% in efficiency and effectiveness.
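To make the pairwise utility concrete, the following is a minimal illustrative sketch, not the paper's implementation: the names `pairwise_utility` and `toy_loss` and the toy data are our own stand-ins, and a real system would score samples with the LLM being fine-tuned (e.g., a length-normalized cross-entropy over the target tokens) rather than the word-overlap proxy used here. The sketch scores a candidate sample j by how much prepending it as an in-context example reduces the loss on another sample i.

```python
def pairwise_utility(loss_fn, sample_i, sample_j):
    """Utility of sample_j for sample_i: the loss reduction on i when j is
    prepended as an in-context example. Positive => j is informative for i."""
    base = loss_fn(prompt=sample_i["x"], target=sample_i["y"])
    with_icl = loss_fn(
        prompt=f"{sample_j['x']}\n{sample_j['y']}\n{sample_i['x']}",
        target=sample_i["y"],
    )
    return base - with_icl

# Toy stand-in for an LLM loss: lower when the prompt shares words with the
# target. A real implementation would use the model's per-token loss instead.
def toy_loss(prompt, target):
    tgt = set(target.split())
    overlap = len(tgt & set(prompt.split()))
    return 1.0 - overlap / max(len(tgt), 1)

data = [
    {"x": "Capital of France?", "y": "Paris"},
    {"x": "Name France's capital city.", "y": "Paris"},
    {"x": "Capital of Japan?", "y": "Tokyo"},
]

# Sample 1 supplies the answer "Paris" in context, so it helps sample 0;
# sample 2 carries no relevant information for it.
u_help = pairwise_utility(toy_loss, data[0], data[1])
u_none = pairwise_utility(toy_loss, data[0], data[2])
```

Under this view, a large positive utility marks sample j as carrying information the model has not yet internalized for sample i, while a utility near zero marks j as redundant for i relative to the model's current state.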
In doing so, we show that careful, utility-driven data selection can be far more effective than sheer data volume, opening the door to

Our primary contributions are as follows.

1. A unified information-theoretic data selection paradigm that leverages pairwise utilities grounded in conditional pointwise mutual information, making it adaptable to instruction tuning, task-specific adaptation, and continual fine-tuning.

2. A single-stop, submodular optimization framework that integrates these utilities to provide diverse, high-value subsets for each fine-tuning stage without incurring prohibitive computation.

3. Extensive empirical validation showing up to 70% data reduction with minimal (and sometimes zero) performance loss across multiple domains, demonstrating substantial gains in both efficacy and efficiency.

The remainder of this paper is organized as follows. Section 2 reviews prior work on data-efficient strategies for fine-tuning LLMs and situates our approach within the literature. Section 3 introduces our information-theoretic utility metric
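As a sketch of the kind of greedy selection a submodular optimization framework relies on (our own illustration; DELIFT's actual objectives, normalization, and tie-breaking may differ), the snippet below maximizes a facility-location objective f(S) = sum_i max_{j in S} U[i][j] built from a pairwise utility matrix, a standard monotone submodular choice for which greedy selection carries a (1 - 1/e) approximation guarantee.

```python
def greedy_facility_location(utility, k):
    """Greedily pick k columns of a pairwise utility matrix, maximizing the
    facility-location objective f(S) = sum_i max_{j in S} utility[i][j]."""
    n = len(utility)
    selected = []
    best_cover = [0.0] * n  # max utility each sample currently receives from S
    for _ in range(k):
        best_j, best_gain = None, float("-inf")
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j raises each sample's coverage.
            gain = sum(max(utility[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        best_cover = [max(best_cover[i], utility[i][best_j]) for i in range(n)]
    return selected

# Samples 0 and 1 are near-duplicates; sample 2 is distinct. After picking
# one of {0, 1}, the marginal gain of the other collapses, so the greedy
# step prefers the diverse sample 2.
U = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
subset = greedy_facility_location(U, 2)
```

The diminishing-returns property is what prunes redundancy here: once a sample is covered by the selected set, near-duplicates contribute almost no marginal gain and are skipped.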