
LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora


Federal Reserve Board, Washington, D.C.
ISSN 1936-2854 (Print)

LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora*

Viviana Luccioli, Rithika Iyengar, Ryan Panley, Flora Haberkorn, Xiaoyu Ge, Leland Crane, Nitish Sinha, Seung Jung Lee

2025-108

Please cite this paper as:
Luccioli, Viviana, Rithika Iyengar, Ryan Panley, Flora Haberkorn, Xiaoyu Ge, Leland Crane, Nitish Sinha, and Seung Jung Lee (2025). “LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora*,” Finance and Economics Discussion Series 2025-108. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2025.108.

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors.

Abstract

Large Language Models (LLMs) are highly accurate in classification tasks; however, substantial computational and financial costs hinder their large-scale deployment in dynamic environments. Knowledge Distillation (KD), where an LLM “teacher” trains a smaller and more efficient “student” model, offers a promising solution to this problem. However, the distillation process itself often remains costly for large datasets, since it requires the teacher to label a vast number of samples while incurring significant token consumption. To alleviate this challenge, in this work we explore active learning (AL) as a way to create efficient student models at a fraction of the cost while preserving the LLM’s performance. In particular, we introduce M-RARU (Multi-class Randomized Accept/Reject Uncertainty Sampling), a novel AL algorithm that selects the most informative samples for the LLM teacher to label.

JEL classification: C38, C45, C55

Keywords: Knowledge Distillation, Large Language Models (LLM), Active Learning, Uncertainty Sampling, Multi-Class Randomized Accept/Reject Uncertainty Sampling (M-RARU), Text Classification

1 Introduction

With the unceasing expansion of unstructured text in the modern data landscape, text classification has become a central tool for extracting insights at scale. In the financial sector, for instance, this capability is especially critical for a diverse array of tasks, ranging from analyzing market trends in news reports and corporate filings to assessing credit risk and ensuring regulatory compliance.

Consider, for example, the task of classifying news articles based on their implications for GDP trends, as illustrated in Figure 1. Financial institutions must process thousands of such articles daily to inform investment decisions and economic forecasts. While an LLM can achieve high accuracy in determining whether an article suggests GDP is 'falling,' 'rising,' or 'staying flat,' the computational cost of processing this volume of text at the required speed is prohibitive.
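To make the setup concrete, the sketch below shows what the LLM labeling step in this example might look like. The prompt wording, the label set handling, and the `call_llm` wrapper are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of the LLM "teacher" labeling step for the GDP example.
# `call_llm` is a hypothetical stand-in for any chat-completion client; the
# prompt wording and fallback behavior are assumptions, not the paper's setup.

LABELS = ["falling", "rising", "staying flat"]

PROMPT_TEMPLATE = (
    "Classify the following news article by its implication for GDP trends. "
    "Answer with exactly one of: {labels}.\n\nArticle:\n{article}"
)


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; plug in a real client here."""
    raise NotImplementedError


def teacher_label(article: str) -> str:
    """Ask the LLM teacher for one of the three GDP labels."""
    prompt = PROMPT_TEMPLATE.format(labels=", ".join(LABELS), article=article)
    answer = call_llm(prompt).strip().lower()
    # Guard against answers outside the label set; a sketch-level fallback.
    return answer if answer in LABELS else "staying flat"
```

Running every incoming article through such a call is exactly the cost problem described above: each document consumes teacher tokens and latency.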
To address this problem, two primary categories of models have been widely adopted: large-scale transformer models and traditional machine learning algorithms. Transformer architectures, first introduced in (3) and popularized by Large Language Models (LLMs) like GPT, Claude, and Gemini, represent the state of the art in performance (4). By leveraging complex self-attention mechanisms and deep semantic embeddings, they achieve a nuanced understanding of language that often translates to superior classification accuracy. However, this power comes at a steep computational cost. Traditional machine learning algorithms, by contrast, are far more efficient, offering rapid training and classification at a fraction of the cost. More importantly, their decisions are far more interpretable, a critical feature in domains where justifying a model’s reasoning is paramount. Yet, these models typically require domain-specific supervision in the form of labeled training data.

A promising approach to bridge this gap is Knowledge Distillation (KD), a technique where a large, high-performing “teacher” model (the LLM) is used to train a smaller, more efficient “student” model (the traditional ML algorithm) (8; 9; 10). The goal is to transfer the teacher’s sophisticated “knowledge” to the student, thereby combining the high accuracy of an LLM with the efficiency and interpretability of a classical algorithm. However, a major bottleneck persists: the teacher must still label a vast number of samples to train the student, incurring substantial token costs on large corpora.

Fortunately, this challenge of minimizing labeling costs by selecting only the most valuable data points is precisely the problem addressed by the field of active learning (AL) (11). The core idea of active learning is to allow a machine learning algorithm to intelligently choose the data it learns from, querying the labeler only for the most informative unlabeled samples. By focusing the labeling effort on the instances where it is most needed, AL has the potential to achieve high accuracy with only a fraction of the labeled data.

In this paper, we propose M-RARU (Multi-class Randomized Accept/Reject Uncertainty Sampling), a novel AL algorithm that selects the most informative samples for the LLM teacher to label, allowing an efficient student classifier to be distilled at a fraction of the labeling cost.
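To illustrate the active knowledge distillation loop described above, the sketch below pairs a lightweight scikit-learn student with a generic least-confidence acquisition rule as a stand-in for M-RARU (the algorithm itself is introduced later in the paper). The `teacher_label` function is the hypothetical LLM wrapper from the earlier sketch; the seed/batch sizes and the TF-IDF plus logistic-regression student are illustrative choices, not the paper's configuration.

```python
# Illustrative active-distillation loop: a TF-IDF + logistic-regression
# "student" is retrained as the LLM "teacher" labels only the pool samples
# the student is least confident about. Least-confidence sampling is used
# here as a generic stand-in for M-RARU; sizes and models are assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def active_distillation(pool, teacher_label, seed_size=50, batch_size=25, rounds=10):
    rng = np.random.default_rng(0)
    unlabeled = list(pool)

    # Seed round: label a small random batch so the student can be fit at all
    # (assumes the seed batch contains at least two distinct labels).
    seed_idx = set(rng.choice(len(unlabeled), size=seed_size, replace=False).tolist())
    labeled = [(unlabeled[i], teacher_label(unlabeled[i])) for i in seed_idx]
    unlabeled = [t for i, t in enumerate(unlabeled) if i not in seed_idx]

    student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    for _ in range(rounds):
        texts, labels = zip(*labeled)
        student.fit(list(texts), list(labels))
        if not unlabeled:
            break
        # Least-confidence acquisition: lowest maximum class probability first.
        max_proba = student.predict_proba(unlabeled).max(axis=1)
        query_idx = np.argsort(max_proba)[:batch_size]
        for i in sorted(query_idx.tolist(), reverse=True):
            text = unlabeled.pop(i)                       # remove from the pool
            labeled.append((text, teacher_label(text)))   # query the LLM teacher
    return student
```

In this sketch, only the acquisition rule would change when swapping in a different sampling strategy such as M-RARU; the teacher queries and the student retraining loop stay the same, which is what keeps the number of LLM calls, and hence the token cost, bounded.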