Mohamad Chehade*1, Soumya Suvra Ghosal*2, Souradip Chakraborty2, Avinash Reddy3, Dinesh Manocha2, Hao Zhu1, Amrit Singh Bedi3

Abstract

Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision-making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds (Simon, 1956). To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving suboptimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset, with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for the helpfulness reward while adhering to the threshold on harmlessness.

1. Introduction

Aligning large language models (LLMs) with human preferences is crucial for their safe, helpful, and effective deployment. However, these preferences are inherently multifaceted, requiring consideration of multiple attributes like safety, helpfulness, truthfulness, and conciseness, often captured by multiple reward models (Bai et al., 2022b; Dai et al.; Maas et al., 2011). Prior alignment literature (Jang et al., 2023; Shi et al., 2024) has approached the problem through the lens of multi-objective optimization, ignoring the special structure underlying human decision making and preferences. Additionally, this traditional formulation faces significant challenges: determining appropriate weights for different, often conflicting, objectives is difficult, and, more fundamentally, it assumes that all preference dimensions should be maximized simultaneously.

Beyond traditional alignment. Drawing inspiration from theories of human decision-making, particularly the satisficing principles from bounded rationality (Simon, 1956), we propose an alternate perspective on multi-faceted LLM alignment. Satisficing theory suggests that humans often do not seek to maximize every objective; instead, they adopt strategies that prioritize optimizing key goals while ensuring other important objectives simply meet acceptable thresholds. For example, one might prioritize finding a good-enough solution quickly rather than searching exhaustively for the absolute best. We argue that this principle translates naturally to LLM alignment: it is often sufficient to maximize a primary objective, such as helpfulness or relevance, while ensuring other attributes, like harmlessness, bias, or verbosity, stay within acceptable thresholds (cf. Section 3.1).

Satisficing alignment. The 'satisficing alignment' paradigm represents a significant departure from traditional multi-objective maximization approaches. While the latter aims for optimal trade-offs across all objective dimensions simultaneously, often via a single scalar objective obtained by weighted combination (Shi et al., 2024; Son et al., 2025), a satisficing approach explicitly acknowledges that some objectives behave as constraints to be met rather than as targets to be continuously improved. Existing alignment research has largely overlooked this satisficing perspective, focusing predominantly on methods that combine or maximize multiple reward signals without explicitly incorporating threshold-based constraints derived from human decision-making principles. Implementing such a flexible, threshold-based alignment strategy through traditional fine-tuning is challenging, especially since acceptable thresholds can vary widely across users, contexts, and tasks. This necessitates an inference-time approach.
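To make the contrast with weighted-combination approaches concrete, the display below is a minimal sketch of how a satisficing alignment objective can be written. Here $r_1$ is the primary reward (e.g., helpfulness), $r_2,\dots,r_m$ are secondary rewards (e.g., harmlessness) with user-chosen thresholds $b_2,\dots,b_m$, and the KL term toward a reference policy $\pi_{\text{ref}}$ with weight $\beta$ is an illustrative assumption borrowed from standard RLHF-style objectives; the precise formulation used by SITAlign is the one given in Section 3.1.

% Satisficing sketch: maximize the primary reward subject to
% thresholds on the secondary rewards (notation illustrative).
\begin{align}
\max_{\pi} \quad & \mathbb{E}_{y \sim \pi(\cdot\mid x)}\big[r_1(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big) \\
\text{s.t.} \quad & \mathbb{E}_{y \sim \pi(\cdot\mid x)}\big[r_i(x,y)\big] \;\ge\; b_i, \qquad i = 2,\dots,m,
\end{align}

in contrast to maximizing a fixed weighted sum $\sum_i w_i\, r_i(x,y)$. The thresholds $b_i$ are directly interpretable and can be changed per user or per task without retraining, which is what motivates handling them at inference time rather than through fine-tuning.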
…demand. This directed attention toward supervised learning methods for fine-tuning. For example, (Rafailov et al., 2024a) use the Bradley-Terry model (Bradley & Terry, 1952) to parametrize the reward model, consequently converting alignment into a classification problem. Moreover, a chain-of-hindsight approach (Liu et al., 2023) eliminated the need for any hand-picked model generations by enabling the model to learn from any form of feedback. (Faiz et al., 2023) use a ranking loss to align the model probabilities of responses, while (Dong et al., 2023) suggest supervised fine-tuning on the highest-reward samples. The self-play tuning of (Chen et al., 2024) even removes the necessity for any human-annotated dataset. The authors in (Huang et al., 2024b) have proposed a constrained RLHF version…