Multimodal Emotion Recognition and Large Language Models

2026-05-20 Hongrui Zhang, Daiqing Wu, Yangyang Li, Kuien Liu, Yuhui Wang, Yu Zhou, Sicheng Zhao

Summary

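Core view: Research on multimodal emotion recognition (MER) is shifting from small-scale, task-specific models to large language models (LLMs), forming a new "MER-with-LLMs" paradigm. The paradigm offers strong generality and flexible reasoning, but it also faces three challenges: scarce affective data, an affective gap across modalities, and opaque model explanations.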

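Key challenges:

Affective data scarcity: Emotion-annotated data is costly to obtain and limited in scale, which constrains model performance.
Multimodal affective gap: Emotional features in different modalities (such as text, speech, and images) are heterogeneous and semantically divergent, making it hard for models to fuse information effectively.
Opaque emotional explanation: Existing methods focus on recognizing emotions and offer little explanation of the model's decision process, which undermines user trust and hinders model improvement.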

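Research taxonomy and methods:

Affective data augmentation:
- Training-free sample configuration: carefully designing input samples to inject human prior knowledge, e.g., EmoDETective, ExpLLM.
- Affective data annotation: building domain-specific emotion datasets, e.g., DEEMO, EmoCause, MESC.
Multimodal affective representation:
- Affective representation optimization: improving the encoding process to extract richer emotional features, e.g., EmoVIT, VEC-CoT.
- Multimodal affective coordination: strengthening cross-modal interaction through fusion mechanisms such as attention and Q-former.
Multimodal affective reasoning:
- Affective explanation and hallucination: improving model explainability and reliability, e.g., Facial-R1, ERV, PEP-MEK.
- Subjective affective reasoning: handling the subjectivity of emotion perception, e.g., Agent-MER, EmoCaliber.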
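
To make the attention-based fusion mentioned in the taxonomy concrete, here is a minimal sketch of scaled dot-product cross-attention, in which tokens from one modality attend over features from another. It is illustrative only and does not reproduce any of the cited methods; the function names (`softmax`, `cross_attend`) and the toy feature values are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product cross-attention (hypothetical helper).

    queries: tokens from one modality (e.g. text), each a list of floats.
    keys, values: features from another modality (e.g. image patches).
    Returns one fused vector per query, plus the attention weights.
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Weighted average of the value vectors.
        fused.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(out_dim)])
        weights.append(w)
    return fused, weights


# Toy, made-up features: 2 text tokens attending over 3 image patches.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
fused, weights = cross_attend(text_tokens, image_patches, image_patches)
```

Each row of `weights` sums to 1, and each text token places most of its weight on the image patch it aligns with. Real systems learn projection matrices for the queries, keys, and values; this sketch omits them to keep the mechanism visible.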

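Conclusions:

The MER-with-LLMs paradigm shows strong potential, but it still faces many challenges.
Research methods are increasingly diverse, yet zero-shot generalization, long-tail recognition, and cross-cultural transfer still have room for improvement.
Future research directions include unified and general MER frameworks, mechanism-level exploration, better frameworks for subjectivity, embodied emotion understanding, and ethical issues such as safety, bias, and cultural adaptability.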

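Open research directions:

Unified and general MER models.
Mechanism-level exploration of MER.
Subjectivity-inclusive frameworks.
Embodied emotion understanding.
Safety, bias, and cultural adaptability in MER.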

Hongrui Zhang¹,²∗, Daiqing Wu¹∗, Yangyang Li³, Kuien Liu³, Yuhui Wang³, Yu Zhou⁴, Sicheng Zhao¹†
¹ Department of Psychological and Cognitive Sciences, Tsinghua University, Beijing, China
² School of Computer Science and Technology, Harbin Institute of Technology, Weihai, China
³ Academy of Cyber, Beijing, China
⁴ College of Computer Science, Nankai University, Tianjin, China
smilingweeping@gmail.com, schzhao@tsinghua.edu.cn

arXiv:2605.21239v1 [cs.MM] 20 May 2026

Abstract

Multimodal Emotion Recognition (MER) focuses on identifying and interpreting emotions from modality-compound inputs. Closely mirroring human cognitive processes in real-world environments, MER has drawn substantial attention from both academia and industry. Recently, a paradigm shift has been unveiled in MER, from leveraging small-scale, task-specific models to Large Language Models (LLMs). We refer to the latter as the MER-with-LLMs paradigm, which offers unprecedented generality, spurring numerous empirical attempts, even alongside speculation about LLMs’ potential to achieve general emotional intelligence. However, with these new opportunities come new challenges, including the scarcity of emotionally annotated data, the affective gap both within and across modalities, and the opacity of [...]

[Figure panel labels: Affective Data Scarcity, Multimodal Affective Gap]

1 Background and Challenges

Emotion is an integral component of human daily experiences, significantly influencing individuals’ communication, decision-making, and behavior. In real-world settings, emotions are expressed and perceived through multiple modalities, including language, speech, facial expressions, and gestures [Zhao et al., 2023; Wang et al., 2026b]. As a result, [...] humans do. To achieve this, traditional research primarily relies on small-scale, task-specific models [Yang et al., 2025b; Li et al., 2025c]. Despite achieving satisfactory performance, they suffer from strong dependencies on predefined input domains and output spaces, which make them less effective in practical applications that require dynamism and flexibility.

The emerging Large Language Models (LLMs) present a promising avenue for addressing these limitations [Shou et al., 2025]. Large-scale generative pre-training equips LLMs with strong instruction-following capabilities, which are inherited by Multimodal LLMs (MLLMs), giving rise to a new MER-with-LLMs paradigm. By formulating MER as an LLM-centric autoregressive process, this paradigm enables unified handling of inputs spanning multiple domains and modalities, as well as producing diverse instruction-conditioned outputs. It even deepens the MER task itself, extending classification toward explanation [Lian et al., 2024a]. Collectively, these fundamental advantages have driven a paradigm shift, reflected by a rapidly growing body of recent work on MER-with-LLMs.

Alongside this trend, new challenges have also arrived. A line of studies [Lian et al., 2024b; Wu et al., 2025a] has reached a consensus that, under zero-shot inference, MLLMs often fail to achieve proficiency on MER comparable to that observed in other multimodal tasks. This deficiency is inherently rooted in the complexity of MER, which entails capturing high-level affective cues from multimodal inputs, modeling interactions among heterogeneous modalities, and integrating them to derive emotional conclusions. Following a progressive order, we categorize the currently most pressing [...]

Affective Data Scarcity. Data constitute the foundation of modern models. The underperformance of general MLLMs on MER underscores the importance of emotional data, making targeted optimization on such data a direct and intuitive remedy. However, the subjectivity of emotion perception often necessitates collaboration from multiple annotators for reliable annotation, substantially increasing labeling costs and constraining the scale of high-quality datasets. Moreover, datasets collected in different contexts are typically anno[...]

Multimodal Affective Gap. The “affective gap” generally refers to the intra-modality misalignment between emotional and factual features [Zhao et al., 2021b], which challenges MLLMs in capturing meaningful affective cues. We extend this concept to cover an inter-modality setting, where it characterizes the heterogeneity and semantic discrepancy of emotional features across different modalities. Heterogeneity reflects inconsistencies in information density and expression patterns; for instance, text tends to be more compact and abstract, whereas images are more dispersed and concrete. Se[...]

[...] the MER-with-LLMs paradigm naturally supports step-by-step reasoning or explicit natural language explanations for emotional decisions, thereby giving rise to a new task: explainable MER. In this context, existing approaches aim to enhance model transparency and reliability, for example, by [...]

[...]gies with divergent emphases have been proposed, significantly advancing this paradigm, as illustrated in Fig. 1(b). However, their optimization pathways remain intricate and dive[...]
