Hongrui Zhang1,2∗, Daiqing Wu1∗, Yangyang Li3, Kuien Liu3, Yuhui Wang3, Yu Zhou4, Sicheng Zhao1†
1Department of Psychological and Cognitive Sciences, Tsinghua University, Beijing, China
2School of Computer Science and Technology, Harbin Institute of Technology, Weihai, China
3Academy of Cyber, Beijing, China
4College of Computer Science, Nankai University, Tianjin, China
smilingweeping@gmail.com, schzhao@tsinghua.edu.cn

Abstract

Multimodal Emotion Recognition (MER) focuses on identifying and interpreting emotions from modality-compound inputs. Closely mirroring human cognitive processes in real-world environments, MER has drawn substantial attention from both academia and industry. Recently, a paradigm shift has been unveiled in MER, from leveraging small-scale, task-specific models to Large Language Models (LLMs). We refer to the latter as the MER-with-LLMs paradigm, which offers unprecedented generality, spurring numerous empirical attempts, even alongside speculation about LLMs’ potential to achieve general emotional intelligence. However, with these new opportunities come new challenges, including the scarcity of emotionally annotated data, the affective gap both within and across modalities, and the opacity of

mans do. To achieve this, traditional research primarily relies on small-scale, task-specific models [Yang et al., 2025b; Li et al., 2025c]. Despite achieving satisfactory performance, they suffer from strong dependencies on predefined input domains and output spaces, which make them less effective in practical applications that require dynamism and flexibility.

The emerging Large Language Models (LLMs) present a promising avenue for addressing these limitations [Shou et al., 2025]. Large-scale generative pre-training equips LLMs with strong instruction-following capabilities, which are inherited by Multimodal LLMs (MLLMs), giving rise to a new MER-with-LLMs paradigm.
By formulating MER as an LLM-centric autoregressive process, this paradigm enables unified handling of inputs spanning multiple domains and modalities, as well as producing diverse instruction-conditioned outputs. It even deepens the MER task itself, extending classification toward explanation [Lian et al., 2024a]. Collectively, these fundamental advantages have driven a paradigm shift, reflected by a rapidly growing body of recent work on MER-with-LLMs.

arXiv:2605.21239v1 [cs.MM] 20 May 2026

1 Background and Challenges

Emotion is an integral component of human daily experiences, significantly influencing individuals’ communication, decision-making, and behavior. In real-world settings, emotions are expressed and perceived through multiple modalities, including language, speech, facial expressions, and gestures [Zhao et al., 2023; Wang et al., 2026b]. As a result,

Alongside this trend, new challenges have also arrived. A line of studies [Lian et al., 2024b; Wu et al., 2025a] has reached a consensus that, under zero-shot inference, MLLMs often fail to achieve proficiency on MER comparable to that observed in other multimodal tasks. This deficiency is inherently rooted in the complexity of MER, which entails capturing high-level affective cues from multimodal inputs, modeling interactions among heterogeneous modalities, and integrating them to derive emotional conclusions. Following a progressive order, we categorize the currently most pressing

Affective Data Scarcity. Data constitute the foundation of modern models. The underperformance of general MLLMs on MER underscores the importance of emotional data, making targeted optimization on such data a direct and intuitive remedy.
However, the subjectivity of emotion perception often necessitates collaboration from multiple annotators for reliable annotation, substantially increasing labeling costs and constraining the scale of high-quality datasets. Moreover, datasets collected in different contexts are typically anno-

Multimodal Affective Gap. The “affective gap” generally refers to the intra-modality misalignment between emotional and factual features [Zhao et al., 2021b], which challenges MLLMs in capturing meaningful affective cues. We extend this concept to cover an inter-modality setting, where it characterizes the heterogeneity and semantic discrepancy of emotional features across different modalities. Heterogeneity reflects inconsistencies in information density and expression patterns; for instance, text tends to be more compact and abstract, whereas images are more dispersed and concrete. Se-

the MER-with-LLMs paradigm naturally supports step-by-step reasoning or explicit natural language explanations for emotional decisions, thereby giving rise to a new task: explainable MER. In this context, existing approaches aim to enhance model transparency and reliability, for example, by

gies with divergent emphases have been proposed, significantly advancing this paradigm, as illustrated in Fig. 1(b). However, their optimization pathways remain intricate and dive