
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking


Mingxin Li∗, Yanzhao Zhang∗, Dingkun Long∗, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin

Tongyi Lab, Alibaba Group

https://huggingface.co/collections/Qwen
https://modelscope.cn/organization/qwen
https://github.com/QwenLM/Qwen3-VL-Embedding

Abstract

In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.

1 Introduction

The exponential growth of multimodal content on the internet has fundamentally transformed how information is created, shared, and consumed. Modern digital ecosystems are increasingly populated with diverse data modalities, including natural images, text documents, infographics, screenshots, and videos. This proliferation necessitates advanced retrieval systems capable of understanding and matching semantic concepts across different modalities, moving beyond traditional text-only search paradigms. Multimodal search, which aims to retrieve relevant content regardless of the query or document modality, has emerged as a critical capability for applications ranging from e-commerce product discovery to scientific literature exploration and social media navigation (Faysse et al., 2025; Fu et al., 2025).

Within contemporary multimodal retrieval architectures, embedding and reranking models constitute the two most critical modules. The field of multimodal representation learning has witnessed significant progress over the past decade (Manzoor et al., 2023; Mei et al., 2025). Among these pioneering works, CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021) has been particularly influential by demonstrating that large-scale contrastive learning on image-text pairs can produce powerful aligned representations. Its success has cemented the importance of learning shared embedding spaces where semantically similar content is positioned proximate in the representation space regardless of its modality.
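To make the contrastive objective behind CLIP-style alignment concrete, the sketch below implements a symmetric InfoNCE loss over a batch of paired image and text embeddings. It is a minimal illustration of the general technique, not the training objective of Qwen3-VL-Embedding; the function name, temperature value, and normalization choices are assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch_size, dim) tensors, where row i of each
    tensor comes from the same image-text pair.
    """
    # Project both modalities onto the unit sphere so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions (image-to-text and text-to-image) and average.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```

At retrieval time, the same normalized embeddings are compared with a dot product (cosine similarity), so nearest-neighbor search in the shared space surfaces semantically related items regardless of modality.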
As the development of foundation models accelerates, multimodal pre-trained vision-language models (VLMs) such as Qwen-VL (Wang et al., 2024b; Bai et al., 2025) and GPT-4o (Hurst et al., 2024) have achieved unprecedented success in multimodal comprehension. Building on these breakthroughs, the multimodal retrieval community has increasingly explored training unified multimodal embedding models based on VLMs. Notable efforts in this space include E5-V (Jiang et al., 2024), GME (Zhang et al., 2025b), BGE-VL (Zhou et al., 2025), and VLM2Vec (Meng et al., 2025; Jiang et al., 2025). Training unified multimodal representations based on VLMs offers several compelling advantages. First, VLMs possess inherent cross-modal alignment through their pre-training on large-scale image-text datasets. Second, they leverage sophisticated attention mechanisms to capture fine-grained interactions between visual and textual elements. Third, they provide a natural pathway to handling complex multimodal documents such as infographics and presentation slides where visual and textual information are deeply intertwined. Furthermore, VLM-based approaches can inherit the extensive multilingual and multi-domain knowledge encoded in foundation models, enabling more robust generalization across diverse retrieval scenarios.

In this work, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, which are specifically designed for multimodal retrieval applications. Built upon the powerful Qwen3-VL (Bai et al., 2025) foundation model, these models bring together
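The paragraphs above describe how VLM-based embedding models reuse a decoder's cross-modally aligned hidden states as retrieval vectors, and the abstract notes support for Matryoshka Representation Learning. The sketch below shows one common way such an embedding can be derived: pooling the hidden state of the last non-padding token and optionally truncating it to a smaller Matryoshka dimension. The pooling strategy, function signature, and truncation step are illustrative assumptions, not the documented interface of the released models.

```python
import torch
import torch.nn.functional as F

def pool_vlm_embedding(last_hidden_state: torch.Tensor,
                       attention_mask: torch.Tensor,
                       matryoshka_dim: int | None = None) -> torch.Tensor:
    """Turn VLM decoder hidden states into a single retrieval embedding.

    last_hidden_state: (batch, seq_len, hidden) output of the VLM for a
                       multimodal input (text tokens plus vision tokens).
    attention_mask:    (batch, seq_len) with 1 for real tokens, 0 for padding
                       (right-padded sequences assumed).
    """
    # Last-token pooling: take the hidden state of the final non-padding token,
    # a common choice for decoder-only embedding models.
    last_token_idx = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(last_hidden_state.size(0), device=last_hidden_state.device)
    emb = last_hidden_state[batch_idx, last_token_idx]

    # Matryoshka-style truncation: keep only the leading dimensions, then re-normalize,
    # trading a small amount of accuracy for cheaper storage and faster search.
    if matryoshka_dim is not None:
        emb = emb[:, :matryoshka_dim]
    return F.normalize(emb, dim=-1)
```

In a full pipeline of the kind the report describes, candidates would be scored by cosine similarity between such embeddings, and the top hits would then be paired with the query and passed to the cross-encoder reranker for fine-grained relevance estimation.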