
QuarkMed Medical Foundation Model Technical Report

2025-08-19 Alibaba Zhang Yannan Tim

Ao Li1, Bin Yan1, Bingfeng Cai1, Chenxi Li1, Cunzhong Zhao1, Fugen Yao1, Gaoqiang Liu1, Guanjun Jiang1, Jian Xu1, Liang Dong1, Liansheng Sun1, Rongshen Zhang1, Xiaolei Gui1, Xin Liu1, Xin Shang1, Yao Wu1, Yu Cao1, Zhenxin Ma1 and Zhuang Jia1

1Quark Medical Team, Alibaba Group

Recent advancements in large language models have significantly accelerated their adoption in healthcare applications, including AI-powered medical consultations, diagnostic report assistance, and medical search tools. However, medical tasks often demand highly specialized knowledge, professional accuracy, and customization capabilities, necessitating a robust and reliable foundation model. QuarkMed addresses these needs by leveraging curated medical data processing, medical-content Retrieval-Augmented Generation (RAG), and a large-scale, verifiable reinforcement learning pipeline to develop a high-performance medical foundation model. The model achieved 70% accuracy on the Chinese Medical Licensing Examination, demonstrating strong generalization across diverse medical benchmarks. QuarkMed offers a powerful yet versatile personal medical AI solution, already serving millions of users at https://ai.quark.cn.

1. Introduction

The advent of large language models (LLMs) has marked a pivotal moment in artificial intelligence, demonstrating remarkable capabilities in understanding and generating human-like text across a multitude of domains. This progress has catalyzed significant interest in their application to specialized fields, particularly medicine, where they hold the potential to revolutionize medical information retrieval, enhance early diagnostic accuracy, and support personalized healthcare requirements.

However, the medical domain presents unique and formidable challenges [47]. Unlike general-domain text, medical language is characterized by a highly specialized vocabulary, complex clinical concepts, and a nuanced syntax that is often ambiguous and context-dependent.
As a result, general-purpose LLMs, which are typically fine-tuned on broad, non-medical corpora, often lack the deep, specialized knowledge required for high-stakes medical applications [1]. This knowledge gap can lead to unsatisfactory, and at times unsafe, performance when these models are directly applied to medical tasks.

Recognizing these limitations, the research community has shifted towards developing domain-specific foundation models for medicine. This endeavor began with the adaptation of Transformer-based architectures like BERT (Bidirectional Encoder Representations from Transformers). Early pioneering work led to the creation of models such as BioBERT [24], which was pre-trained on large-scale biomedical literature, and ClinicalBERT [18], which was trained on unstructured clinical notes from electronic health records (EHRs). These models demonstrated that domain-specific pre-training significantly improves performance on various biomedical text mining tasks. Following this trend, models like BEHRT were developed to specifically model structured EHR data for predicting clinical events [27].

The success of these earlier models paved the way for the development of generative models tailored for medicine. BioGPT, for instance, was a generative pre-trained transformer that excelled at creating fluent biomedical text and improving performance on downstream tasks [29]. As model scaling became a key driver of performance, the field saw the emergence of significantly larger and more powerful medical LLMs. Models like GatorTron, with billions of parameters trained on massive clinical text datasets, demonstrated the benefits of scale in capturing the long-range dependencies and intricate relationships within clinical narratives [49].
More recently, the landscape has been defined by even larger and more sophisticated models that integrate extensive medical knowledge with robust instruction-following capabilities. Med-PaLM and its successor were among the first to approach expert-level performance on medical licensing examination-style questions, leveraging a combination of improved base models, medical domain fine-tuning, and advanced prompting strategies [1, 39]. Concurrently, the open-source community has produced a variety of powerful medical LLMs. Models like PMC-LLaMA [45], MEDITRON-70B [7], BioMedLM [33], and BioMistral [23] have been developed by pre-training on vast corpora of biomedical literature and clinical data, showing performance competitive with proprietary models. This proliferation of models has been accompanied by the creation of more comprehensive and challenging benchmarks, such as MedExQA [12] and MedS-Bench [46], which evaluate LLMs on more complex, long-form question answering and a wider array of clinical tasks.

Beyond supervised learning, Reinforcement Learning (RL) has emerged as a powerful paradigm for optimizing sequential decision-making, making it a promising approach for healthcare applications such as developing dynamic treatment regimes [19]. Concurren