
Evaluating the Reasoning Capabilities of Large Language Models in Chinese-Language Contexts


Zhenhui (Jack) Jiang*¹, Yi Lu¹, Yifan Wu¹, Haozhe Xu², Zhengyu Wu¹, Jiaxin Li
¹ HKU Business School, The University of Hong Kong, Hong Kong
² School of Management, Xi'an Jiaotong University, P.R. China

Abstract

With the rapid iteration of AI technologies, reasoning capabilities have become a core indicator for measuring the intelligence level of large language models (LLMs) and a focus of research in both academia and industry. This report aims to establish a systematic, objective, and comprehensive evaluation framework to assess AI reasoning capabilities. We compared 36 LLMs on various text-based reasoning tasks in Chinese-language contexts and found that GPT-o3 achieved the highest score in the basic logical reasoning evaluation, while Gemini 2.5 Flash led in contextual reasoning.

Keywords: Large Language Model, LLM, Reasoning Capability, Model Efficiency

INTRODUCTION

Over the past few months, reasoning capabilities have emerged as the new frontier in the global race to advance Large Language Models (LLMs). Following OpenAI's launch of its reasoning models and DeepSeek-R1's rise to national prominence for its reasoning performance, a systematic and objective way to evaluate these capabilities has become an urgent need.

To address this issue, the Artificial Intelligence Evaluation Lab (AIEL) at HKU Business School developed a comprehensive evaluation framework that assesses basic logical inference and contextual reasoning (Figure 1). Building on this framework, the team conducted a large-scale comparative evaluation.

The study included 36 notable LLMs from China and the USA: 14 reasoning models, 20 general-purpose models, and two unified systems. All were tested within a Chinese-language context. The results revealed that Doubao 1.5 Pro (Thinking) achieved the top composite score.

EVALUATION METHODOLOGY

(1) Models for Evaluation

The study evaluated the following LLMs from both China and the USA (Table 1). Due to local deployment constraints, Llama 4 was excluded from this round of evaluation.

(2) Task Categories and Test Set

In this study, the reasoning evaluation questions were divided into two task categories: Basic Logical Reasoning and Contextual Reasoning (Table 2). Together, these two categories make up the full test set.

Test Set: In this evaluation, 90% of the test items were either newly created or extensively adapted, and the remaining 10% were drawn from real examination papers from the 2024 and 2025 China National College Entrance Examination.
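The 90/10 test-set composition above can be sketched as simple arithmetic. Note that the total item count below is a hypothetical assumption; the report does not state how many items the test set contains.

```python
# Minimal sketch of the test-set composition described above, assuming a
# hypothetical total item count (not stated in the report): 90% of items
# newly created or extensively adapted, 10% drawn from the 2024 and 2025
# Gaokao examination papers.
TOTAL_ITEMS = 200                        # hypothetical total
exam_items = round(TOTAL_ITEMS * 0.10)   # drawn from real exam papers
new_items = TOTAL_ITEMS - exam_items     # newly created or adapted
print(exam_items, new_items)             # 20 180
```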
Experts: The evaluation was conducted by a team of 38 postgraduate researchers from China's leading universities. They strictly followed the standardized scoring criteria.

Evaluation Criteria: Each model's reasoning performance was assessed across three core criteria: accuracy, logical coherence, and conciseness (Figure 2).

RESULTS AND ANALYSIS

(1) Basic Logical Inference

As shown in Table 4, GPT-o3 achieved the highest score in basic logic with 97 points, closely followed by Doubao 1.5 Pro (96) and Doubao 1.5 Pro (Thinking) (95). In contrast, models like Llama 3.3 70B (64) and 360 Zhinao 2-o1 (59) displayed notable weaknesses.

(2) Contextual Reasoning

The ranking of contextual reasoning capability is shown in Table 5.

The results revealed that Gemini 2.5 Flash ranked first in contextual reasoning with an overall score of 92, demonstrating no significant weakness in any category. It performed particularly well in common-sense reasoning (98) and discipline-based reasoning.

Grok 3 (Think) ranked fourth with 90, reflecting consistent performance across all evaluated categories. In addition, the GPT, Ernie, DeepSeek, and Hunyuan series, among others, also delivered competitive results.

(3) Composite Ranking Results

As shown in Table 6, the 36 models assessed exhibited a clear performance gradient in the composite rankings. Doubao 1.5 Pro (Thinking) ranked first with a top composite score of 93, demonstrating consistently strong and balanced performance.

GPT-5 (Auto) (91.5 points) followed closely behind. Further analysis revealed that because GPT-5 (Auto) automatically selects between the general-purpose mode and the reasoning mode, it sometimes defaulted to the general-purpose mode.

In general, these results highlight the significant progress and growing competitiveness of China-developed LLMs in reasoning-intensive tasks.

To better illustrate relative performance, the models were organized into a five-tier pyramid based on their composite scores, with higher tiers representing stronger composite reasoning ability (Figure 3).
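The five-tier pyramid amounts to bucketing models by composite score. A minimal sketch of such a rule follows; the cutoff values are hypothetical assumptions, since the report does not publish its exact tier boundaries, while the two composite scores are the ones quoted above.

```python
# Illustrative sketch (not the report's actual method) of assigning models
# to a five-tier pyramid by composite score. Tier 1 is the top tier.
def assign_tier(composite_score: float) -> int:
    """Map a 0-100 composite score to a tier from 1 (strongest) to 5."""
    cutoffs = [90, 85, 80, 70]  # hypothetical tier boundaries
    for tier, cutoff in enumerate(cutoffs, start=1):
        if composite_score >= cutoff:
            return tier
    return 5

# Composite scores quoted in the report
scores = {"Doubao 1.5 Pro (Thinking)": 93, "GPT-5 (Auto)": 91.5}
tiers = {model: assign_tier(s) for model, s in scores.items()}
```

Under these assumed cutoffs, both top models land in Tier 1, matching their position at the apex of the pyramid.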
(4) Analysis of Performance by Model Type

The evaluation shows that the comparative advantage of reasoning models grows with task complexity. For basic logical inference, their performance is only marginally better than that of general-purpose models. However, for contextual reasoning, the gap widens considerably.

This trend is also evident when comparing models from the same developer. Reasoning models consistently outperform their general-purpose counterparts in areas such as contextual reasoning and hallucination control, leading to higher overall scores.

ADDITIONAL ANALYSIS: MODEL EFFICIENCY

In addition to evaluating reasoning performance, the research team conducted an in-depth analysis of model efficiency to assess practical utility in real-world applications. Specifically, the analysis examined how quickly and cost-effectively each model produces its answers.

Due to local deployment constraints or a lack of public API access, Llama 3.3 70B, Grok 3 (Think), Kimi-k1.5, and Step R1-V-Mini were excluded from this analysis because of missing data. Efficiency results for the remaining models are presented in the figures that follow.

Token Efficiency

To benchmark token efficiency, we employed the output-to-input token ratio as a core metric, where a higher ratio indicates lower efficiency. This metric helps normalize comparisons across prompts of different lengths.

Results show that Baichuan 4-Turbo leads with an exceptionally low ratio of 1.86, followed by Llama 3.3 70B (2.49), MiniMax-01 (2.76), and Step 2 (2.78), all of which demonstrated strong token efficiency.
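The token-efficiency metric above can be sketched in a few lines. The ratios are the ones quoted in the report; the example token counts in the first call are hypothetical, and a lower ratio means higher efficiency.

```python
# Sketch of the output-to-input token ratio described above, where a
# HIGHER ratio indicates LOWER token efficiency.
def token_ratio(output_tokens: int, input_tokens: int) -> float:
    """Output tokens generated per input token consumed."""
    return output_tokens / input_tokens

# Example with hypothetical token counts (186 output / 100 input)
r = token_ratio(186, 100)  # 1.86, matching Baichuan 4-Turbo's quoted ratio

# Ratios quoted in the report
ratios = {
    "Baichuan 4-Turbo": 1.86,
    "Llama 3.3 70B": 2.49,
    "MiniMax-01": 2.76,
    "Step 2": 2.78,
}
# Rank models from most to least token-efficient (ascending ratio)
ranking = sorted(ratios, key=ratios.get)
# ranking[0] == "Baichuan 4-Turbo"
```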