
Evaluating the Reasoning Capabilities of Large Language Models in Chinese-Language Contexts


Zhenhui (Jack) Jiang*¹, Yi Lu¹, Yifan Wu¹, Haozhe Xu², Zhengyu Wu¹, Jiaxin Li
¹ HKU Business School, The University of Hong Kong, Hong Kong
² School of Management, Xi'an Jiaotong University, P.R. China

Abstract

With the rapid iteration of AI technologies, reasoning capabilities have become a core indicator for measuring the intelligence level of large language models (LLMs) and a focus of research in both academia and industry. This report aims to establish a systematic, objective, and comprehensive evaluation framework to assess AI reasoning capabilities. We compared 36 LLMs on various text-based reasoning tasks in Chinese-language contexts and found that GPT-o3 achieved the highest score in the basic logical reasoning evaluation, while Gemini 2.5 Flash led in contextual reasoning.

Keywords: Large Language Model, LLM, Reasoning Capability, Model Efficiency

INTRODUCTION

Over the past few months, reasoning capabilities have emerged as the new frontier in the global race to advance Large Language Models (LLMs). Following OpenAI's launch of its reasoning models and DeepSeek-R1's rise to national prominence for its reasoning performance, a systematic and objective way to evaluate these capabilities has become an urgent need.

To address this issue, the Artificial Intelligence Evaluation Lab (AIEL) at HKU Business School developed a comprehensive evaluation framework that assesses basic logical inference and contextual reasoning (Figure 1). Building on this framework, the team conducted a large-scale comparative evaluation.

The study included 36 notable LLMs from China and the USA: 14 reasoning models, 20 general-purpose models, and two unified systems. All were tested within a Chinese-language context. The results revealed that Doubao 1.5 Pro (Thinking) achieved the top composite score.

EVALUATION METHODOLOGY

(1) Models for Evaluation

The study evaluated the following LLMs from both China and the USA (Table 1). Due to local deployment constraints, Llama 4 was excluded from this round of evaluation.

(2) Task Categories and Test Set

In this study, the reasoning evaluation questions were divided into two task categories: Basic Logical Reasoning and Contextual Reasoning (Table 2). Together, these two categories make up the full test set.

Test Set: In this evaluation, 90% of the test items were either newly created or extensively adapted, and the remaining 10% were drawn from real examination papers from the 2024 and 2025 China National College Entrance Examination.
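The 90/10 test-set composition above can be sketched as simple arithmetic. Note that the total item count below is a hypothetical assumption; the report does not state how many items the test set contains.

```python
# Minimal sketch of the test-set composition described above, assuming a
# hypothetical total item count (not stated in the report): 90% of items
# newly created or extensively adapted, 10% drawn from the 2024 and 2025
# Gaokao examination papers.
TOTAL_ITEMS = 200                        # hypothetical total
exam_items = round(TOTAL_ITEMS * 0.10)   # drawn from real exam papers
new_items = TOTAL_ITEMS - exam_items     # newly created or adapted
print(exam_items, new_items)             # 20 180
```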
Experts: The evaluation was conducted by a team of 38 postgraduate researchers from China's leading universities. They strictly followed the standardized scoring criteria.

Evaluation Criteria: Each model's reasoning performance was assessed across three core criteria: accuracy, logical coherence, and conciseness (Figure 2).

RESULTS AND ANALYSIS

(1) Basic Logical Inference

As shown in Table 4, GPT-o3 achieved the highest score in basic logic with 97 points, closely followed by Doubao 1.5 Pro (96) and Doubao 1.5 Pro (Thinking) (95). In contrast, models like Llama 3.3 70B (64) and 360 Zhinao 2-o1 (59) displayed notable weaknesses.

(2) Contextual Reasoning

The ranking of contextual reasoning capability is shown in Table 5.

The results revealed that Gemini 2.5 Flash ranked first in contextual reasoning with an overall score of 92, demonstrating no significant weakness in any category. It performed particularly well in common-sense reasoning (98) and discipline-based reasoning.

Grok 3 (Think) ranked fourth with 90, reflecting consistent performance across all evaluated categories. In addition, the GPT, Ernie, DeepSeek, and Hunyuan series, among others, also delivered competitive results.

(3) Composite Ranking Results

As shown in Table 6, the 36 models assessed exhibited a clear performance gradient in the composite rankings. Doubao 1.5 Pro (Thinking) ranked first with a top composite score of 93, demonstrating consistently strong and balanced performance.

GPT-5 (Auto) (91.5 points) followed closely behind. Further analysis revealed that because GPT-5 (Auto) automatically selects between the general-purpose mode and the reasoning mode, it sometimes defaulted to the general-purpose mode.

In general, these results highlight the significant progress and growing competitiveness of China-developed LLMs in reasoning-intensive tasks.

To better illustrate relative performance, the models were organized into a five-tier pyramid based on their composite scores, with higher tiers representing stronger composite reasoning ability (Figure 3).
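The five-tier pyramid amounts to bucketing models by composite score. A minimal sketch of such a rule follows; the cutoff values are hypothetical assumptions, since the report does not publish its exact tier boundaries, while the two composite scores are the ones quoted above.

```python
# Illustrative sketch (not the report's actual method) of assigning models
# to a five-tier pyramid by composite score. Tier 1 is the top tier.
def assign_tier(composite_score: float) -> int:
    """Map a 0-100 composite score to a tier from 1 (strongest) to 5."""
    cutoffs = [90, 85, 80, 70]  # hypothetical tier boundaries
    for tier, cutoff in enumerate(cutoffs, start=1):
        if composite_score >= cutoff:
            return tier
    return 5

# Composite scores quoted in the report
scores = {"Doubao 1.5 Pro (Thinking)": 93, "GPT-5 (Auto)": 91.5}
tiers = {model: assign_tier(s) for model, s in scores.items()}
```

Under these assumed cutoffs, both top models land in Tier 1, matching their position at the apex of the pyramid.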
(4) Analysis of Performance by Model Type

The evaluation shows that the comparative advantage of reasoning models grows with task complexity. For basic logical inference, their performance is only marginally better than that of general-purpose models. However, for contextual reasoning, the gap widens considerably.

This trend is also evident when comparing models from the same developer. Reasoning models consistently outperform their general-purpose counterparts in areas such as contextual reasoning and hallucination control, leading to higher overall scores.

ADDITIONAL ANALYSIS: MODEL EFFICIENCY

In addition to evaluating reasoning performance, the research team conducted an in-depth analysis of model efficiency to assess practical utility in real-world applications. Specifically, the analysis examined how quickly and cost-effectively each model produces its answers.

Due to local deployment constraints or a lack of public API access, Llama 3.3 70B, Grok 3 (Think), Kimi-k1.5, and Step R1-V-Mini were excluded from this analysis because of missing data. Efficiency results for the remaining models are presented in the figures that follow.

Token Efficiency

To benchmark token efficiency, we employed the output-to-input token ratio as a core metric, where a higher ratio indicates lower efficiency. This metric helps normalize comparisons across prompts of different lengths.

Results show that Baichuan 4-Turbo leads with an exceptionally low ratio of 1.86, followed by Llama 3.3 70B (2.49), MiniMax-01 (2.76), and Step 2 (2.78), all of which demonstrated strong token efficiency.
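The token-efficiency metric above can be sketched in a few lines. The ratios are the ones quoted in the report; the example token counts in the first call are hypothetical, and a lower ratio means higher efficiency.

```python
# Sketch of the output-to-input token ratio described above, where a
# HIGHER ratio indicates LOWER token efficiency.
def token_ratio(output_tokens: int, input_tokens: int) -> float:
    """Output tokens generated per input token consumed."""
    return output_tokens / input_tokens

# Example with hypothetical token counts (186 output / 100 input)
r = token_ratio(186, 100)  # 1.86, matching Baichuan 4-Turbo's quoted ratio

# Ratios quoted in the report
ratios = {
    "Baichuan 4-Turbo": 1.86,
    "Llama 3.3 70B": 2.49,
    "MiniMax-01": 2.76,
    "Step 2": 2.78,
}
# Rank models from most to least token-efficient (ascending ratio)
ranking = sorted(ratios, key=ratios.get)
# ranking[0] == "Baichuan 4-Turbo"
```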