UAV Vision-and-Language Navigation: Progress, Challenges, and a Research Roadmap

Defense and Military Industry 2026-04-15 - Unknown institution, 华仔

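UAV-VLN aims to enable UAVs to understand high-level human instructions and carry out long-horizon tasks in complex 3D environments. The paper surveys the field from its task definition to the current state of the art and establishes a methodological taxonomy that runs from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the integration of generative world models with VLA architectures. The survey systematically reviews the ecosystem of resources needed for standardized research: simulators, datasets, and evaluation metrics. It also critically analyzes the main challenges that hinder real-world deployment: the sim-to-real gap, robust perception in dynamic outdoor environments, reasoning under linguistic ambiguity, and efficient deployment on resource-constrained hardware. By synthesizing current benchmarks and limitations, the survey proposes a forward-looking research roadmap to guide work on key frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.

The UAV-VLN problem is formalized as a Partially Observable Markov Decision Process (POMDP), in which the agent's true state is never fully known and must be inferred from a sequence of incomplete, noisy sensory data. The main methodological categories are modular and early learning approaches, architectures for long-horizon spatiotemporal understanding, and agentic systems driven by foundation models. The survey also outlines the necessary simulators, datasets, and evaluation protocols, and examines the main challenges in depth, focusing on the sim-to-real gap and on robustness, safety, and efficiency. Finally, it outlines a research roadmap toward future frontiers such as multi-agent systems (UAV swarm coordination) and air-ground collaborative robotics.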
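To make the POMDP framing concrete, the following is a minimal sketch of a discrete belief update (a Bayes filter), the textbook way an agent keeps a probability distribution over states it cannot observe directly. It is not the survey's own formulation, which this excerpt does not reproduce. The three-cell corridor, the transition matrix, and the landmark likelihoods are hypothetical values chosen only for illustration.

```python
def belief_update(belief, transition, obs_likelihood):
    """One step of a discrete Bayes filter over a finite state set.

    belief[s]          : current probability of being in state s
    transition[s][s2]  : P(s2 | s, a) for the action a that was just executed
    obs_likelihood[s2] : P(o | s2) for the observation o that was just received
    """
    n = len(belief)
    # Prediction: push the belief through the motion model.
    predicted = [
        sum(belief[s] * transition[s][s2] for s in range(n))
        for s2 in range(n)
    ]
    # Correction: weight each predicted state by how well it explains o.
    unnormalized = [obs_likelihood[s2] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


# Hypothetical example: three waypoints along a corridor.
# "Fly forward" advances one waypoint with probability 0.8, else stays put.
TRANSITION_FORWARD = [
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]
# A noisy detector reports "landmark seen"; the landmark sits at waypoint 1.
LANDMARK_SEEN = [0.1, 0.7, 0.1]

posterior = belief_update([1.0, 0.0, 0.0], TRANSITION_FORWARD, LANDMARK_SEEN)
print([round(p, 3) for p in posterior])  # [0.034, 0.966, 0.0]
```

In this toy case the agent starts certain it is at waypoint 0. After one forward action and one landmark detection, most of the probability mass moves to waypoint 1, but some remains on waypoint 0 because both the motion and the sensor are noisy. Real UAV-VLN agents work with continuous, high-dimensional states, so they approximate this update rather than enumerate states.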

Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng, Siwei Feng, Songsheng Cheng, Ruilong Ren, Hanzhong Guo, Shuai Yuan, Xiangyue Wang, Kangli Wang, and Ji Pei

arXiv:2604.13654v1 [cs.RO] 15 Apr 2026

Abstract: Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically-grounded reasoning. The survey systematically reviews the ecosystem of essential resources: simulators, datasets, and [...]

Index Terms: vision-language navigation, unmanned aerial vehicles, embodied AI, vision-language-action models, world models, sim-to-real transfer

1 Introduction

Enabling an Unmanned Aerial Vehicle (UAV) to navigate a complex, three-dimensional world from a simple human command such as "fly past the collapsed bridge and find people waving from the rooftops" represents a pivotal challenge at the intersection of robotics, computer vision, and natural language understanding. This capability, known as UAV-based Vision-and-Language Navigation (UAV-VLN), is a crucial subfield within the broader pursuit of embodied artificial intelligence, which aims to develop autonomous agents that interpret linguistic instructions and execute long-horizon tasks in the physical world [1], [2], [3]. [...] operations and making sophisticated aerial platforms more accessible [4], [5], [6]. The real-world applications are profound, spanning time-critical search and rescue operations in GPS-denied environments [7], [8], [9], wildfire monitoring [10], automated inspection of large-scale infrastructure [11], [...] The transition towards intelligent, language-driven autonomy has motivated a surge of research [15], necessitating a structured synthesis to guide future progress in this domain of applied spatial intelligence.

The field of UAV-VLN is currently experiencing a paradigm shift, propelled by the confluence of mature aerial platforms and the transformative capabilities of large foundation models [18], [19]. This evolution marks a significant departure from earlier modular pipelines toward integrated Embodied Multimodal Large Models (EMLMs) that unify perception, reasoning, and control into a cohesive framework [20], [21], [22], [23]. Most notably, the latest frontier involves the deep integration of generative world models with Vision-Language-Action (VLA) policies, as seen in models like π0 [24], GR00T N1 [25], and Cosmos-Reason1 [26], which equip agents with physical common sense and predictive capabilities for robust, long-horizon reasoning. While substantial progress has been made in Vision-and-Language Navigation (VLN) for ground-based robots since the seminal work on navigating from photorealistic images [27], [28], the aerial domain introduces a distinct and more complex set of challenges, a gap first systematically addressed by benchmarks like AerialVLN [29]. These challenges, which have historically limited research in outdoor aerial settings [30], include navigating continuous 3D action spaces without predefined graphs, performing [...]

Fig. 1: An overview of the UAV-based Vision-and-Language Navigation (UAV-VLN) research landscape, illustrating the core components and methodological evolution. The process begins with a natural language command that directs an autonomous agent. This figure contrasts the traditional modular pipeline, which separates perception, reasoning, and control, with the modern integrated approach centered on Embodied Multimodal Large Models (EMLMs). The UAV executes the resulting [...]

[...] survey of UAV-based Vision-and-Language Navigation, charting the field from its conceptual foundations to the current state of the art. Our scope encompasses the entire research pipeline, commencing with the formal mathematical definition of the task and tracing the methodological evolution from early learning systems to contemporary agents driven by foundation models [37], [38]. We survey the critical ecosystem of resources, including high-fidelity simulators, benchmark datasets for diverse domains such as agriculture and urban reconnaissance [39], [40], and standardized evaluation metrics that underpin reproducible research. A core focus is the critical analysis of fundamental challenges, namely the sim-to-real gap, perception robustness, reasoning with language ambiguity [...]

[...] Process. Section 3 presents our methodological taxonomy, charting the evolution of agent architectures from modular and early learning approaches to [...]
