Vision-and-Language Navigation for Unmanned Aerial Vehicles: Progress, Challenges, and a Research Roadmap

Defense and Military Industry 2026-04-15 - Unknown institution 华仔

Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng, Siwei Feng, Songsheng Cheng, Ruilong Ren, Hanzhong Guo, Shuai Yuan, Xiangyue Wang, Kangli Wang, and Ji Pei

Abstract—Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically-grounded reasoning. The survey systematically reviews the ecosystem of essential resources simulators, datasets, and

Index Terms—vision-language navigation, unmanned aerial vehicles, embodied AI, vision-language-action models, world models, sim-to-real transfer

intelligent, language-driven autonomy has motivated a surge of research [15], necessitating a structured synthesis to guide future progress in this domain of applied spatial intelligence

1 Introduction

Enabling an Unmanned Aerial Vehicle (UAV) to navigate a complex, three-dimensional world from a simple human command such as "fly past the collapsed bridge and find people waving from the rooftops" represents a pivotal challenge at the intersection of robotics, computer vision, and natural language understanding.
This capability, known as UAV-based Vision-and-Language Navigation (UAV-VLN), is a crucial subfield within the broader pursuit of embodied artificial intelligence, which aims to develop autonomous agents that interpret linguistic instructions and execute long-horizon tasks in the physical world [1], [2], [3]. The transition towards

The field of UAV-VLN is currently experiencing a paradigm shift, propelled by the confluence of mature aerial platforms and the transformative capabilities of large foundation models [18], [19]. This evolution marks a significant departure from earlier modular pipelines toward integrated Embodied Multimodal Large Models (EMLMs) that unify perception, reasoning, and control into a cohesive framework [20], [21], [22], [23]. Most notably, the latest frontier involves the deep integration of generative world models with Vision-Language-Action (VLA) policies, as seen in models like π0 [24], GR00T N1 [25], and Cosmos-Reason1 [26], which equip agents with physical common sense and predictive capabilities for robust, long-horizon reasoning. While substantial progress has been made in Vision-and-Language Navigation (VLN) for ground-based robots since the seminal work on navigating from photorealistic images [27], [28], the aerial domain introduces a distinct and more complex set of challenges, a gap first systematically addressed by benchmarks like AerialVLN [29]. These challenges, which have historically limited research in outdoor aerial settings [30], include navigating continuous 3D action spaces without predefined graphs, performing

arXiv:2604.13654v1 [cs.RO] 15 Apr 2026

operations and making sophisticated aerial platforms more accessible [4], [5], [6]. The real-world applications are profound, spanning time-critical search and rescue operations in GPS-denied environments [7], [8], [9], wildfire monitoring [10], automated inspection of large-scale infrastructure [11],

Fig. 1: An overview of the UAV-based Vision-and-Language Navigation (UAV-VLN) research landscape, illustrating the core components and methodological evolution. The process begins with a natural language command that directs an autonomous agent. This figure contrasts the traditional modular pipeline, which separates perception, reasoning, and control, with the modern integrated approach centered on Embodied Multimodal Large Models (EMLMs). The UAV executes the resulting

survey of UAV-based Vision-and-Language Navigation, charting the field from its conceptual foundations to the current state of the art. Our scope encompasses the entire research pipeline, commencing with the formal mathematical definition of the task and tracing the methodological evolution from early learning systems to contemporary agents driven by foundation models [37], [38]. We survey the critical ecosystem of resources including high-fidelity simulators, benchmark datasets for diverse domains such as agriculture and urban reconnaissance [39], [40], and standardized evaluation metrics that underpin reproducible research. A core focus is the critical analysis of fundamental challenges, namely the sim-to-real gap, perception robustness, reasoning with language ambigu-

Process. Section 3 presents our methodological taxonomy, charting the evolution of agent architectures from modular and early learning approaches to