
Generative Augmented Reality: Paradigms, Technologies, and Future Applications


Chen Liang¹,⁴, Jiawen Zheng¹, Yufeng Zeng¹, Yi Tan³, Hengye Lyu¹, Yuhui Zheng¹, Zisu Li², Yueting Weng⁴, Jiaxin Shi⁴ and Hanwang Zhang³

¹The Hong Kong University of Science and Technology (Guangzhou)
²The Hong Kong University of Science and Technology
³Nanyang Technological University
⁴XMax.AI Ltd.

Contact: chenliang2@hkust-gz.edu.cn, jiaxin@xmax.ai, hanwangzhang@ntu.edu.sg

arXiv:2511.16783v1 [cs.HC] 20 Nov 2025

Abstract

This paper introduces Generative Augmented Reality (GAR) as a next-generation paradigm that reframes augmentation as a process of world re-synthesis rather than world composition by a conventional AR engine. GAR replaces the conventional AR engine's multi-stage modules with a unified generative backbone, where environmental sensing, virtual content, and interaction signals are jointly encoded as conditioning inputs for continuous video generation. We formalize the computational correspondence between AR and GAR, survey the technical foundations that make real-[...] building, and mixed-reality ecosystems.

1 Introduction

Augmented Reality (AR) emerged in response to the long-standing goal of blending digital content with physical environments, grounded in users' real-world perception and action. Early formulations, such as Thomas and David [1992]'s work on overlaying digital instructions for aircraft assembly and Milgram and Kishino [1994]'s Reality–Virtuality continuum, situated AR as an intermediate blend between virtual reality and physical reality. As advances in sensing, spatial tracking, and real-time rendering [Azuma, 1997a] made [...]

However, as technological progress elevates expectations for content fidelity, interaction precision, and naturalistic responsiveness in AR, the compositional paradigm underlying conventional AR architectures reveals inherent constraints. [...] pipelines. This structure makes it difficult to synthesize high-fidelity interactions such as fluid material behaviors, complex mechanical dynamics, and even the responsiveness of living creatures. Scaling toward broader expressive spaces often increases authoring burden and system fragility: producing high-fidelity 3D assets demands substantial manual [...]

In parallel, the rapid advancement of generative models, particularly diffusion-based video generation models [Ho et al., 2022; Kong et al., 2024], has introduced a fundamentally different way of constructing visual experience. These models can produce temporally coherent, semantically grounded videos that depict, and extend beyond, both physical and imaginary world content, driven by high-level conditions such as textual intent [Luo et al., 2023], motion cues [Bai et al., 2025], reference frames [Hu, 2024], or behavioral signals [Guo et al., 2025]. Rather than treating scenes as fixed backdrops for augmentation, generative video models represent reality as a learnable, extendable process, where physi[...]

This paper presents a forward-looking conceptual and technical survey of Generative Augmented Reality as a computational framework for next-generation spatial computing. Our contributions are as follows:

• We formalize the computational transition from compositional AR pipelines to generative world re-synthesis, providing a comparative formulation of their perceptual grounding, control flow, asset management, and rendering mechanisms (see the sketch after this list).
• We survey the enabling technologies underlying GAR, including streaming video generation models, compu[...]
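To preview that comparative formulation, one plausible way to write the per-frame contrast is given below. The notation is ours, not necessarily the paper's: A for authored assets, p_t for the tracked pose, I_t for the camera frame, R for the renderer, G_θ for the generative backbone, and c_t for the joint conditioning signal.

```latex
% Conventional AR: per-frame composition of pre-authored assets,
% followed by explicit rendering over the camera frame
y_t = \mathcal{R}\big(\mathrm{Compose}(\mathcal{A},\, p_t),\; I_t\big)

% GAR: continuous re-synthesis by a single generative backbone,
% conditioned jointly on sensing, content intent, and interaction
y_t = G_\theta\big(y_{<t},\, \mathbf{c}_t\big),
\qquad
\mathbf{c}_t = \mathrm{Enc}\big(I_t,\ \mathrm{intent},\ \mathrm{interaction}_t\big)
```

In the first equation the pipeline factors into independently engineered modules (tracking, asset management, rendering); in the second, those responsibilities collapse into one model whose conditioning input carries the environmental, content, and interaction signals described in the abstract.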
2 Generative Augmented Reality: The Next Generation of Spatial Computing

In this section, we present the paradigm of Generative Augmented Reality (GAR) in the context of the rapid development of generative video models. GAR rethinks the pathways to achieve augmentation of reality, representing a shift in the [...]

To ground this paradigm, we first revisit the fundamentals of traditional augmented reality (AR), outline its technology stack and implementation hierarchy, and then explain how GAR transforms this architecture into a model-driven [...]

2.1 Concept of Augmented Reality

The conceptual basis of AR was first formalized by Milgram et al. [1995] through the Reality–Virtuality Continuum, which positioned AR within a spectrum ranging from purely physical to fully virtual environments. Later, Azuma [1997a] defined AR by three essential characteristics: 1) the combination of real and virtual content, 2) real-time interactivity, and 3) accurate three-dimensional registration.

Building on these principles, Craig [2013] and Billinghurst et al. [2015] summarized AR as a multidisciplinary synthesis of computer vision, graphics, sensing, and interaction, designed to enable spatial coherence between the physical and virtual worlds. These frameworks define AR as a [...]

More recently, Mendoza-Ramírez et al. [2023] highlight advances in semantic anchoring and adaptive context modeling that extend AR beyond geometric registration, while Auda et al. [2023] frame AR within cross-reality systems, emphasizing embodied and context-driven interaction. Together, these recent perspectives expand foundational def[...]

2.2 Traditional Augmented Reality Architecture and Technical Stacks
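The excerpt of Section 2.2 breaks off here. As an illustrative companion to the architectural contrast above, the sketch below shows how the two paradigms differ as frame loops. It is a minimal, hypothetical illustration under our own naming: none of these classes or functions come from the paper or from any AR/ML toolkit, and the generator step is a stub standing in for a streaming video-generation backbone.

```python
"""Minimal, hypothetical contrast between a compositional AR frame loop
and a GAR-style generative loop. All names are illustrative stand-ins."""

import numpy as np

H, W = 480, 640  # stand-in frame resolution


def sense_environment(t: int) -> dict:
    """Stand-in for camera capture plus SLAM: a frame and a 6-DoF pose."""
    frame = np.zeros((H, W, 3), dtype=np.uint8)  # camera image placeholder
    pose = np.eye(4)                             # tracked camera pose
    return {"frame": frame, "pose": pose, "time": t}


# --- Conventional AR: explicit multi-stage composition -------------------
def ar_frame(sensing: dict, assets: list) -> np.ndarray:
    """Overlay pre-authored assets onto the camera frame. Tracking,
    registration, and rendering are separate, explicit stages."""
    out = sensing["frame"].copy()
    for asset in assets:
        # Registration stub: a real system would project the asset into
        # screen space using sensing["pose"]; we use a fixed position.
        u, v = 100, 100
        out[v:v + asset.shape[0], u:u + asset.shape[1]] = asset
    return out


# --- GAR: unified generative re-synthesis ---------------------------------
def generator_step(conditioning: dict) -> np.ndarray:
    """Stub for one rollout step of a streaming video-generation model."""
    return np.zeros((H, W, 3), dtype=np.uint8)


def gar_frame(history: list, sensing: dict, intent: str,
              interaction: dict) -> np.ndarray:
    """Re-synthesize the next frame: sensing, content intent, and
    interaction signals are jointly encoded as one conditioning input."""
    conditioning = {
        "context": history[-4:],     # recent frames for temporal coherence
        "sensing": sensing,          # environmental grounding
        "intent": intent,            # e.g. a text prompt for new content
        "interaction": interaction,  # e.g. gaze, touch, controller state
    }
    return generator_step(conditioning)


if __name__ == "__main__":
    assets = [np.full((32, 32, 3), 255, dtype=np.uint8)]  # one white square
    history: list = []
    for t in range(3):
        s = sense_environment(t)
        composed = ar_frame(s, assets)  # AR path: compose then render
        generated = gar_frame(history or [s["frame"]], s,  # GAR path
                              intent="a glass of water tipping over",
                              interaction={"touch": None})
        history.append(generated)
        print(t, composed.shape, generated.shape)
```

The point of the sketch is structural: in the AR path, tracking, registration, and rendering remain separately engineered stages operating on pre-authored assets, while in the GAR path all signals are folded into a single conditioning dictionary consumed by one generator call, mirroring the unified generative backbone described in the abstract.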