Chen Liang1,4, Jiawen Zheng1, Yufeng Zeng1, Yi Tan3, Hengye Lyu1, Yuhui Zheng1, Zisu Li2, Yueting Weng4, Jiaxin Shi4 and Hanwang Zhang1

1The Hong Kong University of Science and Technology (Guangzhou)
2The Hong Kong University of Science and Technology
3Nanyang Technological University
4XMax.AI Ltd.
Contact: chenliang2@hkust-gz.edu.cn, jiaxin@xmax.ai, hanwangzhang@ntu.edu.sg

arXiv:2511.16783v1 [cs.HC] 20 Nov 2025

Abstract

This paper introduces Generative Augmented Reality (GAR) as a next-generation paradigm that reframes augmentation as a process of world re-synthesis rather than world composition by a conventional AR engine. GAR replaces the conventional AR engine's multi-stage modules with a unified generative backbone, in which environmental sensing, virtual content, and interaction signals are jointly encoded as conditioning inputs for continuous video generation. We formalize the computational correspondence between AR and GAR, survey the technical foundations that make real- …

1 Introduction

Augmented Reality (AR) emerged in response to the long-standing goal of blending digital content with physical environments, grounded in users' real-world perception and action. Early formulations, such as Caudell and Mizell [1992]'s work on overlaying digital instructions for aircraft assembly and Milgram and Kishino [1994]'s Reality–Virtuality continuum, situated AR as an intermediate blend between virtual reality and physical reality. As advances in sensing, spatial tracking, and real-time rendering [Azuma, 1997a] made …

However, as technological progress elevates expectations for content fidelity, interaction precision, and naturalistic responsiveness in AR, the compositional paradigm underlying conventional AR architectures reveals inherent constraints. … pipelines. This structure makes it difficult to synthesize high-fidelity interactions, such as fluid material behaviors, complex mechanical dynamics, and even the responsiveness of living creatures. Scaling toward broader expressive spaces often increases authoring burden and system fragility: producing high-fidelity 3D assets demands substantial manual …

In parallel, the rapid advancement of generative models, particularly diffusion-based video generation models [Ho et al., 2022; Kong et al., 2024], has introduced a fundamentally different way of constructing visual experience. These models are capable of producing temporally coherent, semantically grounded videos of both the physical and the imaginary worlds, and beyond, from high-level conditions such as textual intent [Luo et al., 2023], motion cues [Bai et al., 2025], reference frames [Hu, 2024], or behavioral signals [Guo et al., 2025]. Rather than treating scenes as fixed backdrops for augmentation, generative video models represent reality as a learnable, extendable process, where physi- …

This paper presents a forward-looking conceptual and technical survey of Generative Augmented Reality as a computational framework for next-generation spatial computing. Our contributions are as follows:

• We formalize the computational transition from compositional AR pipelines to generative world re-synthesis, providing a comparative formulation of their perceptual grounding, control flow, asset management, and rendering mechanisms.
• We survey the enabling technologies underlying GAR, including streaming video generation models, compu- … building, and mixed-reality ecosystems.

2 Generative Augmented Reality: The Next Generation of Spatial Computing

In this section, we present the paradigm of Generative Augmented Reality (GAR) in the context of the rapid development of generative video models. GAR rethinks the pathways to achieving augmentation of reality, representing a shift in the …

To ground this paradigm, we first revisit the fundamentals of traditional augmented reality (AR), outline its technology stack and implementation hierarchy, and then explain how GAR transforms this architecture into a model-driven …

2.1 Concept of Augmented Reality

The conceptual basis of AR was first formalized by Milgram et al. [1995] through the Reality–Virtuality Continuum, which positioned AR within a spectrum ranging from purely physical to fully virtual environments. Later, Azuma [1997a] characterized AR by three essential properties: 1) the combination of real and virtual content, 2) real-time interactivity, and 3) accurate three-dimensional registration.

Building on these principles, Craig [2013] and Billinghurst et al. [2015] summarized AR as a multidisciplinary synthesis of computer vision, graphics, sensing, and interaction, designed to enable spatial coherence between the physical and virtual worlds. These frameworks define AR as a …

More recently, Mendoza-Ramírez et al. [2023] highlight advances in semantic anchoring and adaptive context modeling that extend AR beyond geometric registration, while Auda et al. [2023] frame AR within cross-reality systems emphasizing embodied and context-driven interaction. Together, these recent perspectives expand foundational def- …

2.2 Traditional Augmented Reality Architecture and Technical Stacks
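One way to make the AR-versus-GAR correspondence concrete is the following informal sketch. The notation is ours, introduced purely for illustration and not taken from the paper: o_t denotes sensor observations, V_t virtual assets, u_t user interaction signals, I_t the displayed frame, and p_θ the generative backbone.

```latex
% Illustrative notation (ours, not the paper's):
% o_t = sensor observations, V_t = virtual assets,
% u_t = user interaction, I_t = displayed frame.
\begin{align*}
\text{Conventional AR:}\quad
  & I_t = \mathrm{Render}\bigl(\mathrm{Compose}(\mathrm{Perceive}(o_t),\, V_t,\, u_t)\bigr) \\
\text{GAR:}\quad
  & I_t \sim p_\theta\bigl(I_t \mid I_{<t},\, c_t\bigr),
  \qquad c_t = \mathrm{Enc}(o_t,\, V_t,\, u_t)
\end{align*}
```

Under this reading, the conventional engine composes perception, assets, and rendering as explicit stages, whereas GAR folds all three into the conditioning signal c_t of a continuous video generation model.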