Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, Cheston Tan

Abstract—There has been an emerging paradigm shift from the era of "internet AI" to "embodied AI", where AI algorithms and agents no longer learn from datasets of images, videos or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support various embodied AI research tasks. This growing interest in embodied AI is beneficial to the greater pursuit of Artificial General Intelligence (AGI), but there has not been a contemporary and comprehensive survey of this field. This paper aims to provide an encyclopedic survey of the field of embodied AI, from its simulators to its research, covering nine current embodied AI simulators and three main research tasks.

Index Terms—Embodied AI, Computer Vision, 3D Simulators.

I. INTRODUCTION

RECENT advances in deep learning, reinforcement learning, computer graphics and robotics have garnered growing interest in developing general-purpose AI systems. As a result, there has been a shift from "internet AI", which focuses on learning from datasets of images, videos and text curated from the internet, towards "embodied AI", which enables artificial agents to learn through interactions with their surrounding environments. Embodied AI is the belief that true intelligence can emerge from the interactions of an agent with its environment [1]. But for now, embodied AI is about incorporating traditional intelligence concepts from vision, language, and reasoning into an artificial embodiment to help solve AI problems.

The growing interest in embodied AI has led to significant progress in embodied AI simulators that aim to faithfully replicate the physical world. These simulated worlds serve as virtual testbeds in which to train and test embodied AI frameworks before deploying them into the real world. Embodied AI simulators also facilitate the collection of task-based datasets [2], [3], which are tedious to collect in the real world because replicating the same settings as in the virtual world requires an extensive amount of manual labor.

While there have been several earlier surveys, there is a scarcity of contemporary, comprehensive survey papers on this emerging field. To address this, we propose this survey of embodied AI, from its simulators to its research tasks. This paper covers the following nine embodied AI simulators developed over the past four years: DeepMind Lab [12], AI2-THOR [13], CHALET [14], VirtualHome [15], VRKitchen [16], Habitat-Sim [17], iGibson [18], SAPIEN [19], and ThreeDWorld [20]. The chosen simulators are designed for general-purpose intelligence tasks, unlike game simulators [21], which are only used for training reinforcement learning agents.

Embodied AI simulators have given rise to a series of potential embodied AI research tasks, such as visual exploration, visual navigation and embodied QA. We will focus on these three tasks, since most existing papers [11], [22], [23] in embodied AI either focus on these tasks or make use of modules introduced for these tasks to build models for more complex tasks such as audio-visual navigation. These three tasks are also connected in increasing complexity: visual exploration is a very useful component in visual navigation [22], [24].

Game-based scenes are constructed from 3D assets, while world-based scenes are constructed from real-world scans of the objects and the environment. A 3D environment constructed entirely out of 3D assets often has built-in physics features and well-segmented object classes, compared to a 3D mesh of an environment made from real-world scanning. The clear object segmentation of 3D assets makes it easy to model them as articulated objects with movable joints.

This work was supported by the Agency for Science, Technology and Research (A*STAR), Singapore under its AME Programmatic Funding Scheme (Award #A18A2b0046) and the National Research Foundation, Singapore under its NRF-ISF Joint Call (Award NRF2015-NRF-ISF001-2541).
J. Duan was with the School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore 639798, Singapore (e-mail: duan0038@e.ntu.edu.sg).
S. Yu was with the Singapore University of Technology and Design, Singapore 487372, Singapore (e-mail: samsonyu@sutd.edu.sg).
H.L. Tan, H. Zhu, and C. Tan are with the Institute for Infocomm Research, A*STAR, Singapore 138632, Singapore (e-mail: {duan jiafei, hltan, zhuh, cheston-tan}@i2r.a-star.edu.sg).
Manuscript accepted December 4, 2021, IEEE-TETCI.
arXiv:2103.04918v8 [cs.AI] 5 Jan 2022.
©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.