This paper aims to survey the evolution towards depth foundation models and paradigms for depth estimation across the monocular, stereo, multi-view, and monocular video settings.

• We explore the development of deep learning model architectures and learning paradigms for each task and identify key paradigms with foundational capability or potential.
• To aid the development of such depth foundation models, we also provide comprehensive surveys on large-scale datasets in each respective subfield.
• We also list the current key challenges faced by the foundational architectures in each task to provide insight into future works.

2 SURVEY SCOPE

This paper primarily concentrates on depth estimation methods that leverage deep learning, with a particular emphasis on foundation models that utilize large-scale architectures and extensive datasets. We begin by defining depth foundation models and then outline the depth estimation tasks that will be addressed in the following sections.

2.1 Definition of Depth Foundation Models

We provide a brief overview of the development of foundation models in language modeling to facilitate the understanding of depth foundation models. The field of language modeling has experienced explosive growth with the establishment of foundation models in recent years. This progress stems from the ability of these models to learn universal language patterns from massive datasets, enabling them to generalize powerfully across various downstream tasks.

Convolutional neural networks and long short-term memory networks [51] played the main role at the early stage of language models, with limited network and data scales. The concept of word embeddings [52] and the introduction of the self-attention mechanism [53] allowed models to process all words in a sequence simultaneously, vastly improving parallel computation efficiency and the ability to capture long-range dependencies. The original Transformer model had a relatively small number of parameters, but its architecture laid the groundwork for subsequent large-scale models. BERT [54] and GPT [55] can be considered the beginning of foundation models among large language models (LLMs).

Proposed by Google, BERT is a bidirectional pre-trained model based on the Transformer architecture, enabling a better understanding of the polysemy of words in a sentence. BERT is trained on the Toronto BookCorpus (800 million words) and English Wikipedia (2.5 billion words); BERT-Base has 110 million parameters, and BERT-Large has 340 million. Proposed by OpenAI, GPT is a unidirectional generative pre-trained model based on the Transformer architecture. GPT models learn language patterns by predicting the next word, excelling in text generation tasks. GPT-3 is trained on a dataset larger than 45 TB and has 175 billion parameters.

The development of depth estimation models is illustrated in Fig. 1. Considering the scale of foundation models in the area of language models, we define a depth foundation model as one that is trained on a large-scale dataset (over 10 million images) and employs a model with a substantial number of parameters (over 1 billion).
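The quantitative part of this definition can be stated as a simple predicate. This is only an illustrative sketch; the function name and thresholds-as-code framing are ours, not the survey's, and the qualitative generalizability criterion is not captured here.

```python
def is_depth_foundation_model(n_train_images: int, n_params: int) -> bool:
    """Check the two quantitative criteria from the definition above:
    training set of over 10 million images and over 1 billion parameters.
    (Cross-domain generalizability is qualitative and not checked here.)"""
    return n_train_images > 10_000_000 and n_params > 1_000_000_000

# Hypothetical model scales, for illustration only:
print(is_depth_foundation_model(60_000_000, 1_300_000_000))  # True
print(is_depth_foundation_model(1_000_000, 300_000_000))     # False
```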
Additionally, depth foundation models should exhibit strong generalizability across multiple data domains.

2.2 Depth Estimation Tasks

This survey covers several tasks, including monocular depth estimation, stereo depth estimation, multi-view depth estimation, and monocular video depth estimation using foundation models. Let I = {I_{k,t}, k = 1, ..., K, t = 1, ..., T} represent a collection of RGB images, where K denotes the number of cameras and T is the number of timestamps for the frames. In the case of monocular depth estimation, the input consists of a single image I_{1,1}. For stereo depth estimation, the input comprises a pair of images {I_{1,1}, I_{2,1}}. In multi-view depth estimation, the input is a set of images captured at the same timestamp but varying in spatial location, represented as {I_{k,1}, k = 1, ..., K}. For monocular video depth estimation, the input consists of a sequence of images captured by a monocular camera at different timestamps, represented as {I_{1,t}, t = 1, ..., T}. The scope of our survey excludes the task of multi-view video depth estimation, which can be represented as the most general form of input, {I_{k,t}, k = 1, ..., K, t = 1, ..., T}, because foundation models for this task have not yet been thoroughly explored.

For each task, we begin by reviewing the background and evolution of deep learning models specific to the task. We then delve into the development of foundation models; prominent examples include transformer-based models and diffusion models. Furthermore, we discuss the large-scale datasets used for training these foundation models, encompassing both synthetic and real-world datasets, which enable the models to generalize effectively across diverse scenes. Finally, we address open problems faced by existing depth foundation models.

3 OVERVIEW OF DEPTH ESTIMATION

In this section, we provide an overview
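The four input configurations defined in Section 2.2 can be sketched as index selections over a (camera, timestamp) grid of images. This is a toy illustration of the notation only; the collection layout and function names are our assumptions, not part of the survey.

```python
# Sketch of the four depth-estimation input settings: an image collection
# I = {I_{k,t}} is modeled as a dict keyed by (camera k, timestamp t).
# Strings stand in for images; names are illustrative only.

def make_collection(K, T):
    """Build a toy collection I = {I_{k,t}} with placeholder 'images'."""
    return {(k, t): f"I_{k},{t}"
            for k in range(1, K + 1) for t in range(1, T + 1)}

def monocular(I):
    return [I[(1, 1)]]                           # single image I_{1,1}

def stereo(I):
    return [I[(1, 1)], I[(2, 1)]]                # pair {I_{1,1}, I_{2,1}}

def multi_view(I, K):
    return [I[(k, 1)] for k in range(1, K + 1)]  # one timestamp, all cameras

def monocular_video(I, T):
    return [I[(1, t)] for t in range(1, T + 1)]  # one camera, all timestamps

I = make_collection(K=3, T=4)
print(monocular(I))          # ['I_1,1']
print(stereo(I))             # ['I_1,1', 'I_2,1']
print(multi_view(I, 3))      # ['I_1,1', 'I_2,1', 'I_3,1']
print(monocular_video(I, 4)) # ['I_1,1', 'I_1,2', 'I_1,3', 'I_1,4']
```

The excluded multi-view video setting would simply be the full grid, i.e. all K × T entries of the collection.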




