This paper aims to survey the evolution towards depth foundation models and paradigms for depth estimation across the monocular, stereo, multi-view, and monocular video settings.

• We explore the development of deep learning model architectures and learning paradigms for each task and identify key paradigms with foundational capability or potential.
• To aid the development of such depth foundation models, we also provide comprehensive surveys on large-scale datasets in each respective subfield.
• We also list the current key challenges faced by the foundational architectures in each task to provide insight into future works.

2 SURVEY SCOPE

This paper primarily concentrates on depth estimation methods that leverage deep learning, with a particular emphasis on foundation models that utilize large-scale architectures and extensive datasets. We begin by defining depth foundation models and then outline the depth estimation tasks that will be addressed in the following sections.

2.1 Definition of Depth Foundation Models

We provide a brief overview of the development of foundation models in language modeling to facilitate the understanding of depth foundation models. The field of language modeling has experienced explosive growth with the establishment of foundation models in recent years. This progress stems from the ability of these models to learn universal language patterns from massive datasets, enabling them to generalize powerfully across various downstream tasks.

Convolutional neural networks and long short-term memory networks [51] played the main role at the early stage of language models, with limited network and data scales. The concept of word embeddings [52] and the introduction of the self-attention mechanism [53] allowed models to process all words in a sequence simultaneously, vastly improving parallel computation efficiency and the ability to capture long-range dependencies. The original Transformer model had a relatively small number of parameters, but its architecture laid the groundwork for subsequent large-scale models. BERT [54] and GPT [55] can be considered the beginning of foundation models among large language models (LLMs).

Proposed by Google, BERT is a bidirectional pre-trained model based on the Transformer architecture, enabling a better understanding of the polysemy of words in a sentence. BERT is trained on the Toronto BookCorpus (800 million words) and English Wikipedia (2.5 billion words); BERT-Base has 110 million parameters, and BERT-Large has 340 million. Proposed by OpenAI, GPT is a unidirectional generative pre-trained model based on the Transformer architecture. GPT models learn language patterns by predicting the next word, excelling in text generation tasks. GPT-3 is trained on a dataset larger than 45 TB and has 175 billion parameters.

The development of depth estimation models is illustrated in Fig. 1. Considering the scale of foundation models in the area of language models, we define a depth foundation model as one that is trained on a large-scale dataset (over 10 million images) and employs a model with a substantial number of parameters (over 1 billion).
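The quantitative part of this definition can be stated as a simple predicate. This is only an illustrative sketch; the function name and thresholds-as-code framing are ours, not the survey's, and the qualitative generalizability criterion is not captured here.

```python
def is_depth_foundation_model(n_train_images: int, n_params: int) -> bool:
    """Check the two quantitative criteria from the definition above:
    training set of over 10 million images and over 1 billion parameters.
    (Cross-domain generalizability is qualitative and not checked here.)"""
    return n_train_images > 10_000_000 and n_params > 1_000_000_000

# Hypothetical model scales, for illustration only:
print(is_depth_foundation_model(60_000_000, 1_300_000_000))  # True
print(is_depth_foundation_model(1_000_000, 300_000_000))     # False
```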
Additionally, depth foundation models should exhibit strong generalizability across multiple data domains.

2.2 Depth Estimation Tasks

This survey covers several tasks, including monocular depth estimation, stereo depth estimation, multi-view depth estimation, and monocular video depth estimation using foundation models. Let I = {I_{k,t}, k = 1, ..., K, t = 1, ..., T} represent a collection of RGB images, where K denotes the number of cameras and T is the number of timestamps for the frames. In the case of monocular depth estimation, the input consists of a single image I_{1,1}. For stereo depth estimation, the input comprises a pair of images {I_{1,1}, I_{2,1}}. In multi-view depth estimation, the input is a set of images captured at the same timestamp but varying in spatial location, represented as {I_{k,1}, k = 1, ..., K}. For monocular video depth estimation, the input consists of a sequence of images captured by a monocular camera at different timestamps, represented as {I_{1,t}, t = 1, ..., T}. The scope of our survey excludes the task of multi-view video depth estimation, which can be represented as the most general form of input, {I_{k,t}, k = 1, ..., K, t = 1, ..., T}, because foundation models for this task have not yet been thoroughly explored.

For each task, we begin by reviewing the background and evolution of deep learning models specific to the task. We then delve into the development of foundation models; prominent examples include transformer-based models and diffusion models. Furthermore, we discuss the large-scale datasets used for training these foundation models, encompassing both synthetic and real-world datasets, which enable the models to generalize effectively across diverse scenes. Finally, we address open problems faced by existing depth foundation models.

3 OVERVIEW OF DEPTH ESTIMATION

In this section, we provide an overview
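The four input configurations defined in Section 2.2 can be sketched as index selections over a (camera, timestamp) grid of images. This is a toy illustration of the notation only; the collection layout and function names are our assumptions, not part of the survey.

```python
# Sketch of the four depth-estimation input settings: an image collection
# I = {I_{k,t}} is modeled as a dict keyed by (camera k, timestamp t).
# Strings stand in for images; names are illustrative only.

def make_collection(K, T):
    """Build a toy collection I = {I_{k,t}} with placeholder 'images'."""
    return {(k, t): f"I_{k},{t}"
            for k in range(1, K + 1) for t in range(1, T + 1)}

def monocular(I):
    return [I[(1, 1)]]                           # single image I_{1,1}

def stereo(I):
    return [I[(1, 1)], I[(2, 1)]]                # pair {I_{1,1}, I_{2,1}}

def multi_view(I, K):
    return [I[(k, 1)] for k in range(1, K + 1)]  # one timestamp, all cameras

def monocular_video(I, T):
    return [I[(1, t)] for t in range(1, T + 1)]  # one camera, all timestamps

I = make_collection(K=3, T=4)
print(monocular(I))          # ['I_1,1']
print(stereo(I))             # ['I_1,1', 'I_2,1']
print(multi_view(I, 3))      # ['I_1,1', 'I_2,1', 'I_3,1']
print(monocular_video(I, 4)) # ['I_1,1', 'I_1,2', 'I_1,3', 'I_1,4']
```

The excluded multi-view video setting would simply be the full grid, i.e. all K × T entries of the collection.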




