Trustworthy Large Visual Generative Models under Physical Constraints
Siyu Zhu (朱思语), Fudan University

Speaker
Siyu Zhu, Professor, Fudan University
Siyu Zhu is a researcher, tenured full professor, and doctoral supervisor at the Institute of Artificial Intelligence Innovation and Industry, Fudan University. He received his bachelor's degree from Zhejiang University and his PhD from the Hong Kong University of Science and Technology. During his PhD he co-founded the 3D vision company Altizure, which was later acquired by Apple. From 2017 to 2023 he was a director at the Alibaba Cloud AI Lab. He joined Fudan University in 2023. His main research interests are video and 3D generative models, covering vision-based reconstruction, generation, understanding, and simulation of 3D scenes and video. He has published more than 60 papers at top computer vision and machine learning venues, including CVPR, ICCV, ICLR, and TPAMI, and his work includes video generation models with industry impact such as Hallo, Champ, and AnimateAnything. He has ranked first in more than 40 international computer vision competitions and leaderboards.

Visual generative model

Video generative methods
• The field of video generation has seen rapid development, reaching several milestones...

Diffusion for visual generation (1)
• Denoising Diffusion Probabilistic Models (DDPMs)

Diffusion for visual generation (2)
• Stochastic Differential Equations (Score SDEs)

Key elements of visual diffusion models
• UNet
• Transformer
• Latent space diffusion

Sora, breakthrough
• Consistency: consistency in 3D rendering, long-range coherence, and object permanence.
• High fidelity.
• Surprising length: extended video length capability (Sora: 1 minute vs. previous systems: seconds).
• Flexible resolution: generation of videos across various durations, aspect ratios, and resolutions.

Sora, key technologies
• The DiT framework by Meta (2022.12) is designed for video processing.
• Google's MAGViT (2022.12) focuses on video tokenization.
• Google DeepMind introduced NaViT (2023.07) to support various resolutions and aspect ratios.
• OpenAI's DALL-E 3 (2023.09) enhances video caption generation for improved conditioned video creation.

Modeling the physical world
• We know that the real physical world is very complicated to model.
probabilistic
• Bayesian inference;
• probabilistic graphical models.
deterministic
• mathematical equations;
• physics-based simulation;
• control theory.
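
As a concrete instance of the probabilistic route, the DDPM named in the diffusion slides above corrupts data with Gaussian noise in closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with abar_t the running product of (1 - beta_s). The sketch below is a minimal standard-library illustration of that forward process only. The function names are hypothetical, and the linear schedule with endpoints 1e-4 and 0.02 is assumed from the original DDPM paper rather than taken from the talk.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in one step.

    Returns the noisy sample and the noise that was used, since the noise
    is the regression target when training a denoiser.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    rng = random.Random(0)
    x0 = [1.0, -2.0, 0.5]  # a toy "image" with three pixels
    for t in (0, 499, 999):
        xt, _ = q_sample(x0, t, alpha_bars, rng)
        print(t, round(alpha_bars[t], 6), [round(v, 3) for v in xt])
```

At small t the sample stays close to x_0, and at the last step abar_t is close to zero, so x_t is almost pure noise. A trained model learns to reverse this corruption step by step.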

Key elements of a physical world
• Given a Sora demo (the woman walking down a Tokyo street), the key elements of a physical world, from a graphics point of view...
• Appearance
• Geometry
• Lighting
• Motion & Animation
• Audio

Modeling the physical world
• [CVPR] Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle

It is hard to model the physical world
• In fact, the world is hard to model in a probabilistic way.
• Sora resource consumption...
– 1 billion images;
– 1 million hours of video data;
– 10 trillion tokens after tokenizing images and videos;
– training with ~5,000 A100s in parallel.

It is hard to model the physical world
• Sora failure case in geometry and appearance.

It is hard to model the physical world
• Sora failure case in lighting.

It is hard to model the physical world
• Sora failure case in motion and animation.

It is hard to model the physical world
• Geometric enhancement is still needed for multi-view images.

It is hard to model the physical world
• VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model
• From a static aspect, SVD is able to model multi-view images.

It is hard to model the physical world
• STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians
• From a temporal aspect...

It is hard to model the physical world
• Ilya Sutskever: compression is generalization.
• The best lossless compression for a dataset is the best generalization for data outside the dataset.

Apply the deterministic conditions
• Different representations of deterministic conditions in the physical world.
• Much less data and far fewer parameters!
Motion & Animation

Apply the deterministic conditions
• There are two ways to inject deterministic information.
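
The slide text does not name the two ways. As an assumption, not the speaker's list, the sketch below shows two injection mechanisms that are common in conditional diffusion models: adding a pixel-aligned condition map to the noisy latent, and letting latent tokens cross-attend to a set of condition tokens. All function names are hypothetical. The attention is deliberately bare: no learned projections, and the condition tokens serve as both keys and values.

```python
import math


def add_spatial_condition(latent, cond_map, scale=1.0):
    """Assumed way A: add a condition map of the same shape to the latent.

    Suits conditions that are aligned with the image grid, such as a
    rendered depth, normal, or pose map.
    """
    return [[z + scale * c for z, c in zip(z_row, c_row)]
            for z_row, c_row in zip(latent, cond_map)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, cond_tokens):
    """Assumed way B: each latent token attends over the condition tokens.

    Suits conditions with no spatial alignment, such as audio or text
    embeddings. Keys and values are the raw condition tokens.
    """
    dim = len(cond_tokens[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in cond_tokens]
        weights = softmax(scores)
        attended.append([sum(w * tok[i] for w, tok in zip(weights, cond_tokens))
                         for i in range(dim)])
    return attended


if __name__ == "__main__":
    latent = [[0.2, -0.1], [0.4, 0.0]]      # toy 2x2 noisy latent
    pose_map = [[1.0, 0.0], [0.0, 1.0]]     # toy pixel-aligned condition
    print(add_spatial_condition(latent, pose_map, scale=0.5))

    latent_tokens = [[1.0, 0.0], [0.0, 1.0]]
    audio_tokens = [[2.0, 0.0], [0.0, 2.0]]  # toy condition embeddings
    print(cross_attend(latent_tokens, audio_tokens))
```

In a real network both paths would pass through learned layers first. The point of the sketch is only where the deterministic signal enters the denoiser.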

Image Human Animation
• Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

Image Portrait Animation
• Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Dynamic Protein Structure Prediction
• 4D Diffusion for Dynamic Protein Structure Prediction with Reference Guided Temporal Alignment

Future work
• Apply deterministic conditions to probabilistic diffusion.
• Less data and fewer parameters!

THANKS