
[English original] Stable Diffusion 3 Technical Report

Information Technology · 2024-03-13 · Stability AI
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser*, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach*

Stability AI

*Equal contribution. <first.last>@stability.ai

Figure 1. High-resolution samples from our 8B rectified flow model, showcasing its capabilities in typography, precise prompt following and spatial reasoning, attention to fine details, and high image quality across a wide variety of styles.

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations.
Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

1. Introduction

Diffusion models create data from noise (Song et al., 2020). They are trained to invert forward paths of data towards random noise and, thus, in conjunction with approximation and generalization properties of neural networks, can be used to generate new data points that are not present in the training data but follow the distribution of the training data (Sohl-Dickstein et al., 2015; Song & Ermon, 2020). This generative modeling technique has proven to be very effective for modeling high-dimensional, perceptual data such as images (Ho et al., 2020). In recent years, diffusion models have become the de-facto approach for generating high-resolution images and videos from natural language inputs with impressive generalization capabilities (Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022; Podell et al., 2023; Dai et al., 2023; Esser et al., 2023; Blattmann et al., 2023b; Betker et al., 2023; Blattmann et al., 2023a; Singer et al., 2022). Due to their iterative nature and the associated computational costs, as well as the long sampling times during inference, research on formulations for more efficient training and/or faster sampling of these models has increased (Karras et al., 2023; Liu et al., 2022).

While specifying a forward path from data to noise leads to efficient training, it also raises the question of which path to choose. This choice can have important implications for sampling. For example, a forward process that fails to remove all noise from the data can lead to a discrepancy in training and test distribution and result in artifacts such as gray image samples (Lin et al., 2024). Importantly, the choice of the forward process also influences the learned backward process and, thus, the sampling efficiency.
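The train/test discrepancy from incomplete noising can be illustrated numerically. The sketch below uses a standard variance-preserving forward process, x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε; the function names and the specific schedule value are our own illustrative choices, not from the paper. If the terminal ᾱ_T is not exactly zero, the "pure noise" a sampler starts from at test time still carries a residual trace of the training data:

```python
import numpy as np

def vp_forward(x0: np.ndarray, alpha_bar_t: float, eps: np.ndarray) -> np.ndarray:
    """Variance-preserving forward step: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)   # stand-in for data
eps = rng.normal(size=1000)  # Gaussian noise

# A schedule whose terminal alpha_bar is small but nonzero (common in practice)
# never fully destroys the data at the last step:
alpha_bar_T = 1e-3
x_T = vp_forward(x0, alpha_bar_T, eps)
residual = np.sqrt(alpha_bar_T)  # weight still placed on the data at t = T
print(f"residual data weight at t=T: {residual:.4f}")
```

Because the sampler is initialized from exact Gaussian noise at test time while training always saw this slightly data-contaminated terminal distribution, the mismatch manifests as artifacts like the gray samples reported by Lin et al. (2024).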
While curved paths require many integration steps to simulate the process, a straight path could be simulated with a single step and is less prone to error accumulation. Since each step corresponds to an evaluation of the neural network, this has a direct impact on the sampling speed.

A particular choice for the forward path is a so-called Rectified Flow (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023), which connects data and noise on a straight line. Although this model class has better theoretical properties, it has not yet become decisively established in practice. So far, some advantages have been empirically demonstrated in small and medium-sized experiments (Ma et al., 2024), but these are mostly limited to class-conditional models. In this work, we change this by introducing a re-weighting of the noise scales in rectified flow models, similar to noise-predictive diffusion models (Ho et al., 2020). Through a large-scale study, we compare our new formulation to existing diffusion formulations and demonstrate its benefits.

We show that the widely used approach for text-to-image synthesis, where a fixed text representation
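The straight-line property behind the single-step claim above can be sketched in a few lines (a minimal illustration in our own notation; the paper's exact parameterization may differ): the rectified-flow interpolant is z_t = (1 − t)·x0 + t·ε, its time derivative is the constant velocity ε − x0, and integrating that velocity backwards from pure noise with one Euler step of size 1 recovers the data exactly when the path is straight.

```python
import numpy as np

def interpolant(x0, eps, t):
    """Rectified-flow style straight path: z_t = (1 - t) * x0 + t * eps."""
    return (1.0 - t) * x0 + t * eps

def velocity(x0, eps):
    """Time derivative of the straight path; constant in t."""
    return eps - x0

rng = np.random.default_rng(1)
x0 = rng.normal(size=4)   # "data"
eps = rng.normal(size=4)  # "noise"

# Start from pure noise (t = 1) and take a single Euler step of size 1
# along the (here exactly known) velocity field:
z1 = interpolant(x0, eps, t=1.0)      # equals eps
x_hat = z1 - 1.0 * velocity(x0, eps)  # one-step integration back to t = 0

assert np.allclose(x_hat, x0)  # a perfectly straight path needs only one step
```

In practice a learned velocity field only approximates this straight path, so several steps are still used; the point is that straighter paths accumulate less discretization error per network evaluation than curved ones.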