Stable Diffusion Clearly Explained!

How does Stable Diffusion paint an AI artwork? Understanding the tech behind the rise of AI-generated art.

An image generated using Stable Diffusion

Originally posted on my Medium.

Most of the recent AI art found on the internet is generated with the Stable Diffusion model. Since it is open source, anyone can easily create fantastic art illustrations from just a text prompt.

In this article, I’m going to explain how it works.

Diffusion Model

If you want to understand the full details of the Diffusion Model, you can check out my previous article.

Here I will walk you through the rough idea of the Diffusion Model.

Overview of the Diffusion Model

The training of the Diffusion Model can be divided into two parts:

  1. Forward Diffusion Process → add noise to the image.
  2. Reverse Diffusion Process → remove noise from the image.

Forward Diffusion Process

The forward diffusion process adds Gaussian noise to the input image step by step. However, instead of iterating through every step, we can jump directly to the noisy image at any time step t using the following closed-form formula:

x_t = \sqrt{\bar{\alpha}_t}\ x_0 + \sqrt{1-\bar{\alpha}_t}\ \varepsilon

The closed-form formula
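
To make this concrete, here is a minimal PyTorch sketch of the closed-form formula. The linear beta schedule and its endpoints are illustrative assumptions (borrowed from the DDPM paper), not something the formula itself dictates.

```python
import torch

# Assumed linear noise schedule (as in DDPM); T and the beta range are illustrative choices.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # alpha_bar_t = product of alphas up to step t

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps (the closed-form formula)."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

# Usage: noise a batch of "images" at random time steps in a single shot.
x0 = torch.randn(8, 3, 64, 64)   # stand-in for a batch of real images
t = torch.randint(0, T, (8,))    # one random time step per image
eps = torch.randn_like(x0)       # the Gaussian noise to be added
xt = q_sample(x0, t, eps)
```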

Reverse Diffusion Process

Since the reverse diffusion process is not directly computable, we train a neural network \boldsymbol{\varepsilon}_\theta to approximate it.

The training objective (loss function) is as follows:

L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\varepsilon}\left[\left\|\varepsilon - \boldsymbol{\varepsilon}_\theta(x_t, t)\right\|^2\right]

The simplified training objective

Training

In each epoch:

  1. A random time step t is selected for each training sample (image).
  2. Gaussian noise (corresponding to t) is applied to each image using the closed-form formula.
  3. The time steps are converted into embeddings (vectors).
  4. The U-Net is trained to predict the applied noise from the noisy image and the time embedding, using the simplified objective above (see the sketch after the illustrations below).

Dataset for training

Training step illustration
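
Putting these steps together, one training step of the pure diffusion model might look like the sketch below. It reuses `q_sample` and the schedule from the earlier sketch; `unet` stands for a hypothetical noise-prediction network ε_θ(x_t, t) that converts t into a time embedding internally.

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x0):
    """One training step of the pure diffusion model (sketch)."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # 1. random time step per image
    eps = torch.randn_like(x0)
    xt = q_sample(x0, t, eps)                                  # 2. apply noise for step t
    eps_pred = unet(xt, t)                                     # 3. t is embedded inside the U-Net
    loss = F.mse_loss(eps_pred, eps)                           # 4. simplified objective ||eps - eps_theta||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```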

Sampling

Sampling means painting an image from Gaussian noise. The following diagram shows how we can use the trained U-Net to generate an image:

Sampling illustration
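
In code, sampling is the training process run in reverse: start from pure noise x_T and repeatedly subtract the predicted noise. Below is a minimal DDPM-style loop, again reusing the schedule defined earlier; the σ_t = sqrt(β_t) noise choice is one common variant, not the only one.

```python
import torch

@torch.no_grad()
def sample(unet, shape=(1, 3, 64, 64), device="cpu"):
    """Generate an image from pure Gaussian noise (DDPM sampling, sketch)."""
    x = torch.randn(shape, device=device)  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_pred = unet(x, tt)             # predict the noise present at step t
        alpha, a_bar = alphas[t], alpha_bars[t]
        # Posterior mean: remove the predicted noise contribution.
        x = (x - (1.0 - alpha) / (1.0 - a_bar).sqrt() * eps_pred) / alpha.sqrt()
        if t > 0:                          # add fresh noise at every step except the last
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```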

Diffusion Speed Problem

As you can see, the denoising (sampling) process iteratively feeds a full-sized image through the U-Net to get the final result. This makes the pure Diffusion model extremely slow when the total number of diffusion steps T and the image size are large.

Stable Diffusion is designed to tackle exactly this problem.


Stable Diffusion

The original name of Stable Diffusion is “Latent Diffusion Model” (LDM) [1]. As the name suggests, the diffusion process happens in the latent space. This is what makes it faster than a pure Diffusion model.

Departure to Latent Space

Autoencoder

We will first train an autoencoder to learn to compress the image data into lower-dimensional representations.

  • By using the trained encoder E, we can encode the full-sized image into lower-dimensional latent data (compressed data).
  • By using the trained decoder D, we can decode the latent data back into an image.
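
For example, the pretrained autoencoder used by Stable Diffusion v1 can be loaded through Hugging Face's diffusers library. A sketch, assuming the diffusers package and the runwayml/stable-diffusion-v1-5 checkpoint are available:

```python
import torch
from diffusers import AutoencoderKL

# Load the VAE that ships with Stable Diffusion v1.5 (checkpoint name is one common choice).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for an RGB image normalized to [-1, 1]

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # E: (1, 3, 512, 512) -> (1, 4, 64, 64)
    reconstruction = vae.decode(latents).sample       # D: back to (1, 3, 512, 512)
```

Each 512×512 image becomes a 4×64×64 latent, a 48× reduction in the number of values the diffusion model has to process.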

Latent Diffusion

After encoding the images into latent data, the forward and reverse diffusion processes will be done in the latent space.

Overview of the Stable Diffusion model

  1. Forward Diffusion Process → add noise to the latent data.
  2. Reverse Diffusion Process → remove noise from the latent data.

Conditioning

Overview of the conditioning mechanism

The true power of the Stable Diffusion model is that it can generate images from text prompts. This is done by modifying the inner diffusion model to accept conditioning inputs.

Conditioning mechanism details

The inner diffusion model is turned into a conditional image generator by augmenting its denoising U-Net with the cross-attention mechanism.

The switch in the above diagram selects between the different types of conditioning inputs:

  • Text inputs are first converted into embeddings (vectors) by a language model \tau_\theta (e.g. BERT, CLIP), and then mapped into the U-Net through the (multi-head) \text{Attention}(Q, K, V) layer (see the sketch below).
  • For other spatially aligned inputs (e.g. semantic maps, images, inpainting), the conditioning can be done using concatenation.
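
The sketch below shows what such a cross-attention layer could look like: the query comes from the (flattened) image latents, while the key and value come from the text embeddings. All dimensions are illustrative; the real Stable Diffusion U-Net uses multiple heads and several of these layers.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention (sketch): Q from image latents, K/V from text embeddings."""

    def __init__(self, latent_dim: int, text_dim: int, attn_dim: int = 64):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, attn_dim, bias=False)
        self.to_k = nn.Linear(text_dim, attn_dim, bias=False)
        self.to_v = nn.Linear(text_dim, attn_dim, bias=False)
        self.to_out = nn.Linear(attn_dim, latent_dim)

    def forward(self, latents, text_emb):
        q = self.to_q(latents)    # (B, H*W, attn_dim) -- queries from the image side
        k = self.to_k(text_emb)   # (B, tokens, attn_dim) -- keys from the text side
        v = self.to_v(text_emb)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        attn = scores.softmax(dim=-1)   # Attention(Q, K, V)
        return self.to_out(attn @ v)    # text information injected into the latents

# Usage: a 64x64 latent map flattened to a sequence, conditioned on 77 text tokens.
layer = CrossAttention(latent_dim=320, text_dim=768)
out = layer(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))
```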

Training

L_{\text{LDM}} = \mathbb{E}_{\mathcal{E}(x),\,y,\,\varepsilon,\,t}\left[\left\|\varepsilon - \boldsymbol{\varepsilon}_\theta(z_t, t, \tau_\theta(y))\right\|^2\right]

Training objective for the Stable Diffusion model

The training objective (loss function) is very similar to the one in the pure diffusion model. The only changes are:

  • The input is the latent data z_t instead of the image x_t.
  • The conditioning input \tau_\theta(y) is fed into the U-Net (see the sketch after this list).
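
Relative to the earlier training sketch, the change is small. In the sketch below (reusing `q_sample` and T from before), `vae_encoder` and `text_encoder` are hypothetical callables standing in for E and \tau_\theta:

```python
import torch
import torch.nn.functional as F

def ldm_train_step(unet, vae_encoder, text_encoder, optimizer, images, prompts):
    """One Latent Diffusion training step (sketch)."""
    with torch.no_grad():
        z0 = vae_encoder(images)      # z_0 = E(x): move to the latent space
        cond = text_encoder(prompts)  # tau_theta(y): text embeddings
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    zt = q_sample(z0, t, eps)                  # forward diffusion on the latents
    loss = F.mse_loss(unet(zt, t, cond), eps)  # ||eps - eps_theta(z_t, t, tau_theta(y))||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```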

Sampling

Stable Diffusion sampling process (denoising)

Since the latent data is much smaller than the original images, the denoising process is much faster.
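
In practice, you rarely write the sampling loop by hand. Here is a minimal sketch using Hugging Face's diffusers library; the checkpoint name, step count, and fp16/CUDA settings are common choices, not requirements:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Text encoding -> iterative denoising in latent space -> VAE decode, all in one call.
image = pipe("a watercolor painting of a fox in a forest", num_inference_steps=50).images[0]
image.save("fox.png")
```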

Architecture Comparison

Finally, let’s compare the overall architectures of the pure diffusion model and the stable diffusion model (latent diffusion model).

Pure Diffusion Model

Pure diffusion model architecture

Stable Diffusion (Latent Diffusion Model)

Stable Diffusion architecture

Summary

To quickly summarize:

  • Stable Diffusion (Latent Diffusion Model) conducts the diffusion process in the latent space, and thus it is much faster than a pure diffusion model.
  • The backbone diffusion model is modified to accept conditioning inputs such as text, images, semantic maps, etc.

References

[1] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with Latent Diffusion Models,” arXiv.org, 13-Apr-2022. [Online]. Available: https://arxiv.org/abs/2112.10752.

[2] J. Alammar, “The Illustrated Stable Diffusion,” The Illustrated Stable Diffusion — Jay Alammar — Visualizing machine learning one concept at a time. [Online]. Available: https://jalammar.github.io/illustrated-stable-diffusion/.

[3] A. Gordić, “Stable diffusion: High-resolution image synthesis with latent diffusion models | ML coding series,” YouTube, 01-Sep-2022. [Online]. Available: https://www.youtube.com/watch?v=f6PtJKdey8E.

Steins

Developer & AI Researcher. Writes about AI and web dev/hacking.