StyleGAN vs StyleGAN2 vs StyleGAN2-ADA vs StyleGAN3
Originally posted on My Medium.
In this article, I will compare and show you the evolution of StyleGAN, StyleGAN2, StyleGAN2-ADA, and StyleGAN3.
Note: to keep this article short, I will skip some details and focus only on the architectural changes and their purposes.
StyleGAN
The purpose of StyleGAN is to synthesize photorealistic/high-fidelity images.
The architecture of the StyleGAN generator might look complicated at first glance, but it actually evolved from ProGAN (Progressive GAN) step by step.
Step 1: Mapping and Styles
- Instead of feeding the latent code \textbf{z} directly into the input layer, we feed it into a mapping network f to obtain a latent code \textbf{w}.
- We then replace the PixelNorm with AdaIN (responsible for styling).
Let’s zoom into the styling module and see how AdaIN styles the intermediate feature map \textbf{x}_i. Here, \text{A} stands for a learned affine transformation.
The learned affine transformation \text{A} will transform the latent code \textbf{w} to style \textbf{y}. Hence, the feature map \textbf{x}_i is normalized by \mu and \sigma, and then denormalized by the style \textbf{y}.
I have explained how and why this affects the style of the feature map/activation in my previous article; please read it if you want to understand this better.
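To make Step 1 concrete, here is a minimal AdaIN sketch in PyTorch (my own illustration, not the official StyleGAN code), where the style \textbf{y} = (y_s, y_b) would come from the affine transformation \text{A} applied to \textbf{w}:

```python
import torch

def adain(x, y_s, y_b, eps=1e-8):
    """Minimal AdaIN sketch: normalize each feature map of x per sample,
    then denormalize it with the per-channel style (y_s, y_b) = A(w).

    x:   feature maps, shape (N, C, H, W)
    y_s: per-channel style scale, shape (N, C)
    y_b: per-channel style bias, shape (N, C)
    """
    mu = x.mean(dim=(2, 3), keepdim=True)           # per-sample, per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps   # per-sample, per-channel std
    x_norm = (x - mu) / sigma                       # instance normalization
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]  # apply the style y
```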
Step 2: Constant Input
In a traditional GAN, we feed the latent code into the first layer of the synthesis network. In StyleGAN, we replace it with a learned constant tensor.
We make a surprising observation that the network no longer benefits from feeding the latent code into the first convolution layer. We therefore simplify the architecture by removing the traditional input layer and starting the image synthesis from a learned 4 × 4 × 512 constant tensor.
It is found that the synthesis network can produce meaningful results even though it receives input only through the styles that control the AdaIN operations.
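In code, this learned constant is just a trainable parameter; here is a minimal sketch (the class name is my own, with shapes matching the 4 × 4 × 512 tensor mentioned above):

```python
import torch
import torch.nn as nn

class ConstantInput(nn.Module):
    """Sketch of StyleGAN's learned constant input: the synthesis network
    starts from this tensor instead of from the latent code."""
    def __init__(self, channels=512, size=4):
        super().__init__()
        self.const = nn.Parameter(torch.randn(1, channels, size, size))

    def forward(self, batch_size):
        # The same learned tensor is repeated for every sample in the batch;
        # all per-image variation comes from the styles (and noise inputs).
        return self.const.repeat(batch_size, 1, 1, 1)
```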
Step 3: Noise Inputs
We then introduce explicit noise inputs for generating stochastic details (e.g. hairs, facial details). Here, \text{B} stands for a learned scaling factor.
The noise is broadcasted to all feature maps using learned per-feature scaling factors \text{B} and then added to the output of the corresponding convolution.
We can see that the noise affects only the stochastic aspects, leaving the overall composition and high-level aspects such as identity intact.
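A minimal sketch of the noise input \text{B} (again my own illustration, not the official code):

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Sketch of StyleGAN's noise input: a single-channel noise image is
    broadcast to all feature maps via learned per-channel scaling factors B
    and added to the convolution output."""
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))  # learned scaling factors B

    def forward(self, x, noise=None):
        if noise is None:
            # one noise value per spatial location, shared across channels
            noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.scale * noise
```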
Step 4: Style Mixing
By employing mixing regularization, we can mix the styles of different latent codes.
To be specific, we run two latent codes \textbf{z}_1, \textbf{z}_2 through the mapping network, and have the corresponding \textbf{w}_1, \textbf{w}_2 control the styles so that \textbf{w}_1 applies before the crossover point and \textbf{w}_2 after it.
Two sets of images were generated from their respective latent codes (sources \text{A} and \text{B}); the rest of the images were generated by copying a specified subset of styles from source \text{B} and taking the rest from source \text{A}.
- Coarse: copying the styles of coarse resolutions (4×4 to 8×8) brings high-level aspects such as pose, general hairstyle, face shape, and eyeglasses from \text{B}, while all colors (eyes, hair, lighting) and finer facial features resemble \text{A}.
- Middle: copying the styles of middle resolutions (16×16 to 32×32) brings smaller-scale facial features, hairstyle, and eyes open/closed from \text{B}, while the pose, general face shape, and eyeglasses from \text{A} are preserved.
- Fine: copying the fine styles (64×64 to 1024×1024) from \text{B} brings mainly the color scheme and microstructure.
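Conceptually, style mixing just decides, per layer, whether \textbf{w}_1 or \textbf{w}_2 supplies the style. A simplified sketch (assuming the synthesis network consumes one latent per layer; the function name is my own):

```python
import random
import torch

def mix_styles(w1, w2, num_layers):
    """Sketch of mixing regularization: use w1 for the layers before a random
    crossover point and w2 for the layers after it.

    w1, w2: latent codes from the mapping network, shape (N, w_dim)
    returns: per-layer latents, shape (N, num_layers, w_dim)
    """
    crossover = random.randint(1, num_layers - 1)
    w1 = w1.unsqueeze(1).repeat(1, num_layers, 1)
    w2 = w2.unsqueeze(1).repeat(1, num_layers, 1)
    before = torch.arange(num_layers, device=w1.device)[None, :, None] < crossover
    return torch.where(before, w1, w2)
```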
StyleGAN2
The main purpose of StyleGAN2 is to tackle the water-droplet artifacts that appeared in StyleGAN images.
Reason
The researchers pinpointed the problem to the AdaIN operation.
Before we talk about it, let’s break down the AdaIN operation into two parts: Normalization and Modulation.
It is found that when the Normalization step is removed from the generator, the droplet artifacts disappear completely.
The researchers then hypothesize that the droplet artifact is the result of the generator intentionally sneaking signal-strength information past the instance normalization step of AdaIN.
Hypothesis:
By creating a strong, localized spike that dominates the statistics, the generator can effectively scale the signal as it likes elsewhere.
Original StyleGAN Design
The above diagram is simply the architecture of the original StyleGAN generator. We redrew the diagram by breaking down the AdaIN operation into Norm mean/std and Mod mean/std (Normalization and Modulation).
In addition, we explicitly annotated the learned weights (w), biases (b), and the constant input (c) on the diagram. Also, we redrew the gray boxes so that only one style is active per box.
Changes #1
- We removed some redundant operations at the beginning.
- We moved the addition of b and \text{B} to be outside the active area of style (we observed that more predictable results are obtained by doing this).
- It is sufficient for the normalization and modulation to operate on the standard deviation alone (i.e. the mean is not needed).
Changes #2
- We combine the Mod std and Conv operations to be w_{ijk}' = s_i \cdot w_{ijk}.
- We change Norm std to become weight demodulation.
Combine Mod std and Conv
The reason we can combine the Mod std and the Conv is the following:
The modulation scales each input feature map of the convolution based on the incoming style, which can alternatively be implemented by scaling the convolution weights: w_{ijk}' = s_i \cdot w_{ijk}.
- w: original weights
- w': modulated weights
- s_i: the scale corresponding to the ith input feature map
- j, k: indices over the output feature maps and the spatial footprint of the convolution, respectively
Weight Demodulation
The purpose of instance normalization is to essentially remove the effect of s from the statistics of the convolution’s output feature maps (see the above figure).
If we assume the input activations are i.i.d. random variables with unit standard deviation, the standard deviation of the output activations (after modulation and convolution) will be:
\sigma_j = \sqrt{\sum_{i,k} {w_{ijk}'}^2}
Therefore, w_{ijk}' is normalized (demodulated) as follows:
w_{ijk}'' = w_{ijk}' \bigg/ \sqrt{\sum_{i,k} {w_{ijk}'}^2 + \epsilon}
All in all, the new Mod and Demod operations look like this:
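As a rough code sketch (my own simplification of the idea, not the official StyleGAN2 implementation), the modulation, convolution, and weight demodulation can be folded together like this:

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, style, demodulate=True, eps=1e-8):
    """Sketch of StyleGAN2's Mod + Conv + Demod folded into the weights.

    x:      input feature maps, shape (N, C_in, H, W)
    weight: convolution weights, shape (C_out, C_in, k, k)
    style:  per-sample scales s_i for each input feature map, shape (N, C_in)
    """
    N, C_in, H, W = x.shape
    C_out, _, k, _ = weight.shape

    # Modulation: w'_ijk = s_i * w_ijk (one modulated weight tensor per sample)
    w = weight[None] * style[:, None, :, None, None]            # (N, C_out, C_in, k, k)

    # Demodulation: w''_ijk = w'_ijk / sqrt(sum_{i,k} w'_ijk^2 + eps)
    if demodulate:
        d = torch.rsqrt((w ** 2).sum(dim=(2, 3, 4)) + eps)      # (N, C_out)
        w = w * d[:, :, None, None, None]

    # Grouped-convolution trick so each sample uses its own modulated weights
    x = x.reshape(1, N * C_in, H, W)
    w = w.reshape(N * C_out, C_in, k, k)
    out = F.conv2d(x, w, padding=k // 2, groups=N)
    return out.reshape(N, C_out, H, W)
```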
Changes #3
Instead of using progressive growing, StyleGAN2 explores skip-connection and residual (ResNet-style) designs to produce high-quality images.
Up and Down denote bilinear up and down-sampling respectively. tRGB and fRGB represent “to RGB” and “from RGB” respectively.
StyleGAN2 Results
As a result, replacing normalization with demodulation removes the characteristic artifacts from images and activations. Note that, compared to instance normalization, demodulation is weaker because it is based on statistical assumptions about the signal instead of the actual contents of the feature maps.
StyleGAN2-ADA
The purpose of StyleGAN2-ADA is to design a method to train GAN with limited data, where ADA stands for Adaptive Discriminator Augmentation.
Stochastic Discriminator Augmentation
\text{G} and \text{D} stand for the generator and discriminator respectively.
Here, p \in [0, 1] is the augmentation probability that controls the strength of the augmentations.
The discriminator \text{D} rarely sees a clean image because
- We have many augmentations in the pipeline
- The value of p will be set around 0.8
The generated images are augmented before being evaluated by the discriminator \text{D} during training. Because the augmentations are applied after generation (and the same augmentations are applied to the real images), the generator \text{G} is still guided to produce only clean images; a toy sketch of this pipeline follows the list below.
From the experiment, it is found that the following 3 types of augmentations are the most effective:
- pixel blitting (x-flips, 90° rotations, integer translation)
- geometric
- color transforms
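A toy sketch of the stochastic augmentation pipeline mentioned above, where every transform is applied independently with probability p (the transform names are placeholders, not the actual ADA operations):

```python
import random

def augment(image, p, transforms):
    """Sketch of stochastic discriminator augmentation: each transform in the
    pipeline is applied with probability p, so with a long pipeline and
    p around 0.8 the discriminator rarely sees a completely clean image."""
    for t in transforms:
        if random.random() < p:
            image = t(image)
    return image

# Hypothetical usage with placeholder transforms:
# augmented = augment(img, p=0.8, transforms=[x_flip, rotate90, translate, color_jitter])
```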
Adaptive Discriminator Augmentation
In the previous stochastic discriminator augmentation setup, we used the same value of p for all the transformations. However, we would like to avoid manual tuning of the augmentation strength p and instead control it dynamically based on the degree of overfitting.
Let’s denote \text{D}_\text{train}, \text{D}_\text{validation}, \text{D}_\text{generated}, and \text{D}_\text{real} as the discriminator outputs for the training set, validation set, generated images, and real images respectively.
The researchers designed the following heuristics to quantify overfitting.
Heuristic #1: r_v
Idea: Since the training set and the validation set are both drawn from the real images, \text{D}_\text{validation} should ideally stay close to \text{D}_\text{train}. However, when the discriminator overfits, \text{D}_\text{validation} (green) drifts toward \text{D}_\text{generated} (orange).
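In the paper, this idea is quantified as

r_v = \frac{\mathbb{E}[\text{D}_\text{train}] - \mathbb{E}[\text{D}_\text{validation}]}{\mathbb{E}[\text{D}_\text{train}] - \mathbb{E}[\text{D}_\text{generated}]}

which is close to 0 when there is no overfitting and approaches 1 as \text{D}_\text{validation} starts behaving like \text{D}_\text{generated}.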
The drawback of this heuristic is that it requires a validation set. Since we already have very little data, we don’t want to split the dataset further, so it is included mainly as a comparison method.
Heuristic #2: r_t
In training, we will use the second heuristic r_t.
Idea: \text{D}_\text{train} (i.e. the real images) and \text{D}_\text{generated} diverge symmetrically around zero as overfitting gets worse.
r_t estimates the portion of the training set that gets a positive discriminator output. If \text{D}_\text{train} drifts away from zero toward positive values, the discriminator is overfitting; otherwise, it is not.
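In the paper, this heuristic is written as

r_t = \mathbb{E}[\text{sign}(\text{D}_\text{train})]

i.e. the expected sign of the discriminator outputs on the training set, so r_t = 0 means no overfitting and r_t = 1 means complete overfitting.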
Adjust p Using r_t
- Initialize p = 0
- Adjust p every 4 mini-batches based on the following conditions (a code sketch follows this list):
- If r_t is high → overfitting → increase p (augment more)
- If r_t is low → not overfitting → decrease p (augment less)
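A minimal sketch of this feedback loop (the target value and step size here are illustrative, not the exact hyperparameters from the paper):

```python
def update_p(p, r_t, target=0.6, step=0.01):
    """Sketch of the ADA feedback loop: raise p when the overfitting
    heuristic r_t is above the target, lower it otherwise.
    `target` and `step` are illustrative, not the paper's exact settings."""
    if r_t > target:
        p = min(p + step, 1.0)   # overfitting -> augment more
    else:
        p = max(p - step, 0.0)   # not overfitting -> augment less
    return p

# Called once every 4 mini-batches during training:
# p = update_p(p, current_r_t)
```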
StyleGAN2-ADA Results
(a) Training curves for FFHQ with different training set sizes using adaptive augmentation.
(b) The supports of the real and generated images continue to overlap.
(c) Example magnitudes of the gradients the generator receives from the discriminator as the training progresses.
StyleGAN3
The remaining parts are the main points I summarized in my previous article, “StyleGAN3 Clearly Explained!”. If you want the details of StyleGAN3, please go directly to that article and continue reading there.
The purpose of StyleGAN3 is to tackle the “texture sticking” issue that appears in morphing transitions (e.g. morphing from one face to another face).
In other words, StyleGAN3 tries to make the transition animation more natural.
Texture Sticking
As you can see from the above animations, the beard and hair (left) look as if they are stuck to the screen during morphing, while the face generated by StyleGAN3 (right) does not have this sticky-pixel problem.
Reason: Positional References
It turns out that the intermediate layers have access to some unintentional positional references when processing feature maps, coming from the following sources:
- Image borders
- Per-pixel noise inputs
- Positional encodings
- Aliasing
These positional references make the network generate pixels that stick to the same coordinates.
Among them, aliasing is the hardest one to identify and fix.
Aliasing
Aliasing is an effect that causes different signals to become indistinguishable (or aliases of one another) when sampled.
It also often refers to the distortion or artifact that results when a signal reconstructed from samples is different from the original continuous signal.
E.g. When we sample a high-frequency signal (e.g. the blue sine wave), if it results in a lower frequency signal (the red wave), then this is called aliasing. This happens because the sampling rate is too low.
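Here is a tiny numerical demo of that example (a toy NumPy script, unrelated to the actual StyleGAN3 code):

```python
import numpy as np

fs = 10        # sampling rate in Hz
f_true = 9     # true signal frequency in Hz (above the Nyquist limit of 5 Hz)
t = np.arange(0, 2, 1 / fs)                    # 2 seconds of sample times

samples = np.sin(2 * np.pi * f_true * t)       # sample the 9 Hz sine wave
alias = np.sin(2 * np.pi * (f_true - fs) * t)  # a (9 - 10) = -1 Hz sine wave

# The two signals are indistinguishable from these samples alone:
print(np.allclose(samples, alias))             # True
```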
The researchers have identified two sources for aliasing in GAN:
- Faint after-images of the pixel grid resulting from non-ideal upsampling filters (e.g. nearest-neighbor, bilinear, or strided convolutions)
- The pointwise application of nonlinearities (e.g. ReLU, or swish)
Even a small amount of aliasing gets amplified throughout the network and ends up fixed to screen coordinates.
Goal
The goal is to eliminate all sources of positional references. After that, the details of the images can be generated equally well regardless of pixel coordinates.
In order to remove positional references, the paper proposed to make the network equivariant:
Consider an operation f (e.g. convolution, upsampling, ReLU, etc.) and a spatial transformation t (e.g. translation, rotation).
f is equivariant with respect to t if t \circ f = f \circ t.
In other words, an operation (e.g. ReLU) should not introduce any positional references; applying the operation before or after the spatial transformation must give the same result.
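As a toy example, ReLU commutes with an integer translation of a feature map, which is exactly the t \circ f = f \circ t condition (the hard part StyleGAN3 addresses is the continuous, sub-pixel case, where aliasing breaks this equivariance):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def translate(x, shift):
    # integer translation of a 2D feature map (circular shift for simplicity)
    return np.roll(x, shift, axis=(0, 1))

x = np.random.randn(8, 8)
lhs = translate(relu(x), shift=3)   # t ∘ f
rhs = relu(translate(x, shift=3))   # f ∘ t
print(np.allclose(lhs, rhs))        # True: ReLU is equivariant to integer translation
```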
Redesigning Network Layers
Practical neural networks operate on discretely sampled feature maps. It is found that the operations need to be treated in the continuous domain to effectively suppress aliasing. Therefore, the network layers (i.e. convolution, upsampling/downsampling, nonlinearity) have to be redesigned.
The details of how to redesign the network layers are too long to include here; I have explained them in “StyleGAN3 Clearly Explained!”. If you haven’t read it, never mind: all you need to know is that by redesigning the network layers to operate on continuous feature maps (continuous signals), aliasing can be suppressed.
Changes
- Replaced the learned constant input of StyleGAN2 with Fourier features to facilitate exact continuous translation and rotation of the input feature map.
- Removed the per-pixel noise inputs to eliminate positional references introduced by them.
- Decreased the mapping network depth to simplify the setup.
- Eliminated the output skip connections (their benefit was mostly related to gradient magnitudes during training; this is addressed instead with a simple normalization before each convolution, dividing by an exponential moving average (EMA) of the feature map magnitudes).
- Maintained a fixed-size margin around the target canvas, cropping to this extended canvas after each layer (to prevent the image borders from leaking absolute image coordinates into the internal representations, because the signal just outside the canvas also matters).
- Replaced the bilinear 2× upsampling filter with a better approximation of the ideal low-pass filter.
- Motivated by the theoretical model, wrapped the leaky ReLU between m× upsampling and m× downsampling (the previous 2× upsampling can be fused with this m× upsampling, i.e. with m = 2 there is a 4× upsampling before each leaky ReLU); see the sketch below.
Note: the network layers (i.e. convolution, up/downsampling, ReLU) are replaced with the corresponding redesigned layers.
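As a rough sketch of the wrapped nonlinearity (using bilinear resampling as a crude stand-in for the paper's carefully designed low-pass filters), the structure looks like this:

```python
import torch
import torch.nn.functional as F

def filtered_leaky_relu(x, m=2, negative_slope=0.2):
    """Sketch of wrapping the nonlinearity between m× up- and downsampling.
    StyleGAN3 uses carefully designed low-pass filters here; bilinear
    interpolation is only a crude stand-in to show the structure."""
    h, w = x.shape[2], x.shape[3]
    x = F.interpolate(x, scale_factor=m, mode='bilinear', align_corners=False)  # m× upsample
    x = F.leaky_relu(x, negative_slope)                                         # pointwise nonlinearity at higher resolution
    x = F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)     # downsample back
    return x
```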
StyleGAN3 Results
The alias-free translation (middle) and rotation (bottom) equivariant networks build the image in a radically different manner, out of what appear to be multi-scale phase signals that follow the features seen in the final image.
In the internal representations, it looks like a new coordinate system is being invented and details are drawn on these surfaces.
Summary
The purpose of each model:
- StyleGAN: to generate high-fidelity images.
- StyleGAN2: to remove water-droplet artifacts in StyleGAN.
- StyleGAN2-ADA: to train StyleGAN2 with limited data.
- StyleGAN3: to make transition animation more natural.
References
[1] T. Karras, S. Laine, and T. Aila, “A Style-Based Generator Architecture for Generative Adversarial Networks”, arXiv:1812.04948, 2018. https://arxiv.org/abs/1812.04948
[2] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and Improving the Image Quality of StyleGAN”, arXiv:1912.04958, 2019. https://arxiv.org/abs/1912.04958
[3] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, “Training Generative Adversarial Networks with Limited Data”, arXiv:2006.06676, 2020. https://arxiv.org/abs/2006.06676
[4] T. Karras et al., “Alias-Free Generative Adversarial Networks”, arXiv:2106.12423, 2021. https://arxiv.org/abs/2106.12423
[5] “Aliasing”, Wikipedia. https://en.wikipedia.org/wiki/Aliasing