DiagonalGAN — Content-Style Disentanglement in StyleGAN Explained!

Paper Explained: Diagonal Attention and Style-based GAN for Content-Style Disentanglement in Image Generation and Translation

Content-style manipulation of a face

Originally posted on My Medium.

Introduction

Content-style disentanglement has long been an important task in image generation.

It aims to separate the content and the style of a generated image within the latent space learned by a GAN.

  • Content: spatial information (e.g. face direction, expression)
  • Style: other features (e.g. color, strokes, makeup, gender)

This DiagonalGAN paper proposes a method to disentangle the content and style in StyleGAN.

StyleGAN Problem

Style and content in StyleGAN

StyleGAN tries to disentangle style and content by injecting the style through the AdaIN operation and the content through per-pixel noise inputs.

However, the content controlled by per-pixel noises is mostly for minor spatial variations. Therefore, the disentanglement of global content and style is by no means complete.

DiagonalGAN

DiagonalGAN architecture

This paper proposes to:

  • replace the per-pixel noises with a content code mapping module similar to the style code mapping module.
  • use the DAT (Diagonal Attention) layer to control the content.

Theory

It is found that the normalization and spatial attention modules have similar structures that can be exploited for style and content disentanglement.

Specifically, the AdaIN and attention formulas can be written in matrix forms.

AdaIN

AdaIN can be represented as Y=XT+R:

Matrix representation of AdaIN

Here, I use C=3 (RGB) for a clearer illustration of the idea. In intermediate activations, C usually equals the number of filters used in the convolution layer.
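
To make the matrix form concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code): the feature map is reshaped to HW×C, T is a diagonal matrix of per-channel style scales, R is a broadcast per-channel style bias, and the result of Y = XT + R matches the usual per-channel AdaIN scale-and-shift.

```python
import torch

# Minimal sketch of AdaIN in matrix form: Y = X T + R (illustration only).
H, W, C = 4, 4, 3                      # C = 3 (RGB) as in the illustration above
X = torch.randn(H * W, C)              # feature map reshaped to (HW, C)

gamma = torch.randn(C)                 # per-channel style scale (from the style code)
beta = torch.randn(C)                  # per-channel style bias  (from the style code)

# Instance-normalize each channel first (the normalization part of AdaIN).
X_norm = (X - X.mean(dim=0)) / (X.std(dim=0) + 1e-8)

T = torch.diag(gamma)                  # (C, C) diagonal scaling matrix
R = beta.expand(H * W, C)              # (HW, C) bias, identical for every pixel

Y_matrix = X_norm @ T + R              # matrix form:  Y = X T + R
Y_adain = X_norm * gamma + beta        # usual AdaIN:  per-channel scale and shift

print(torch.allclose(Y_matrix, Y_adain))  # True: the two forms coincide
```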

Attention

Spatial attention (e.g. self-attention) can be represented as Y=AX, where A is a fully populated matrix of size HW×HW.

Since the transformation matrix A acts along the pixel (spatial) dimension and manipulates the feature values at specific locations, it can control spatial information such as shape and location.
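
Below is a small sketch of this view of spatial attention (the projection names Wq and Wk are my own illustrative choices, not the paper's design): a dense HW×HW matrix A is built from the features and applied along the pixel dimension, mixing information across locations while leaving the channel dimension untouched.

```python
import torch
import torch.nn.functional as F

# Sketch: spatial (self-)attention as Y = A X with a dense (HW x HW) matrix A.
H, W, C = 8, 8, 16
X = torch.randn(H * W, C)                         # features, one row per pixel

# Query/key projections (illustrative dimensions only).
Wq = torch.randn(C, C // 2)
Wk = torch.randn(C, C // 2)

# Fully populated (HW, HW) attention matrix.
A = F.softmax((X @ Wq) @ (X @ Wk).T / (C // 2) ** 0.5, dim=-1)

Y = A @ X                                          # each output pixel is a weighted
                                                   # mix over all spatial locations
print(A.shape, Y.shape)                            # torch.Size([64, 64]) torch.Size([64, 16])
```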

Diagonal Attention

We can mix the attention and the AdaIN by writing Y=AXT+R, where A controls the content and T controls the style of the image.

Matrix representation of diagonal attention + AdaIN
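
A quick sketch of the combined form (again illustrative, not the official implementation): A mixes the rows (pixels) to move content, while T and R rescale and shift the columns (channels) to set style.

```python
import torch

# Sketch of the combined operation Y = A X T + R (illustrative values only).
H, W, C = 8, 8, 16
X = torch.randn(H * W, C)                               # features, one row per pixel

A = torch.softmax(torch.randn(H * W, H * W), dim=-1)    # spatial attention: mixes rows (content)
gamma, beta = torch.randn(C), torch.randn(C)
T = torch.diag(gamma)                                   # style scaling: acts on channels
R = beta.expand(H * W, C)                               # style bias, identical for every pixel

Y = A @ X @ T + R                                       # content via A, style via T and R
print(Y.shape)                                          # torch.Size([64, 16])
```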

Comparison to StyleGAN

In StyleGAN, the content code C generated from the per-pixel noises via the scaling network \text{B} is added to feature X before entering the AdaIN layer (i.e. Y=(X+C)T+R).

By expanding it, we get Y = (X+C)T + R = XT + R + CT, i.e. the original AdaIN formula plus an additional bias CT. This is different from spatial attention, which acts multiplicatively on the feature X.

Although one could potentially generate C so that the net effect is similar to AX, this would require a complicated content code generation network. This explains the fundamental limitation of content control in the original StyleGAN.

DAT Layer

Details of the DAT layer
A=(I+\beta\ \text{diag}(d))
  • d: a differential attention map of dimension H×W (flattened, so that \text{diag}(d) is HW×HW), generated from the content code.
  • \beta: used to calibrate whether the layer is responsible for minor or major changes, so as to prevent overemphasizing minor changes.
  • \text{diag}(d): denotes the diagonal matrix whose diagonal elements are d.

The diagonal spatial attention is multiplicative (i.e. A multiplies X) so that we can control global spatial variations.
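
Here is a minimal PyTorch sketch of such a DAT layer; the network sizes and names (attention_net, content_dim) are assumptions for illustration, not the official implementation. Because A = I + \beta\,\text{diag}(d) is diagonal, applying it reduces to a per-pixel scaling of the features, shared across channels.

```python
import torch
import torch.nn as nn

class DATLayer(nn.Module):
    """Minimal sketch of a Diagonal Attention (DAT) layer.
    Names and sizes are assumptions, not the official implementation."""
    def __init__(self, content_dim: int, height: int, width: int, beta: float = 1.0):
        super().__init__()
        self.height, self.width = height, width
        self.beta = beta                               # calibrates whether this layer makes
                                                       # minor or major content changes
        # Maps the content code to a per-pixel attention map d of size H*W.
        self.attention_net = nn.Sequential(
            nn.Linear(content_dim, 256), nn.ReLU(),
            nn.Linear(256, height * width), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, content_code: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), content_code: (B, content_dim)
        d = self.attention_net(content_code)           # (B, H*W)
        d = d.view(-1, 1, self.height, self.width)     # (B, 1, H, W)
        # A = I + beta * diag(d) along the spatial axis is an element-wise,
        # per-pixel scaling shared across all channels:
        return (1.0 + self.beta * d) * x

# Usage sketch
layer = DATLayer(content_dim=512, height=8, width=8)
x = torch.randn(2, 64, 8, 8)
z_c = torch.randn(2, 512)
print(layer(x, z_c).shape)                             # torch.Size([2, 64, 8, 8])
```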

Loss Function

L_{total} = L_{adv} + L_{ds}
  • L_{adv}: non-saturating adversarial loss with R_1 regularization.
  • L_{ds}: Diversity-Sensitive loss which encourages the attention network to yield diverse maps.
L_{ds} = \max (\lambda - \| G(z_s, z_c^1) - G(z_s, z_c^2) \|_1, 0)

The objective of L_{ds} is to maximize the L_1 distance between the generated images from different content codes z_c^1, z_c^2 with the same style code z_s (i.e. make them as diverse as possible).

The threshold \lambda is used to avoid the explosion of the loss value.
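
A small sketch of this loss term, assuming a generator that takes a style code and a content code (the signature and the per-pixel mean L1 distance are my assumptions, not the official training code):

```python
import torch

def diversity_sensitive_loss(generator, z_s, z_c1, z_c2, lam: float = 1.0):
    """Sketch of L_ds = max(lambda - ||G(z_s, z_c1) - G(z_s, z_c2)||_1, 0)."""
    img1 = generator(z_s, z_c1)            # same style, first content code
    img2 = generator(z_s, z_c2)            # same style, second content code
    l1 = (img1 - img2).abs().mean()        # per-pixel mean L1 distance between the images
    # Minimizing this pushes the distance up until it reaches the threshold `lam`,
    # which keeps the otherwise unbounded term from exploding.
    return torch.clamp(lam - l1, min=0.0)

# Usage sketch with a stand-in generator
G = lambda z_s, z_c: z_s.view(1, 1, 4, 4) + z_c.view(1, 1, 4, 4)
z_s = torch.randn(16)
loss = diversity_sensitive_loss(G, z_s, torch.randn(16), torch.randn(16))
print(loss.item())
```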

Results

Example results of fixed style and different content codes

The images in the second row are generated by changing the content codes at all layers while keeping the style code fixed.

The images in the following rows are sampled by changing the content codes only at specific layers while fixing those of the other layers.

You can see that the overall attributes of the face (style) do not change, while the face direction, smiling, etc. (content) change as we adjust the content codes in a hierarchical manner.

Example results of fixed/different content codes and style codes
  • (a): source image generated from arbitrary content and style codes.
  • (b): varying style codes with fixed content codes.
  • (c): varying content codes with fixed style codes.
  • (d): varying both codes.

Direct Attention Map Manipulation

Direct attention map manipulation

By controlling the specific areas of attention, we can selectively change the facial attributes.

  • (a): 1st 4×4 attention map: control face direction by changing the activated regions.
  • (b): 2nd 8×8 attention map: control mouth expression with high values on larger mouth areas.
  • (c): 2nd 16×16 attention map: control the size of eyes by changing activated pixel areas of the eyes.

Code

The official implementation by the authors is available on GitHub [2]: https://github.com/cyclomon/DiagonalGAN

References

[1] G. Kwon and J. Ye, “Diagonal Attention and Style-based GAN for Content-Style Disentanglement in Image Generation and Translation”, arXiv.org, 2022. https://arxiv.org/abs/2103.16146

[2] G. Kwon and J. Ye, “DiagonalGAN”, Github, 2022. https://github.com/cyclomon/DiagonalGAN
