DiagonalGAN — Content-Style Disentanglement in StyleGAN Explained!

Paper Explained: Diagonal Attention and Style-based GAN for Content-Style Disentanglement in Image Generation and Translation

Content-style manipulation of a face

Originally posted on My Medium.

Introduction

Content-style disentanglement has long been an important task in image generation.

It aims to separate the content and the style of a generated image within the latent space learned by a GAN.

  • Content: spatial information (e.g. face direction, expression)
  • Style: other features (e.g. color, strokes, makeup, gender)

This DiagonalGAN paper proposes a method to disentangle the content and style in StyleGAN.

StyleGAN Problem

Style and content in StyleGAN

StyleGAN tries to disentangle style and content by injecting the style through the AdaIN operation and the content through per-pixel noise inputs.

However, the content controlled by per-pixel noises is mostly for minor spatial variations. Therefore, the disentanglement of global content and style is by no means complete.

DiagonalGAN

DiagonalGAN architecture

This paper proposes to:

  • replace the per-pixel noises with a content code mapping module similar to the style code mapping module.
  • use the DAT (Diagonal Attention) layer to control the content.

Theory

It is found that the normalization and spatial attention modules have similar structures that can be exploited for style and content disentanglement.

Specifically, the AdaIN and attention formulas can be written in matrix forms.

AdaIN

AdaIN can be represented as Y=XT+R:

Matrix representation of AdaIN

Here, I use C=3 (RGB) for a clearer illustration of the idea. In intermediate activations, C usually equals the number of filters used in the convolution layer.
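
To make the matrix form concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code): the feature map is reshaped to HW×C, T is a diagonal matrix of per-channel style scales, R is a broadcast per-channel style bias, and the result of Y = XT + R matches the usual per-channel AdaIN scale-and-shift.

```python
import torch

# Minimal sketch of AdaIN in matrix form: Y = X T + R (illustration only).
H, W, C = 4, 4, 3                      # C = 3 (RGB) as in the illustration above
X = torch.randn(H * W, C)              # feature map reshaped to (HW, C)

gamma = torch.randn(C)                 # per-channel style scale (from the style code)
beta = torch.randn(C)                  # per-channel style bias  (from the style code)

# Instance-normalize each channel first (the normalization part of AdaIN).
X_norm = (X - X.mean(dim=0)) / (X.std(dim=0) + 1e-8)

T = torch.diag(gamma)                  # (C, C) diagonal scaling matrix
R = beta.expand(H * W, C)              # (HW, C) bias, identical for every pixel

Y_matrix = X_norm @ T + R              # matrix form:  Y = X T + R
Y_adain = X_norm * gamma + beta        # usual AdaIN:  per-channel scale and shift

print(torch.allclose(Y_matrix, Y_adain))  # True: the two forms coincide
```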

Attention

Spatial attention (e.g. self-attention) can be represented as Y=AX, where A is a fully populated matrix of size HW×HW.

Since the transformation matrix A acts along the pixel (spatial) dimension and manipulates the feature values at specific locations, it can control spatial information such as shape and location.
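
Below is a small sketch of this view of spatial attention (the projection names Wq and Wk are my own illustrative choices, not the paper's design): a dense HW×HW matrix A is built from the features and applied along the pixel dimension, mixing information across locations while leaving the channel dimension untouched.

```python
import torch
import torch.nn.functional as F

# Sketch: spatial (self-)attention as Y = A X with a dense (HW x HW) matrix A.
H, W, C = 8, 8, 16
X = torch.randn(H * W, C)                         # features, one row per pixel

# Query/key projections (illustrative dimensions only).
Wq = torch.randn(C, C // 2)
Wk = torch.randn(C, C // 2)

# Fully populated (HW, HW) attention matrix.
A = F.softmax((X @ Wq) @ (X @ Wk).T / (C // 2) ** 0.5, dim=-1)

Y = A @ X                                          # each output pixel is a weighted
                                                   # mix over all spatial locations
print(A.shape, Y.shape)                            # torch.Size([64, 64]) torch.Size([64, 16])
```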

Diagonal Attention

We can mix the attention and the AdaIN by writing Y=AXT+R, where A controls the content and T controls the style of the image.

Matrix representation of diagonal attention + AdaIN
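
A quick sketch of the combined form (again illustrative, not the official implementation): A mixes the rows (pixels) to move content, while T and R rescale and shift the columns (channels) to set style.

```python
import torch

# Sketch of the combined operation Y = A X T + R (illustrative values only).
H, W, C = 8, 8, 16
X = torch.randn(H * W, C)                               # features, one row per pixel

A = torch.softmax(torch.randn(H * W, H * W), dim=-1)    # spatial attention: mixes rows (content)
gamma, beta = torch.randn(C), torch.randn(C)
T = torch.diag(gamma)                                   # style scaling: acts on channels
R = beta.expand(H * W, C)                               # style bias, identical for every pixel

Y = A @ X @ T + R                                       # content via A, style via T and R
print(Y.shape)                                          # torch.Size([64, 16])
```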

Comparison to StyleGAN

In StyleGAN, the content code C generated from the per-pixel noises via the scaling network \text{B} is added to feature X before entering the AdaIN layer (i.e. Y=(X+C)T+R).

By expanding it, we get Y = (X+C)T + R = XT + R + CT, i.e. the original AdaIN formula plus an additional bias CT. This is different from spatial attention, which acts multiplicatively on the feature X.

Although one could potentially generate C so that the net effect is similar to AX, this would require a complicated content code generation network. This explains the fundamental limitation of content control in the original StyleGAN.

DAT Layer

Details of the DAT layer
A=(I+\beta\ \text{diag}(d))
  • d: a differential attention map of dimension H×W (flattened, so that \text{diag}(d) is HW×HW), generated from the content code.
  • \beta: used to calibrate whether the layer is responsible for minor or major changes, so as to prevent overemphasizing minor changes.
  • \text{diag}(d): denotes the diagonal matrix whose diagonal elements are d.

The diagonal spatial attention is multiplicative (i.e. A multiplies X) so that we can control global spatial variations.
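
Here is a minimal PyTorch sketch of such a DAT layer; the network sizes and names (attention_net, content_dim) are assumptions for illustration, not the official implementation. Because A = I + \beta\,\text{diag}(d) is diagonal, applying it reduces to a per-pixel scaling of the features, shared across channels.

```python
import torch
import torch.nn as nn

class DATLayer(nn.Module):
    """Minimal sketch of a Diagonal Attention (DAT) layer.
    Names and sizes are assumptions, not the official implementation."""
    def __init__(self, content_dim: int, height: int, width: int, beta: float = 1.0):
        super().__init__()
        self.height, self.width = height, width
        self.beta = beta                               # calibrates whether this layer makes
                                                       # minor or major content changes
        # Maps the content code to a per-pixel attention map d of size H*W.
        self.attention_net = nn.Sequential(
            nn.Linear(content_dim, 256), nn.ReLU(),
            nn.Linear(256, height * width), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, content_code: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), content_code: (B, content_dim)
        d = self.attention_net(content_code)           # (B, H*W)
        d = d.view(-1, 1, self.height, self.width)     # (B, 1, H, W)
        # A = I + beta * diag(d) along the spatial axis is an element-wise,
        # per-pixel scaling shared across all channels:
        return (1.0 + self.beta * d) * x

# Usage sketch
layer = DATLayer(content_dim=512, height=8, width=8)
x = torch.randn(2, 64, 8, 8)
z_c = torch.randn(2, 512)
print(layer(x, z_c).shape)                             # torch.Size([2, 64, 8, 8])
```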

Loss Function

L_{total} = L_{adv} + L_{ds}
  • L_{adv}: non-saturating adversarial loss with R_1 regularization.
  • L_{ds}: Diversity-Sensitive loss which encourages the attention network to yield diverse maps.
L_{ds} = \max (\lambda - \| G(z_s, z_c^1) - G(z_s, z_c^2) \|_1, 0)

The objective of L_{ds} is to maximize the L_1 distance between the generated images from different content codes z_c^1, z_c^2 with the same style code z_s (i.e. make them as diverse as possible).

The threshold \lambda is used to avoid the explosion of the loss value.
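
A small sketch of this loss term, assuming a generator that takes a style code and a content code (the signature and the per-pixel mean L1 distance are my assumptions, not the official training code):

```python
import torch

def diversity_sensitive_loss(generator, z_s, z_c1, z_c2, lam: float = 1.0):
    """Sketch of L_ds = max(lambda - ||G(z_s, z_c1) - G(z_s, z_c2)||_1, 0)."""
    img1 = generator(z_s, z_c1)            # same style, first content code
    img2 = generator(z_s, z_c2)            # same style, second content code
    l1 = (img1 - img2).abs().mean()        # per-pixel mean L1 distance between the images
    # Minimizing this pushes the distance up until it reaches the threshold `lam`,
    # which keeps the otherwise unbounded term from exploding.
    return torch.clamp(lam - l1, min=0.0)

# Usage sketch with a stand-in generator
G = lambda z_s, z_c: z_s.view(1, 1, 4, 4) + z_c.view(1, 1, 4, 4)
z_s = torch.randn(16)
loss = diversity_sensitive_loss(G, z_s, torch.randn(16), torch.randn(16))
print(loss.item())
```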

Results

Example results of fixed style and different content codes

The images in the second row are generated by changing the content codes at all layers while keeping the style code fixed.

The images in the following rows are sampled by changing the content codes only at specific layers while fixing those of the other layers.

You can see that the overall attributes of the face (style) do not change, while the face direction, smiling, etc. (content) change as we adjust the content codes in a hierarchical manner.

Example results of fixed/different content codes and style codes
  • (a): source image generated from arbitrary content and style codes.
  • (b): varying style codes with fixed content codes.
  • (c): varying content codes with fixed style codes.
  • (d): varying both codes.

Direct Attention Map Manipulation

Direct attention map manipulation

By controlling the specific areas of attention, we can selectively change the facial attributes.

  • (a): 1st 4×4 attention map: control face direction by changing the activated regions.
  • (b): 2nd 8×8 attention map: control mouth expression with high values on larger mouth areas.
  • (c): 2nd 16×16 attention map: control the size of eyes by changing activated pixel areas of the eyes.

Code

The official implementation by the authors is available on GitHub [2]: https://github.com/cyclomon/DiagonalGAN

References

[1] G. Kwon and J. Ye, “Diagonal Attention and Style-based GAN for Content-Style Disentanglement in Image Generation and Translation”, arXiv.org, 2022. https://arxiv.org/abs/2103.16146

[2] G. Kwon and J. Ye, “DiagonalGAN”, Github, 2022. https://github.com/cyclomon/DiagonalGAN
