JoJoGAN — Style Transfer on Faces Using StyleGAN — Create JoJo Faces (with codes)

Paper Explained: JoJoGAN — One-Shot Face Stylization

Originally posted on My Medium.

Introduction

JoJoGAN overview

JoJoGAN is a one-shot style transfer procedure that re-renders a face image in the style of a reference image.

It needs only one style reference image and quickly produces a style mapper that applies that style to any input face.

Applying different styles

Although it is called JoJoGAN, it can learn any style, not only the JoJo style (i.e. the style of the anime JoJo's Bizarre Adventure).

JoJoGAN Workflow

JoJoGAN workflow

Step 1: GAN inversion

Normally, a GAN generates an image from an input latent code (noise). GAN inversion goes the other way: it recovers the latent code that corresponds to a given image.
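To make the idea concrete, here is a minimal sketch of inversion by optimization on a toy linear "generator". Everything in it (the matrix A, the function names, the sizes) is hypothetical and chosen only for illustration. A real StyleGAN is nonlinear, and JoJoGAN relies on a pretrained encoder rather than this kind of per-image optimization.

```python
import numpy as np

def generate(A, w):
    """Toy generator: maps a latent code w to a flattened 'image'."""
    return A @ w

def invert(A, image, steps=2000):
    """Recover a latent code by gradient descent on ||generate(A, w) - image||^2."""
    w = np.zeros(A.shape[1])
    # Step size 1/L, where L = 2 * sigma_max(A)^2 is the Lipschitz constant of the gradient.
    lr = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    for _ in range(steps):
        grad = 2.0 * A.T @ (generate(A, w) - image)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
A = rng.normal(size=(12, 4))      # hypothetical generator weights
w_true = rng.normal(size=4)       # the latent code we pretend not to know
image = generate(A, w_true)       # the "given image"
w_rec = invert(A, image)          # inversion: image -> latent code
print(np.abs(generate(A, w_rec) - image).max())
```

The printed value is the largest pixel-wise reconstruction error, which should be close to zero for this toy setup.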

GAN inversion

We encode the style reference image y to obtain a latent style code w = T(y), and from w we derive a set of style parameters s.

GAN inversion for StyleGAN

For the GAN inverter (the encoder T), the researchers compared e4e, II2S, and ReStyle. They found that ReStyle gives the most accurate reconstruction, which leads to stylization that better preserves the features and properties of the input.

GAN inverters comparison

Step 2: Training set

Using StyleGAN’s style mixing mechanism, we can create a training set for fine-tuning StyleGAN.

Assuming StyleGAN has 26 style modulation layers, we define a mask M \in \{0, 1\}^{26}, i.e. an array of length 26 whose entries are each 0 or 1. Here z_i is a random latent noise vector, FC is StyleGAN’s mapping network, and s(FC(z_i)) denotes the style parameters obtained from it. By switching M on (1) or off (0) at different layers, we can mix s and s(FC(z_i)) to create many pairs (s_i, y) for our training set.

We produce new style codes using s_i = M \cdot s + (1 - M) \cdot s(FC(z_i)).
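The mixing formula can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: every layer is given the same hypothetical width (style_dim), the random array s_rand stands in for s(FC(z_i)), and the particular mask is an arbitrary choice, not the one used in the paper.

```python
import numpy as np

NUM_LAYERS = 26  # number of style modulation layers assumed in the text

def mix_styles(s_ref, s_rand, mask):
    """Per-layer mixing: s_i = M * s + (1 - M) * s(FC(z_i))."""
    m = np.asarray(mask, dtype=s_ref.dtype).reshape(-1, 1)  # one switch per layer
    return m * s_ref + (1.0 - m) * s_rand

rng = np.random.default_rng(0)
style_dim = 8                                       # hypothetical layer width
s_ref = np.ones((NUM_LAYERS, style_dim))            # stand-in for s from the reference
s_rand = rng.normal(size=(NUM_LAYERS, style_dim))   # stand-in for s(FC(z_i))

mask = np.zeros(NUM_LAYERS)
mask[:7] = 1                                        # arbitrary: keep the first 7 layers from s

s_i = mix_styles(s_ref, s_rand, mask)
print(s_i.shape)
```

Layers where the mask is 1 keep the reference style parameters, and layers where it is 0 take the randomly sampled ones. Drawing a new z_i each time yields a new training code s_i.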

Create a training set using style mixing

If you don’t know how StyleGAN’s style mixing works: it simply uses different style codes at different style modulation layers to create different outputs.

For details, you can read the style mixing part of my previous article:

Step 3: Finetuning

Using the training pairs (s_i, y), we fine-tune StyleGAN so that the images generated from the style-mixed codes s_i are close to the style reference image y.
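The shape of this objective can be illustrated with a toy example. In the sketch below, the "generator" is just a weight matrix W, and a plain squared error stands in for whatever image distance the real implementation uses. All names and sizes are hypothetical. The point is only that the codes s_i and the target y stay fixed while the generator weights are updated.

```python
import numpy as np

def mean_loss(W, S, y):
    """Mean squared distance between each generated image W @ s_i and the reference y."""
    return float(np.mean(np.sum((S @ W.T - y) ** 2, axis=1)))

def finetune(W, S, y, lr=0.01, steps=500):
    """Gradient descent on mean_loss with respect to the generator weights W."""
    W = W.copy()
    for _ in range(steps):
        residual = S @ W.T - y                     # shape (n_codes, image_dim)
        W -= lr * 2.0 * residual.T @ S / len(S)    # gradient of the mean loss
    return W

rng = np.random.default_rng(0)
n_codes, style_dim, image_dim = 16, 6, 4           # hypothetical toy sizes
S = rng.normal(size=(n_codes, style_dim))          # stand-ins for the mixed codes s_i
y = rng.normal(size=image_dim)                     # stand-in for the reference image
W0 = rng.normal(size=(image_dim, style_dim))       # "pretrained" generator weights
W1 = finetune(W0, S, y)                            # "fine-tuned" generator weights
print(mean_loss(W0, S, y), mean_loss(W1, S, y))
```

The second printed loss should be lower than the first: after fine-tuning, every code in the training set produces an output closer to y.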

Finetuning StyleGAN

This learns a mapping from an image of any style to an image in one specific style (i.e. that of the style reference y) while preserving the overall spatial content (i.e. the face/identity of that person).

Step 4: Stylize new faces

After fine-tuning StyleGAN, we simply invert the input face to style codes and then generate an image from them with the fine-tuned StyleGAN, which applies the target style to the generated image.

Inference

Results

JoJoGAN results
Comparison of JoJoGAN and other style transfer methods

Demo

The authors created a Replicate demo and a Colab notebook demo.

Face Not Detected Issue

However, I found that these demos do not accept complicated style references or inputs. I even tried one of the style images shown in the paper, but the demo just said: “Face not detected”.

“Face not detected” error in the replicate web demo

By reviewing their code, I found that the GAN inversion part does the following:

  1. Detect and crop the face out of the input image
  2. Use the cropped image to get the latent code

If no face landmarks are detected in the 1st step, it cannot proceed to the 2nd step.

Fix

I fixed the notebook code by handling the exception in the 1st step:

  • If the 1st step fails, it continues using the input image directly

Here is the fixed notebook. You can try it out to stylize more complicated faces.
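The exception handling described above could look roughly like the following sketch. It is not the notebook's actual code: align_or_passthrough and failing_detector are hypothetical names, and the detector is passed in as a plain function so the example runs without any face-detection library.

```python
def align_or_passthrough(image, detect_and_crop):
    """Try to crop the face; if detection fails, continue with the input image."""
    try:
        return detect_and_crop(image), True
    except Exception as err:  # a real fix might catch a narrower exception type
        print(f"Face alignment skipped: {err}")
        return image, False

def failing_detector(image):
    """Hypothetical stand-in for a detector that finds no face landmarks."""
    raise RuntimeError("Face not detected")

# The string is a placeholder for real image data.
result, aligned = align_or_passthrough("raw-image", failing_detector)
```

Returning a flag alongside the image lets the caller know the face was not aligned, which is useful because unaligned inputs tend to need the manual cropping described in the note below.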

Note: for complicated images, you should manually align and crop them into a square with the face in the center before uploading them to JoJoGAN/text_input/JoJoGAN/style_images.
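If you prefer to do the square crop in code rather than in an image editor, a minimal sketch is shown below. It assumes the image is already loaded as a NumPy array of shape (height, width, channels). The function name crop_square and the manually chosen center are hypothetical, and the sketch only crops. It does not rotate or align the face.

```python
import numpy as np

def crop_square(image, center=None):
    """Crop the largest square that fits, centered as close to `center` as possible.

    `center` is a (row, col) pixel position, e.g. roughly the middle of the face.
    It defaults to the middle of the image.
    """
    h, w = image.shape[:2]
    side = min(h, w)
    cy, cx = (h // 2, w // 2) if center is None else center
    # Clamp the window so it never leaves the image.
    top = int(np.clip(cy - side // 2, 0, h - side))
    left = int(np.clip(cx - side // 2, 0, w - side))
    return image[top:top + side, left:left + side]

# Dummy 300x500 image whose first channel stores the column index.
img = np.zeros((300, 500, 3))
img[:, :, 0] = np.arange(500)

centered = crop_square(img)                       # square around the image center
face_right = crop_square(img, center=(150, 400))  # face assumed near the right edge
print(centered.shape, face_right.shape)
```

Both crops are 300x300. The second one is shifted toward the right edge so that the chosen center stays as close to the middle of the crop as the image bounds allow.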

Crop and align the face

Demo Results

Demo results with complicated style reference

Preserve Color

Without preserving input image color
Preserving input image color

Non-face Style Reference

You can also try out non-face style references, though the results will not be ideal, since both the inverter and StyleGAN were pretrained on the FFHQ face dataset.

Without preserving input image color
Preserving input image color

References

[1] M. Chong and D. Forsyth, “JoJoGAN: One-Shot Face Stylization”, arXiv.org, 2022. https://arxiv.org/abs/2112.11641

Steins

Developer & AI Researcher. Write about AI, web dev/hack.