## Introduction:

We’re currently in the midst of a generative AI boom. In November 2022, OpenAI’s generative language model ChatGPT shook up the world, and in March 2023 we even got GPT-4!

Even though the future of these LLMs is extremely exciting, today we will be focusing on image generation. With the rise of diffusion models, image generation took a giant leap forward, and now we’re surrounded by models like DALL-E 2, Stable Diffusion, and Midjourney. For example, see the image below. Just to show the power of these models, I gave ChatGPT a very simple prompt, which I then fed into the free CatbirdAI. CatbirdAI runs the prompt through several models, including Openjourney, Dreamlike Diffusion, and more:

In this article **Daisuke Yamada** (my co-author) and I will work our way up to diffusion. We’ll use three different models and generate images in the style of MNIST handwritten digits with each of them. The first model will be a traditional Variational Autoencoder (VAE). We’ll then discuss GANs and implement a Deep Convolutional GAN (DCGAN). Finally, we’ll turn to diffusion models and implement the model described in the paper Denoising Diffusion Probabilistic Models. For each model we’ll go through the theory at work behind the scenes before implementing it in TensorFlow/Keras.

A quick note on notation: we will try to use subscripts like *x₀*, but there may be times where we instead have to write *x_T* to denote a subscript.

Let’s briefly discuss prerequisites. It’s important to be familiar with deep learning and comfortable with TensorFlow/Keras. Further, you should be familiar with VAEs and GANs; we will go over the main theory, but prior experience will be helpful. If you’ve never seen these models, check out these helpful sources: MIT 6.S191 Lecture, Stanford Generative Model Lecture, VAE Blog. Finally, there’s no need to be familiar with DCGANs or diffusion. Great! Let’s get started.

## Generative Model Trilemma:

As an unsupervised process, generative AI often lacks well-defined metrics to track progress. But before we approach any methods to evaluate generative models, we need to understand what generative AI is actually trying to accomplish! The goal of generative AI is to take training samples from some unknown, complex data distribution (e.g., the distribution of human faces) and learn a model that can “capture this distribution”. So, what factors are relevant in evaluating such a model?

We certainly want high-quality samples, i.e., the generated data should be realistic and accurate compared to the actual data distribution. Intuitively, we can evaluate this subjectively just by looking at the outputs. This is formalized and standardized in a benchmark known as HYPE (Human eYe Perceptual Evaluation). Although other quantitative methods exist, today we will rely on our own subjective evaluation.

It’s also important to have fast sampling (i.e., the speed of generation, or scalability). One particular aspect we will look at is the number of network passes required to generate a new sample. For example, we will see that GANs require just one pass of the generator network to turn noise into a (hopefully) realistic data sample, while DDPMs require sequential generation, which ends up making them much slower.

A final important quality is known as mode coverage. We don’t just want to learn a specific part of the unknown distribution, but rather we want to capture the entire distribution to ensure sample diversity. For example, we don’t want a model that just outputs images of 0s and 1s, but rather all possible digit classes.

Each of these three important factors (quality of samples, speed of sampling, and mode coverage) is covered in the “**Generative Model Trilemma**”.

Now that we understand how we will compare and contrast these models, let’s dive into VAEs!

## Variational Autoencoder:

One of the first generative models that you will encounter is the Variational Autoencoder (VAE). Since VAEs are just traditional autoencoders with a probabilistic spin, let’s remind ourselves of autoencoders.

Autoencoders are dimensionality reduction models that learn to compress data into some latent representation:

The encoder compresses the input into a latent representation called the bottleneck, and the decoder then reconstructs the input from it. Because the decoder reconstructs the input, we can train with an L2 loss between input and output.
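To make the pipeline concrete, here is a minimal numpy sketch of the encoder → bottleneck → decoder flow and the L2 reconstruction loss. The linear maps `W_enc`/`W_dec` and the dimensions are illustrative stand-ins, not the MNIST networks we build later:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed weights standing in for learned networks:
# compress 8-D inputs down to a 2-D bottleneck, then expand back.
W_enc = rng.normal(size=(8, 2))   # encoder weights: 8 -> 2
W_dec = rng.normal(size=(2, 8))   # decoder weights: 2 -> 8

def encode(x):
    return x @ W_enc              # latent (bottleneck) representation

def decode(z):
    return z @ W_dec              # reconstruction of the input

def l2_loss(x, x_hat):
    # Reconstruction term: mean squared error between input and output.
    return np.mean((x - x_hat) ** 2)

x = rng.normal(size=(4, 8))       # a batch of 4 fake inputs
z = encode(x)
x_hat = decode(z)
print(z.shape, x_hat.shape)       # (4, 2) (4, 8)
print(l2_loss(x, x_hat))          # some non-negative reconstruction error
```

In a real autoencoder these weight matrices are replaced by deep (often convolutional) networks trained by minimizing exactly this loss.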

Autoencoders cannot be used for image generation since they overfit, which leads to a sparse latent space that is discontinuous and disconnected (non-regularized). VAEs fix this by encoding the input *x* as a distribution over the latent space:

The input *x* gets fed into the encoder *E*. The output *E(x)* is a vector of means and a vector of standard deviations, which parameterize a distribution *P(z | x)*; the common choice is a multivariate Gaussian. From here we sample *z* ~ *P(z | x)*, and finally the decoder attempts to reconstruct *x* from *z* (just like with the autoencoder).

Notice this sampling process is non-differentiable, so we need to change something to make backpropagation possible. To do so we use the reparameterization trick, where we move sampling to an input layer by first sampling *ϵ* ~ *N*(0, 1). Then we can perform a fixed sampling step:

*z* = *μ* + *σ* ⊙ *ϵ*.

Notice we get the same sampling, but now we have a clear path to backpropagate the error, since the only stochastic node is an input!
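The reparameterization step can be sketched in a few lines of numpy; `mu` and `sigma` here are hypothetical encoder outputs, not values from the article’s model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical encoder outputs for a single input x:
mu = np.array([0.5, -1.0])            # vector of means
sigma = np.array([1.2, 0.3])          # vector of standard deviations

# Reparameterization trick: sample the noise at an *input* node,
# then apply a deterministic (hence differentiable) transformation.
eps = rng.standard_normal(mu.shape)   # eps ~ N(0, 1), elementwise
z = mu + sigma * eps                  # z = mu + sigma ⊙ eps

# z is distributed as N(mu, sigma^2), but gradients can now flow
# through mu and sigma, since the only stochastic node (eps) is an input.
print(z)
```

Because `eps` is sampled first and `z` is a deterministic function of `mu` and `sigma`, a framework like TensorFlow can backpropagate through both parameters.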

Recall that training for autoencoders uses an L2 loss, which constitutes a reconstruction term. For VAEs, we also add a regularization term, which is used to make the latent space “well-behaved”:
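For reference, with an L2 reconstruction term and a diagonal Gaussian posterior *N*(*μ*, *σ*²) regularized toward a standard Gaussian prior, the loss takes the standard closed form below (a sketch of the usual formulation, not necessarily the exact notation used later):

```latex
\mathcal{L}(x) =
\underbrace{\lVert x - \hat{x} \rVert^2}_{\text{reconstruction}}
+ \underbrace{\frac{1}{2}\sum_{i}\left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right)}_{D_{\mathrm{KL}}\left(N(\mu,\,\sigma^2)\,\Vert\,N(0,\,I)\right)}
```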