CS 180: Introduction to Computational Photography and Computer Vision

Project 5: Diffusion Models

Stephen Su

Project 5A Overview

In this project, we learn about how diffusion models work using the DeepFloyd IF diffusion model. We will implement diffusion sampling loops, and then use them for tasks such as inpainting and creating optical illusions.

Part 0: Setup

Before we start, we first need to set up our pretrained model. DeepFloyd IF is a two-stage diffusion model: the first stage produces an image of size $64 \times 64$ pixels, and the second stage upsamples it to $256 \times 256$ pixels. We can then sample from the model while varying the number of inference steps, which controls how many denoising steps are taken; more inference steps generally yield higher image quality at a higher computational cost. We also fix a random seed of $0$ for the rest of the project. Below are some samples from the model given a prompt.
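As a rough sketch, the two stages can be loaded and sampled through the Hugging Face diffusers library. The model identifiers and call signatures below follow the public DeepFloyd IF examples, but treat them as assumptions rather than the exact setup used here.

```python
import torch
from diffusers import DiffusionPipeline

# Assumed model identifiers from the public DeepFloyd IF release.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

prompt = "an oil painting of a snowy mountain village"
generator = torch.Generator("cuda").manual_seed(0)  # seed 0, used for the whole project
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# Stage 1: 64x64 image; num_inference_steps controls the number of denoising steps.
image_64 = stage_1(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_inference_steps=20,
    generator=generator,
    output_type="pt",
).images

# Stage 2: upsample to 256x256, conditioned on the stage-1 output.
image_256 = stage_2(
    image=image_64,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_inference_steps=20,
    generator=generator,
).images[0]
```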

Figure 1: An Oil Painting of a Snowy Mountain Village

Stage 1, 20 Inference Steps

Stage 2, 20 Inference Steps

Stage 1, 100 Inference Steps

Stage 2, 100 Inference Steps

The image from $100$ inference steps looks noticeably better than the one from $20$ inference steps: there is more texture in the snow on both the mountains and the houses.

Figure 2: A Man Wearing a Hat

Stage 1, 20 Inference Steps

Stage 2, 20 Inference Steps

The output accurately matches the prompt, and the model even adds features that weren't in the prompt, such as facial hair and glasses. The quality of the stage 2 output is higher than that of stage 1, but that is to be expected since stage 2 produces an image with more pixels.

Figure 3: A Rocket Ship

Stage 1, 20 Inference Steps

Stage 2, 20 Inference Steps

The model correctly outputs a rocket ship. However, with only 20 inference steps the details are lacking; the result is a basic rocket ship with nothing fancy or sophisticated.

Part 1: Sampling Loops

In this part, we will create our own sampling loops using the pretrained DeepFloyd denoisers. Starting with a clean image $x_0$, we can iteratively add noise to the image to get $x_t$, until we are left with pure noise at $t = T$. For the DeepFloyd models, the amount of noise added at each step is determined by a noise coefficient $\overline{\alpha}_t$, and $T$ is set to $T = 1000$ by default. A diffusion model tries to reverse this process by predicting the noise and denoising the image. Given an image $x_t$, we can predict the noise, and with that prediction we can either remove the noise entirely to estimate $x_0$, or remove a portion of it to estimate $x_{t - 1}$, an image with slightly less noise. We can repeatedly remove a portion of the noise until we arrive at a clean image $x_0$. If we want to sample images from the model, we can feed in pure Gaussian noise at timestep $T$ and apply the same process.

Part 1.1: Implementing the Forward Process

In the forward process, we take a clean image $x_0$ and add noise to get a noisy image $x_t$ at timestep $t$. The noisy image is sampled from a Gaussian distribution with mean $\sqrt{\overline{\alpha}_t}x_0$ and variance $(1 - \overline{\alpha}_t)$: $$ q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\overline{\alpha}_t}x_0, (1 - \overline{\alpha}_t)\mathbf{I}) $$ This is equivalent to computing $$ x_t = \sqrt{\overline{\alpha}_t}x_0 + \sqrt{1 - \overline{\alpha}_t}\epsilon \; , \; \epsilon \sim \mathcal{N}(0, \mathbf{I}) $$ Here, the $\overline{\alpha}_t$ values are provided by DeepFloyd, with $\overline{\alpha}_t$ close to $1$ for small $t$ and close to $0$ for large $t$. Below are some results after applying the forward process for various $t$ values.
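A minimal sketch of this forward step in PyTorch, assuming `alphas_cumprod` is the tensor of $\overline{\alpha}_t$ values taken from the model's scheduler:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image im (x_0) to produce x_t at timestep t.

    alphas_cumprod: 1-D tensor of alpha-bar values indexed by timestep (assumed given).
    """
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, I)
    x_t = torch.sqrt(alpha_bar) * im + torch.sqrt(1 - alpha_bar) * eps
    return x_t, eps
```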

Figure 4: The Campanile

Berkeley Campanile, $t = 0$

Noisy Campanile, $t = 250$

Noisy Campanile, $t = 500$

Noisy Campanile, $t = 750$

Part 1.2: Classical Denoising

Traditionally, if we want to remove noise, we would apply a Gaussian blur filter to the noisy image. Below are the results from applying a Gaussian blur to each of the noisy images above with a kernel size $k = 7$ and $\sigma = 2$.
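For reference, here is a minimal way to apply that blur with torchvision; the random `noisy_im` is just a stand-in for the noisy Campanile tensors from Part 1.1.

```python
import torch
import torchvision.transforms.functional as TF

# Stand-in for one of the noisy Campanile images from Part 1.1.
noisy_im = torch.rand(3, 64, 64)

# Classical denoising attempt: Gaussian blur with kernel size 7 and sigma 2.
blurred = TF.gaussian_blur(noisy_im, kernel_size=7, sigma=2.0)
```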

Noisy Campanile, $t = 250$

Noisy Campanile, $t = 500$

Noisy Campanile, $t = 750$

Gaussian Blur Denoising, $t = 250$

Gaussian Blur Denoising, $t = 500$

Gaussian Blur Denoising, $t = 750$

The results don't look very nice, and we will fix that in the upcoming parts.

Part 1.3: One-Step Denoising

Instead of using a Gaussian blur, we will use a pretrained UNet denoiser from the first stage of DeepFloyd. Given a timestep $t$, the model predicts the Gaussian noise in the image, and we can recover an estimate of the original image by removing the predicted noise according to the forward equation above.

Note: Since the model was trained with text conditioning, we need to pass in a prompt. We will feed in the generic prompt "a high quality photo" into the model.
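A sketch of the one-step estimate; `predict_noise` is a stand-in for the stage-1 UNet call (which in practice also takes the prompt embedding and predicts a variance):

```python
import torch

def one_step_denoise(x_t, t, predict_noise, alphas_cumprod):
    """Estimate the clean image x_0 from x_t in a single step."""
    alpha_bar = alphas_cumprod[t]
    eps = predict_noise(x_t, t)  # model's estimate of the noise added at timestep t
    # Invert the forward equation x_t = sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps.
    x0_est = (x_t - torch.sqrt(1 - alpha_bar) * eps) / torch.sqrt(alpha_bar)
    return x0_est
```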

Noisy Campanile, $t = 250$

Noisy Campanile, $t = 500$

Noisy Campanile, $t = 750$

One-Step Denoised, $t = 250$

One-Step Denoised, $t = 500$

One-Step Denoised, $t = 750$

It's clear here that the diffusion model denoises far better than a Gaussian blur filter.

Part 1.4: Iterative Denoising

In the previous part, we denoised using a single step. However, diffusion models were trained to denoise iteratively across hundreds of steps. We could start at timestep $T = 1000$ with $x_{1000}$ and iteratively denoise one step at a time until we get $x_0$, but this is quite slow and costly. Instead, we can skip some steps and use strided timesteps. The strided timesteps start at $t = 990$, corresponding to the noisiest image, and decrease with a stride of $30$ until we reach timestep $t = 0$, the clean image. On the $i^{th}$ step, we are at timestep $t$ with $x_t$ and want to get to $x_{t'}$ with $t' < t$ using the following formula: $$ x_{t'} = \frac{\sqrt{\overline{\alpha}_{t'}}\beta_t}{1 - \overline{\alpha}_{t}}x_0 + \frac{\sqrt{\alpha_t}(1 - \overline{\alpha}_{t'})}{1 - \overline{\alpha}_{t}}x_t + v_{\sigma} $$ where $\alpha_t = \overline{\alpha}_t / \overline{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ is our current estimate of the clean image (computed as in Part 1.3), and $v_{\sigma}$ is random noise, which DeepFloyd also predicts.
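One strided update of this formula might look like the following sketch, assuming `x0_est` and `v_sigma` have already been computed from the model's noise and variance predictions:

```python
import torch

def denoise_step(x_t, x0_est, t, t_prime, alphas_cumprod, v_sigma):
    """Take one strided denoising step from timestep t to an earlier timestep t' < t."""
    a_bar_t = alphas_cumprod[t]
    a_bar_tp = alphas_cumprod[t_prime]
    alpha_t = a_bar_t / a_bar_tp  # per-step alpha between t' and t
    beta_t = 1 - alpha_t
    x_tp = (torch.sqrt(a_bar_tp) * beta_t / (1 - a_bar_t)) * x0_est \
         + (torch.sqrt(alpha_t) * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t \
         + v_sigma
    return x_tp
```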

Below are the results after applying iterative denoising to the Campanile example.

Noisy Campanile, $t = 90$

Noisy Campanile, $t = 240$

Noisy Campanile, $t = 390$

Noisy Campanile, $t = 540$

Noisy Campanile, $t = 690$

Berkeley Campanile

Iteratively Denoised

One-Step Denoised

Gaussian Blurred

Part 1.5: Diffusion Model Sampling

With iterative denoising, instead of starting with a given image, we can instead start with pure noise. Here, the model will be effectively denoising pure noise, generating an image from scratch. Below are some example outputs from the model using this process, with the prompt "a high quality photo" passed in.

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

The quality of the images is quite poor, with many of them being too monotone. We will fix that in the next section.

Part 1.6: Classifier-Free Guidance (CFG)

To improve the image quality, we will use a technique called Classifier-Free Guidance, in which we compute a conditional noise estimate $\epsilon_c$ and an unconditional noise estimate $\epsilon_u$. Our new noise estimate is then $$ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) $$ In the equation above, $\gamma$ controls the strength of CFG. When $\gamma = 0$, we get the unconditional noise estimate, and when $\gamma = 1$, we get the conditional noise estimate. However, when $\gamma > 1$, the quality of the image drastically improves. The reason behind this phenomenon is still up for debate, but here are some sample images after applying this technique with $\gamma = 7$.
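In code, CFG is a small change to the sampling loop: run the UNet twice per step and blend the two estimates. A sketch, where `predict_noise` again stands in for the UNet call:

```python
def cfg_noise_estimate(x_t, t, predict_noise, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: blend conditional and unconditional noise estimates."""
    eps_c = predict_noise(x_t, t, cond_embeds)    # conditioned on the text prompt
    eps_u = predict_noise(x_t, t, uncond_embeds)  # conditioned on the empty prompt ""
    return eps_u + gamma * (eps_c - eps_u)
```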

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

The images here are much more vibrant, with more colors than the images in the previous section.

Part 1.7: Image to Image Translation

In this part, we apply the SDEdit algorithm to various images. The SDEdit algorithm starts by adding noise to an image, and then forces it back onto the natural image manifold without any conditioning, producing an output similar to the original image but with a few "edits". We will experiment with different starting indices in our strided timesteps, namely $i_{start} \in \{1, 3, 5, 7, 10, 20\}$.
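A sketch of SDEdit in terms of the pieces from earlier parts; `forward` is the noising function from Part 1.1 and `iterative_denoise_cfg` is a hypothetical name for the CFG sampling loop from Parts 1.4 through 1.6.

```python
# Strided timesteps from Part 1.4: 990, 960, ..., 0.
strided_timesteps = list(range(990, -1, -30))

def sdedit(im, i_start, forward, iterative_denoise_cfg, alphas_cumprod):
    """Noise the input up to strided timestep i_start, then denoise back down."""
    t_start = strided_timesteps[i_start]
    x_t, _ = forward(im, t_start, alphas_cumprod)        # add noise to the original image
    return iterative_denoise_cfg(x_t, i_start=i_start)   # project back onto the image manifold
```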

Figure 5: SDEdit Campanile

Berkeley Campanile

$i_{start} = 1$

$i_{start} = 3$

$i_{start} = 5$

$i_{start} = 7$

$i_{start} = 10$

$i_{start} = 20$

Figure 6: SDEdit Nevada Beach

Nevada Beach, Lake Tahoe

$i_{start} = 1$

$i_{start} = 3$

$i_{start} = 5$

$i_{start} = 7$

$i_{start} = 10$

$i_{start} = 20$

Figure 7: SDEdit Donner Lake

Donner Lake, Truckee

$i_{start} = 1$

$i_{start} = 3$

$i_{start} = 5$

$i_{start} = 7$

$i_{start} = 10$

$i_{start} = 20$

Part 1.7.1: Editing Hand Drawn and Web Images

The SDEdit algorithm works particularly well on non-realistic images, such as drawings, projecting them onto the natural image manifold. Below are some examples of this algorithm applied to non-realistic images.

Figure 8: SDEdit Lightning McQueen

Lightning McQueen, Cars

$i_{start} = 1$

$i_{start} = 3$

$i_{start} = 5$

$i_{start} = 7$

$i_{start} = 10$

$i_{start} = 20$

Figure 9: SDEdit Blue Car Drawing

My Drawing of a Blue Car

$i_{start} = 1$

$i_{start} = 3$

$i_{start} = 5$

$i_{start} = 7$

$i_{start} = 10$

$i_{start} = 20$

Figure 10: SDEdit Whale and Fish Drawing

My Drawing of Whale and Fish

$i_{start} = 1$

$i_{start} = 3$

$i_{start} = 5$

$i_{start} = 7$

$i_{start} = 10$

$i_{start} = 20$

Part 1.7.2: Inpainting

Here, we implement the inpainting procedure. The inpainting procedure starts with an image $x_{orig}$ and a binary mask $\mathbf{m}$, and generates new content where $\mathbf{m} = 1$ while keeping the original image where $\mathbf{m} = 0$. At every step of the diffusion loop, after obtaining $x_{t'}$, we "force" $x_{t'}$ to have the same pixels as $x_{orig}$ where $\mathbf{m} = 0$ through the equation $$ x_{t'} \leftarrow \mathbf{m} \odot x_{t'} + (\mathbf{1} - \mathbf{m}) \odot f(x_{orig}, t') $$ where $f$ is the forward (noising) process from Part 1.1. Below are some results with the inpainting procedure implemented.
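The forcing step itself is one line; a sketch, reusing the `forward` noising function from Part 1.1 (names illustrative):

```python
def inpaint_step(x_tp, x_orig, mask, t_prime, forward, alphas_cumprod):
    """Keep the original pixels where mask == 0; let the model fill in where mask == 1."""
    x_orig_noised, _ = forward(x_orig, t_prime, alphas_cumprod)
    return mask * x_tp + (1 - mask) * x_orig_noised
```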

Figure 11: Inpainting Campanile

Berkeley Campanile

Mask

Masked Campanile

Campanile Inpainted

Figure 12: Inpainting Mount Rushmore

Mount Rushmore

Mask

Masked Mount Rushmore

Mount Rushmore Inpainted

Figure 13: Inpainting Watermelons

Watermelons

Mask

Masked Watermelons

Watermelons Inpainted

Part 1.7.3: Text Conditional Image to Image Translation

Instead of projecting onto the image manifold with the generic prompt "a high quality photo", we can guide the projection with a more specific text prompt. Below are some results where we change the prompt. As in previous parts, we vary the noise levels to visualize the differences.

Figure 14: Campanile, "A Rocket Ship"

Berkeley Campanile

$i_{start} = 1$

$i_{start} = 3$

$i_{start} = 5$

$i_{start} = 7$

$i_{start} = 10$

$i_{start} = 20$

Figure 15: Dog, "A photo of a dog"

Dog

$i_{start} = 1$

$i_{start} = 3$

$i_{start} = 5$

$i_{start} = 7$

$i_{start} = 10$

$i_{start} = 20$

Figure 16: Oppenheimer, "A photo of a man"

Oppenheimer

$i_{start} = 1$

$i_{start} = 3$

$i_{start} = 5$

$i_{start} = 7$

$i_{start} = 10$

$i_{start} = 20$

Part 1.8: Visual Anagrams

In this part, we will create optical illusions: images that look like one thing, but when flipped upside down, look like another. To do this, we need to adjust the way we calculate noise in our process. At step $t$, we will denoise an image $x_t$ with one prompt to obtain $\epsilon_1$. At the same time, we will flip $x_t$ upside down and denoise it with a different prompt to get a noise estimate $\epsilon_2$. We will flip $\epsilon_2$ right-side up and then average the two noise estimates. The procedure can be summarized in the following equations: $$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$ $$ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) $$ $$ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} $$ where $\text{UNet}$ is our model, $\text{flip}$ is a function that flips our image, and $p_1$ and $p_2$ are two different prompt embeddings. Our final estimate is $\epsilon$, and we proceed as before, applying CFG as well. Below are some results.
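A sketch of the combined noise estimate; `predict_noise` stands in for the CFG noise estimate from Part 1.6, and flipping is done over the image height dimension:

```python
import torch

def anagram_noise_estimate(x_t, t, predict_noise, p1, p2):
    """Average the p1 noise estimate with the flipped p2 estimate of the flipped image."""
    eps_1 = predict_noise(x_t, t, p1)
    eps_2 = torch.flip(predict_noise(torch.flip(x_t, dims=[-2]), t, p2), dims=[-2])
    return (eps_1 + eps_2) / 2
```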

Figure 17

An Oil Painting of an Old Man

An Oil Painting of People Around a Campfire

Figure 18

A Painting of Albert Einstein

A Painting of a Tiger

Figure 19

An Oil Painting of a Red Panda

An Oil Painting of a Fox

Figure 20

A Drawing of a Walrus

A Drawing of a Lamb

Part 1.9: Hybrid Images

Similar to Project 2, we can use diffusion models to create so-called hybrid images: images that look like one thing up close and another from farther away. To do so, we create a composite noise estimate $\epsilon$ by estimating the noise with two different prompts and then combining the low frequencies of one estimate with the high frequencies of the other. This can be summarized as follows: $$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$ $$ \epsilon_2 = \text{UNet}(x_t, t, p_2) $$ $$ \epsilon = {f_{lowpass}}(\epsilon_1) + {f_{highpass}}(\epsilon_2) $$ where $f_{lowpass}$ and $f_{highpass}$ are low-pass and high-pass filters implemented with a Gaussian blur, and $p_1$ and $p_2$ are two different text prompts. Our final estimate is $\epsilon$, and we proceed as before, applying CFG as well. Below are some results.
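A sketch of the composite estimate; `predict_noise` again stands in for the CFG noise estimate, and the blur parameters are illustrative choices for the low-pass filter:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, predict_noise, p1, p2, kernel_size=33, sigma=2.0):
    """Low frequencies of the p1 estimate plus high frequencies of the p2 estimate."""
    eps_1 = predict_noise(x_t, t, p1)
    eps_2 = predict_noise(x_t, t, p2)
    low = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```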

Hybrid Image of Skull and Waterfall

Hybrid Image of Gym and Hamburger

Hybrid Image of Bird and Feather

Hybrid Image of NYC and Panda


Project 5B Overview

In this section of the project, we train a diffusion model on the MNIST dataset to generate images of MNIST digits. We'll first start by training a UNet to do single-step denoising, and then we will train a UNet to iteratively denoise by adding time conditioning and class conditioning. We will sample results along the way to see what the model outputs.

Part 1: Training a Single-Step Denoising UNet

In this part, we implement a denoiser as a UNet. The UNet takes in an image with some level of noise added to it and outputs a prediction of what the denoised image looks like. We will use the MNIST dataset, which consists of $28 \times 28$ pixel grayscale images of digits. Below is a diagram of the UNet architecture.

UNet Architecture

After implementing the UNet, we can look at what our dataset looks like. For the dataset, we need to generate training pairs $(z, x)$, where each $x$ is an MNIST digit and each $z$ is $x$ with some added noise. We use the following equation to determine $z$: $$z = x + \sigma\epsilon, \; \epsilon \sim \mathcal{N}(0, \mathbf{I})$$ Here, we let $\sigma \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0 \}$, which controls the strength of the noise; a higher $\sigma$ value means more noise is added. Below is a visualization of different $\sigma$ values across different digits.
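Generating a pair is essentially a one-liner; a sketch:

```python
import torch

def add_noise(x, sigma):
    """Generate a training pair: return z, the clean digit x with Gaussian noise added."""
    eps = torch.randn_like(x)  # epsilon ~ N(0, I)
    return x + sigma * eps
```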

Figure 21: Visualization of Noising Process

$\sigma = 0.0$ $\sigma = 0.2$ $\sigma = 0.4$ $\sigma = 0.5$ $\sigma = 0.6$ $\sigma = 0.8$ $\sigma = 1.0$

During the training process, the model will attempt to denoise noisy MNIST digits $z$ generated using $\sigma = 0.5$ applied to a digit $x$. For the UNet model, we will use a hidden dimension $D = 128$, optimized using mean squared error as the loss function. We will train using a batch size of $256$ for $5$ epochs. We will also use the Adam optimizer with a learning rate of $1 \times 10^{-4}$. Below is a plot of the training loss over each gradient descent step we take.
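A condensed sketch of this training loop; the UNet itself is passed in as `model`, since its definition (the architecture diagrammed above) is omitted here.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

def train_denoiser(model, epochs=5, batch_size=256, sigma=0.5, lr=1e-4, device="cpu"):
    """Train a UNet to map noisy digits z back to clean digits x with an MSE loss."""
    loader = torch.utils.data.DataLoader(
        datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
        batch_size=batch_size, shuffle=True,
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for epoch in range(epochs):
        for x, _ in loader:                        # digit labels are unused for plain denoising
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)    # noisy input, sigma = 0.5 by default
            loss = criterion(model(z), x)          # predict the clean digit
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```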

Figure 22: Training Loss Curve

Below are some results from the model after the first and fifth epoch.

Figure 23: Results From Test Set After 1 Epoch

Input Noisy, $\sigma = 0.5$ Output

Figure 24: Results From Test Set After 5 Epochs

Input Noisy, $\sigma = 0.5$ Output

We can also see how our denoiser performs on various $\sigma$ values that it wasn't trained for.

Figure 25: Results on Digits From Test Set With Various Noise Levels

Noisy with $\sigma = 0.0$

Denoised Image

Noisy with $\sigma = 0.2$

Denoised Image

Noisy with $\sigma = 0.4$

Denoised Image

Noisy with $\sigma = 0.5$

Denoised Image

Noisy with $\sigma = 0.6$

Denoised Image

Noisy with $\sigma = 0.8$

Denoised Image

Noisy with $\sigma = 1.0$

Denoised Image

Part 2: Training a Diffusion Model

Time Conditioning

Up until now, our UNet model only predicted the clean image. In this part, we make a slight change: instead of predicting the clean image, we predict the noise $\epsilon$ that was added. Eventually, we want to sample pure noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ and generate a realistic image $x$ by iteratively denoising. To iteratively denoise, we will use timesteps again, similar to previous sections. Using the equation $$ x_t = \sqrt{\overline{\alpha}_t}x_0 + \sqrt{1 - \overline{\alpha}_t}\epsilon \; , \; \epsilon \sim \mathcal{N}(0, \mathbf{I}) $$ we generate a noisy image $x_t$ from $x_0$ for some timestep $t \in \{0, 1, \dots, T\}$. When $t = 0$, $x_t$ is a clean image. When $t = T$, $x_t$ is pure noise. For $t \in \{1, \dots, T - 1 \}$, $x_t$ is a linear combination of the clean image and noise. The derivations for $\beta_t$, $\alpha_t$, and $\overline{\alpha}_t$ can be found in the DDPM paper. For our use case, we can set $T = 300$ since our dataset is simple.
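A sketch of the noise schedule and the noising step, assuming the standard DDPM choice of linearly spaced $\beta_t$ between $10^{-4}$ and $0.02$ (an assumption; see the DDPM paper for the exact construction):

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T)          # beta_t schedule (assumed linear)
alphas = 1.0 - betas                           # alpha_t
alphas_cumprod = torch.cumprod(alphas, dim=0)  # alpha-bar_t

def noise_image(x0, t):
    """Produce x_t and the noise used, for a batch of images x0 and integer timesteps t."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps, eps
```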

Because the variance of $x_t$ varies with $t$, we cannot simply feed $x_t$ into our UNet. We need to condition our input on $t$ by adding fully connected blocks to our UNet. Below is an updated diagram of our UNet architecture.

Conditional UNet Architecture

To train the model, we pick a random image from the training set, a random $t$, and train the denoiser to predict the noise in $x_t$. It follows the algorithm below.

For training, we will train for $20$ epochs using a batch size of $128$. For the UNet, we will use a hidden dimension $D = 64$. We will also use the Adam optimizer with an initial learning rate of $1 \times 10^{-3}$, along with an exponential learning rate decay scheduler with $\gamma = 0.1^{\frac{1}{20}}$. Below is the training loss curve for the time-conditioned UNet model.

Figure 26: Training Loss Curve for Time-Conditioned UNet

We can also sample from our model, similar to previous sections. To sample, we use the following algorithm.
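As a rough sketch of that sampling loop, reusing the schedule tensors defined in the snippet above (the exact timestep normalization passed to the model is an assumption):

```python
import torch

@torch.no_grad()
def sample(model, n=40, T=300, image_size=28, device="cpu"):
    """Sample images by starting from pure noise and iteratively removing predicted noise."""
    x = torch.randn(n, 1, image_size, image_size, device=device)
    for t in range(T - 1, -1, -1):
        t_batch = torch.full((n,), t, device=device, dtype=torch.float32)
        eps = model(x, t_batch / T)  # predicted noise, with t normalized to [0, 1]
        a, a_bar, b = alphas[t], alphas_cumprod[t], betas[t]
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - (1 - a) / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a) + torch.sqrt(b) * z
    return x
```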

Below are the results from sampling the model after each epoch. Feel free to hover over the images to see a GIF of the sampling process!

Figure 27: Samples From Time-Conditioned UNet

Image

Epoch 1

Image

Epoch 5

Image

Epoch 10

Image

Epoch 15

Image

Epoch 20

Class Conditioning

To better control the image generation, we can condition our UNet on the digit class $\{0, \dots, 9\}$. To represent each class, we use a one-hot vector, and we add two additional fully-connected blocks to feed in the class. We also implement dropout: 10% of the time, we drop the class conditioning by setting the one-hot vector to $0$, so that the UNet still works even when it isn't conditioned on a class.
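A sketch of the class-vector construction with this 10% dropout (names illustrative):

```python
import torch
import torch.nn.functional as F

def make_class_vector(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit labels, zeroing the vector 10% of the time."""
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], 1) < p_uncond).float()
    return c * (1 - drop)  # an all-zero vector means "unconditional"
```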

To train, we will use the following algorithm. The algorithm is similar to the training algorithm before, with the added process of computing the one-hot vector for each image.

We will use the same training hyperparameters as for the time-conditioned UNet. Below is a curve of the training losses.

Figure 28: Training Loss Curve for Class-Conditioned UNet

To sample, we will use the same technique from part A of this project. We saw before that class-conditioned results aren't good unless we use classifier-free guidance, so we will use classifier-free guidance with $\gamma = 5$.

Here are some results sampled after each epoch. Again, feel free to hover over the images to see a GIF of the sampling process!

Figure 29: Samples From Class-Conditioned UNet

Image

Epoch 1

Image

Epoch 5

Image

Epoch 10

Image

Epoch 15

Image

Epoch 20


Project Insights

I really enjoyed learning about how diffusion models work. For project 5A, my favorite part was generating my own images from scratch and trying different prompts to see what works best! For project 5B, my favorite part was implementing the model myself and watching the images go from pure noise to something clean at every timestep.

Citations