In this project, we learn about how diffusion models work using the DeepFloyd IF diffusion model. We will implement diffusion sampling loops, and then use them for tasks such as inpainting and creating optical illusions.
Before we start, we first need to set up our pretrained model. DeepFloyd IF is a two-stage diffusion model, where the first stage produces an image of size $64 \times 64$ pixels, and the second stage upsamples it to $256 \times 256$ pixels. We can then sample from the model, varying the number of inference steps. The number of inference steps indicates how many denoising steps to take, with more steps generally yielding higher image quality at a greater computational cost. We also set a random seed to use for the rest of the project; we will be using the seed $0$.
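For reference, here is a minimal sketch of this setup, assuming the Hugging Face `diffusers` pipelines for DeepFloyd IF (the exact model variants, arguments, and the example prompt here are assumptions, not necessarily the ones used for the samples below):

```python
import torch
from diffusers import DiffusionPipeline

torch.manual_seed(0)  # the seed we use for the rest of the project

# Stage 1 generates 64x64 images; stage 2 upsamples them to 256x256.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16",
    torch_dtype=torch.float16).to("cuda")
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None,
    variant="fp16", torch_dtype=torch.float16).to("cuda")

prompt_embeds, negative_embeds = stage_1.encode_prompt(
    "an oil painting of a snowy mountain village")  # example prompt

image = stage_1(prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                num_inference_steps=20, output_type="pt").images
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                num_inference_steps=20).images[0]
```

Below are some samples from the model given a prompt.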
Stage 1, 20 Inference Steps | Stage 2, 20 Inference Steps | Stage 1, 100 Inference Steps | Stage 2, 100 Inference Steps |
---|---|---|---|
The quality of the image from running 100 inference steps seems noticeably higher than that of the image from running 20 inference steps. There is more texture on the snow, both on the mountains and on the houses, with 100 inference steps.
Stage 1, 20 Inference Steps | Stage 2, 20 Inference Steps |
---|---|
The output from the model accurately matches the prompt, and even adds features that weren't in the prompt, such as facial hair and glasses. The quality of the output from the second stage is higher than that of the first stage. However, that is to be expected, since stage 2 produces an image with more pixels.
Stage 1, 20 Inference Steps | Stage 2, 20 Inference Steps |
---|---|
The model is able to correctly output a rocket ship. However, with only 20 inference steps, the details of the rocket ship seem quite lacking. The image seems to be a basic rocket ship with nothing fancy or sophisticated.
In this part, we will create our own sampling loops using the pretrained DeepFloyd denoisers. Starting with a clean image $x_0$, we can iteratively add noise to get increasingly noisy images $x_t$, until we are left with pure noise at $t = T$. For the DeepFloyd models, the amount of noise added at each step is determined by a noise coefficient $\overline{\alpha}_t$, and $T$ is set to $T = 1000$ by default. A diffusion model tries to reverse this process by predicting the noise and denoising the image. Given an image $x_t$, we can predict the noise, and then either remove it entirely to estimate $x_0$, or remove a portion of it to estimate $x_{t - 1}$, an image with slightly less noise. We can repeatedly remove portions of the noise until we arrive at a clean image $x_0$. If we want to sample images from the model, we can feed in pure noise drawn from a Gaussian distribution at timestep $T$ and apply the same process.
In the forward process, we take a clean image $x_0$ and add noise to it to get a noisy image $x_t$ at timestep $t$. The noisy image is sampled from a Gaussian distribution with mean $\sqrt{\overline{\alpha}_t}x_0$ and variance $(1 - \overline{\alpha}_t)\mathbf{I}$. This is defined by $$ q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\overline{\alpha}_t}x_0, (1 - \overline{\alpha}_t)\mathbf{I}) $$ which is equivalent to computing $$ x_t = \sqrt{\overline{\alpha}_t}x_0 + \sqrt{1 - \overline{\alpha}_t}\epsilon \; , \; \epsilon \sim \mathcal{N}(0, \mathbf{I}) $$ Here, the $\overline{\alpha}_t$ values are determined by the people over at DeepFloyd, where $\overline{\alpha}_t$ is close to $1$ for small $t$, and close to $0$ for large $t$. Below are some results after applying the forward process for various $t$ values.
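As a reference, here is a minimal sketch of the forward process; `alphas_cumprod` is assumed to be the tensor of $\overline{\alpha}_t$ values exposed by the model's noise scheduler:

```python
import torch

def forward(x0, t, alphas_cumprod):
    # Sample x_t ~ q(x_t | x_0) using the equation above.
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    xt = torch.sqrt(abar_t) * x0 + torch.sqrt(1 - abar_t) * eps
    return xt, eps  # returning the noise as well is handy later
```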
Berkeley Campanile, $t = 0$ | Noisy Campanile, $t = 250$ | Noisy Campanile, $t = 500$ | Noisy Campanile, $t = 750$ |
---|---|---|---|
Traditionally, if we want to remove noise, we would apply a Gaussian blur filter to the noisy image. Below are the results of applying a Gaussian blur to each of the noisy images above, with kernel size $k = 7$ and $\sigma = 2$.
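For reference, this baseline is a one-liner with `torchvision` (a sketch; it assumes the images are already tensors):

```python
import torchvision.transforms.functional as TF

# Blur each noisy image with a 7x7 Gaussian kernel, sigma = 2.
blurred = TF.gaussian_blur(noisy_image, kernel_size=7, sigma=2.0)
```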
Noisy Campanile, $t = 250$ | Noisy Campanile, $t = 500$ | Noisy Campanile, $t = 750$ |
---|---|---|
Gaussian Blur Denoising, $t = 250$ | Gaussian Blur Denoising, $t = 500$ | Gaussian Blur Denoising, $t = 750$ |
The results don't look very nice, and we will fix that in the upcoming parts.
Instead of using a Gaussian blur, we will use the pretrained UNet denoiser from the first stage of DeepFloyd. Given a timestep $t$, the model predicts the Gaussian noise in the image, and we can recover something close to the original image by using the forward-process equation to solve for $x_0$.
Note: Since the model was trained with text conditioning, we need to pass in a prompt. We will feed in the generic prompt "a high quality photo" into the model.
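Here is a sketch of this one-step denoising, assuming the stage-1 UNet from the `diffusers` pipeline, which predicts the noise and the variance stacked along the channel dimension (so we keep the first three channels):

```python
def one_step_denoise(unet, xt, t, prompt_embeds, alphas_cumprod):
    with torch.no_grad():
        # Predict the noise in x_t; keep the noise half of the output.
        eps = unet(xt, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    abar_t = alphas_cumprod[t]
    # Solve the forward-process equation for x_0.
    return (xt - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
```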
Noisy Campanile, $t = 250$ | Noisy Campanile, $t = 500$ | Noisy Campanile, $t = 750$ |
---|---|---|
One-Step Denoised, $t = 250$ | One-Step Denoised, $t = 500$ | One-Step Denoised, $t = 750$ |
It's clear here that the diffusion model does a much better job of denoising than the Gaussian blur filter.
In the previous part, we denoised using a single step. However, diffusion models were trained to denoise iteratively across hundreds of steps. We could start at timestep $T = 1000$ with $x_{1000}$ and iteratively denoise one step at a time until we reach $x_0$. However, this is quite slow and costly. Instead, we can skip some steps and use strided timesteps. The strided timesteps will start at $t = 990$, corresponding to the noisiest image, and take a stride of $30$ until we are at timestep $t = 0$, the clean image. On the $i^{th}$ step, we are at timestep $t$ with $x_t$, and want to get to $x_{t'}$ such that $t' < t$ using the following formula: $$ x_{t'} = \frac{\sqrt{\overline{\alpha}_{t'}}\beta_t}{1 - \overline{\alpha}_{t}}x_0 + \frac{\sqrt{\alpha_t}(1 - \overline{\alpha}_{t'})}{1 - \overline{\alpha}_{t}}x_t + v_{\sigma} $$ where $\alpha_t = \overline{\alpha}_t / \overline{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ is our current estimate of the clean image from the one-step formula above, and $v_\sigma$ is random noise, which DeepFloyd also predicts.
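Here is a sketch of the loop, ignoring the $v_\sigma$ term for brevity; `unet_eps(x, t)` is assumed to return the noise estimate from the previous part:

```python
def iterative_denoise(unet_eps, x, strided_timesteps, alphas_cumprod, i_start=0):
    # strided_timesteps = [990, 960, ..., 0]; i_start lets us begin partway in.
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]
        abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = abar_t / abar_tp
        beta_t = 1 - alpha_t
        eps = unet_eps(x, t)
        # Current clean-image estimate (the one-step denoising formula).
        x0_hat = (x - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
        # Step from x_t to x_{t'}, a slightly less noisy image.
        x = (torch.sqrt(abar_tp) * beta_t / (1 - abar_t)) * x0_hat \
            + (torch.sqrt(alpha_t) * (1 - abar_tp) / (1 - abar_t)) * x
    return x
```

Below are some intermediate noisy images from the loop, followed by a comparison of the final result against one-step denoising and Gaussian blurring.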
Noisy Campanile, $t = 90$ | Noisy Campanile, $t = 240$ | Noisy Campanile, $t = 390$ | Noisy Campanile, $t = 540$ | Noisy Campanile, $t = 690$ |
---|---|---|---|---|
Berkeley Campanile | Iteratively Denoised | One-Step Denoised | Gaussian Blurred |
---|---|---|---|
With iterative denoising, instead of starting from a given image, we can start from pure noise. The model then effectively denoises pure noise, generating an image from scratch. Below are some example outputs using this process, with the prompt "a high quality photo" passed in.
Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 |
---|---|---|---|---|---|
The quality of the images is quite poor, with many of them being too monotone. We will fix that in the next section.
To improve the image quality, we will use a technique called Classifier-Free Guidance (CFG), in which we compute both a conditional noise estimate $\epsilon_c$ and an unconditional noise estimate $\epsilon_u$. Our new noise estimate is then $$ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) $$ In the equation above, $\gamma$ controls the strength of CFG. When $\gamma = 0$, we get the unconditional noise estimate, and when $\gamma = 1$, we get the conditional noise estimate. However, when $\gamma > 1$, the quality of the image drastically improves. The reason behind this phenomenon is still up for debate today.
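Here is a sketch of the CFG noise estimate, assuming the stage-1 UNet and precomputed embeddings for the prompt and for the empty prompt `""`:

```python
def cfg_noise(unet, x, t, cond_embeds, uncond_embeds, gamma=7.0):
    with torch.no_grad():
        eps_c = unet(x, t, encoder_hidden_states=cond_embeds).sample[:, :3]
        eps_u = unet(x, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    # Push the estimate past the conditional one, away from the unconditional.
    return eps_u + gamma * (eps_c - eps_u)
```

Below are some sample images after applying this technique, using $\gamma = 7$.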
Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 |
---|---|---|---|---|---|
The images here are much more vibrant, with more colors than the images in the previous section.
In this part, we apply the SDEdit algorithm to various images. The SDEdit algorithm starts by adding noise to an image, and then forces the noisy image back onto the image manifold without any conditioning, producing an output similar to the original image but with a few "edits". We will experiment with different starting indices into our strided timesteps, namely $i_{start} \in \{1, 3, 5, 7, 10, 20\}$.
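SDEdit reuses the two pieces we already have: the forward process and the iterative denoiser. A sketch:

```python
def sdedit(x_orig, i_start, unet_eps, strided_timesteps, alphas_cumprod):
    # Noise the image up to the level of strided_timesteps[i_start]...
    t = strided_timesteps[i_start]
    xt, _ = forward(x_orig, t, alphas_cumprod)
    # ...then denoise it back onto the natural image manifold.
    return iterative_denoise(unet_eps, xt, strided_timesteps,
                             alphas_cumprod, i_start=i_start)
```

The smaller $i_{start}$ is, the more noise is added and the further the result can drift from the original image.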
Berkeley Campanile | $i_{start} = 1$ | $i_{start} = 3$ | $i_{start} = 5$ | $i_{start} = 7$ | $i_{start} = 10$ | $i_{start} = 20$ |
---|---|---|---|---|---|---|
Nevada Beach, Lake Tahoe | $i_{start} = 1$ | $i_{start} = 3$ | $i_{start} = 5$ | $i_{start} = 7$ | $i_{start} = 10$ | $i_{start} = 20$ |
---|---|---|---|---|---|---|
Donner Lake, Truckee | $i_{start} = 1$ | $i_{start} = 3$ | $i_{start} = 5$ | $i_{start} = 7$ | $i_{start} = 10$ | $i_{start} = 20$ |
---|---|---|---|---|---|---|
The SDEdit algorithm works particularly well on non-realistic images, such as drawings, projecting them onto the natural image manifold. Below are some examples of the algorithm applied to non-realistic images.
Lightning McQueen, Cars | $i_{start} = 1$ | $i_{start} = 3$ | $i_{start} = 5$ | $i_{start} = 7$ | $i_{start} = 10$ | $i_{start} = 20$ |
---|---|---|---|---|---|---|
My Drawing of a Blue Car | $i_{start} = 1$ | $i_{start} = 3$ | $i_{start} = 5$ | $i_{start} = 7$ | $i_{start} = 10$ | $i_{start} = 20$ |
---|---|---|---|---|---|---|
My Drawing of Whale and Fish | $i_{start} = 1$ | $i_{start} = 3$ | $i_{start} = 5$ | $i_{start} = 7$ | $i_{start} = 10$ | $i_{start} = 20$ |
---|---|---|---|---|---|---|
Here, we implement the inpainting procedure. Inpainting starts with an image $x_{orig}$ and a binary mask $\mathbf{m}$, and generates new content where $\mathbf{m} = 1$ while keeping the original image where $\mathbf{m} = 0$. At every step of the diffusion loop, after obtaining $x_{t'}$, we "force" $x_{t'}$ to have the same pixels as $x_{orig}$ where $\mathbf{m} = 0$ through the equation $$ x_{t'} \leftarrow \mathbf{m} \cdot x_{t'} + (\mathbf{1} - \mathbf{m}) \cdot f(x_{orig}, t') $$ where $f$ is the forward process from earlier. Below are some results with the inpainting procedure implemented.
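Concretely, a sketch of this masked update, applied right after each denoising step:

```python
def inpaint_step(x_tprime, t_prime, x_orig, mask, alphas_cumprod):
    # Re-noise the original image to the current noise level t'.
    noised_orig, _ = forward(x_orig, t_prime, alphas_cumprod)
    # Generate where mask == 1; keep the (re-noised) original where mask == 0.
    return mask * x_tprime + (1 - mask) * noised_orig
```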
Berkeley Campanile | Mask | Masked Campanile | Campanile Inpainted |
---|---|---|---|
Mount Rushmore | Mask | Masked Mount Rushmore | Mount Rushmore Inpainted |
---|---|---|---|
Watermelons | Mask | Masked Watermelons | Watermelons Inpainted |
---|---|---|---|
Instead of projecting onto the image manifold using the generic prompt "a high quality photo", we can guide the projection with a text prompt. This is done by replacing "a high quality photo" with a more specific prompt. Below are some results where we change the prompt. As in previous parts, we vary the noise levels to visualize the differences.
Berkeley Campanile | $i_{start} = 1$ | $i_{start} = 3$ | $i_{start} = 5$ | $i_{start} = 7$ | $i_{start} = 10$ | $i_{start} = 20$ |
---|---|---|---|---|---|---|
Dog | $i_{start} = 1$ | $i_{start} = 3$ | $i_{start} = 5$ | $i_{start} = 7$ | $i_{start} = 10$ | $i_{start} = 20$ |
---|---|---|---|---|---|---|
Oppenheimer | $i_{start} = 1$ | $i_{start} = 3$ | $i_{start} = 5$ | $i_{start} = 7$ | $i_{start} = 10$ | $i_{start} = 20$ |
---|---|---|---|---|---|---|
In this part, we will create optical illusions: images that look like one thing, but when flipped upside down, look like another. To do this, we need to adjust the way we calculate noise in our process. At step $t$, we denoise an image $x_t$ with one prompt to obtain $\epsilon_1$. At the same time, we flip $x_t$ upside down and denoise it with a different prompt to get a noise estimate $\epsilon_2$. We then flip $\epsilon_2$ right-side up and average the two noise estimates. The procedure can be summarized in the following equations: $$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$ $$ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) $$ $$ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} $$ where $\text{UNet}$ is our model, $\text{flip}$ is a function that flips our image, and $p_1$ and $p_2$ are two different prompt embeddings. Our final estimate is $\epsilon$, and we proceed normally as before, applying CFG as well. Below are some results.
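A sketch, where `unet_eps(x, t, p)` is assumed to be the (CFG-combined) noise estimator from before and images are $(N, C, H, W)$ tensors:

```python
def anagram_noise(unet_eps, x, t, p1, p2):
    flip = lambda img: torch.flip(img, dims=[-2])  # flip the height axis
    eps1 = unet_eps(x, t, p1)
    eps2 = flip(unet_eps(flip(x), t, p2))
    # Average so the image is denoised toward p1 right-side up
    # and toward p2 upside down.
    return (eps1 + eps2) / 2
```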
An Oil Painting of an Old Man | An Oil Painting of People Around a Campfire |
---|---|
A Painting of Albert Einstein | A Painting of a Tiger |
An Oil Painting of a Red Panda | An Oil Painting of a Fox |
A Drawing of a Walrus | A Drawing of a Lamb |
Similar to Project 2, we can use diffusion models to create so-called hybrid images: images that look like one thing up close, and another from far away. To do so, we create a composite noise estimate $\epsilon$ by estimating the noise with two different prompts, and then combining the low frequencies of one estimate with the high frequencies of the other. This can be summarized as follows: $$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$ $$ \epsilon_2 = \text{UNet}(x_t, t, p_2) $$ $$ \epsilon = f_{lowpass}(\epsilon_1) + f_{highpass}(\epsilon_2) $$ where $f_{lowpass}$ and $f_{highpass}$ are low-pass and high-pass filters implemented with a Gaussian blur, and $p_1$ and $p_2$ are two different text prompts. Our final estimate is $\epsilon$, and we proceed normally as before, applying CFG as well. Below are some results.
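A sketch, again assuming the `unet_eps` noise estimator; the kernel size and sigma here are illustrative values:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet_eps, x, t, p1, p2, kernel_size=33, sigma=2.0):
    eps1 = unet_eps(x, t, p1)
    eps2 = unet_eps(x, t, p2)
    lowpass = lambda e: TF.gaussian_blur(e, kernel_size, sigma)
    # Low frequencies of eps1 dominate from far away; the high-frequency
    # residual of eps2 dominates up close.
    return lowpass(eps1) + (eps2 - lowpass(eps2))
```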
Hybrid Image of Skull and Waterfall | Hybrid Image of Gym and Hamburger | Hybrid Image of Bird and Feather | Hybrid Image of NYC and Panda |
---|---|---|---|
In this section of the project, we train a diffusion model on the MNIST dataset to generate images of MNIST digits. We'll first start by training a UNet to do single-step denoising, and then we will train a UNet to iteratively denoise by adding time conditioning and class conditioning. We will sample results along the way to see what the model outputs.
In this part, we implement a denoiser as a UNet. The UNet takes in an image with some level of noise added to it, and outputs a prediction of what the denoised image looks like. We will use the MNIST dataset, which consists of $28 \times 28$ pixel black-and-white images of digits. Below is a diagram of the UNet architecture.
UNet Architecture |
After implementing the UNet, we can look at what our dataset looks like. For the dataset, we need to generate training data pairs $(z, x)$, where each $x$ is an MNIST digit, and each $z$ is $x$ with some added noise. We use the following equation to determine $z$: $$z = x + \sigma\epsilon, \; \epsilon \sim \mathcal{N}(0, \mathbf{I})$$ Here, we let $\sigma \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0 \}$, which controls the strength of the noise; a higher $\sigma$ value means more noise is added. Below is a visualization of different $\sigma$ values across different digits.
$\sigma = 0.0$ | $\sigma = 0.2$ | $\sigma = 0.4$ | $\sigma = 0.5$ | $\sigma = 0.6$ | $\sigma = 0.8$ | $\sigma = 1.0$ |
---|---|---|---|---|---|---|
During the training process, the model attempts to denoise noisy MNIST digits $z$ generated by applying noise with $\sigma = 0.5$ to clean digits $x$. For the UNet, we use a hidden dimension $D = 128$ and optimize the mean squared error loss. We train with a batch size of $256$ for $5$ epochs, using the Adam optimizer with a learning rate of $1 \times 10^{-4}$. A sketch of this setup is below, followed by a plot of the training loss at each gradient descent step.
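(The `Unet` class name and its constructor arguments here are placeholders for the architecture diagrammed above.)

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = Unet(in_channels=1, num_hiddens=128).cuda()  # hypothetical class
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in loader:
        x = x.cuda()
        z = x + 0.5 * torch.randn_like(x)  # noisy input, sigma = 0.5
        loss = F.mse_loss(model(z), x)     # predict the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```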
Below are some results from the model after the first and fifth epoch.
Input | Noisy, $\sigma = 0.5$ | Output |
---|---|---|
Input | Noisy, $\sigma = 0.5$ | Output |
---|---|---|
We can also see how our denoiser performs on $\sigma$ values that it wasn't trained on.
Noisy with $\sigma = 0.0$ | Denoised Image |
---|---|
Noisy with $\sigma = 0.2$ | Denoised Image |
Noisy with $\sigma = 0.4$ | Denoised Image |
Noisy with $\sigma = 0.5$ | Denoised Image |
Noisy with $\sigma = 0.6$ | Denoised Image |
Noisy with $\sigma = 0.8$ | Denoised Image |
Noisy with $\sigma = 1.0$ | Denoised Image |
Up until now, our UNet model only predicted the clean image. In this part, we make a slight change: instead of predicting the clean image, we predict the noise $\epsilon$ that was added. Eventually, we want to sample pure noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ and generate a realistic image $x$ by iteratively denoising. To iteratively denoise, we will use timesteps again, similar to previous sections. Using the equation $$ x_t = \sqrt{\overline{\alpha}_t}x_0 + \sqrt{1 - \overline{\alpha}_t}\epsilon \; , \; \epsilon \sim \mathcal{N}(0, \mathbf{I}) $$ we want to generate a noisy image $x_t$ from $x_0$ for some timestep $t \in \{0, 1, \dots, T\}$. When $t = 0$, $x_t$ is a clean image; when $t = T$, $x_t$ is pure noise. For $t \in \{1, \dots, T - 1 \}$, $x_t$ is a linear combination of the clean image and noise. The derivations for $\beta_t$, $\alpha_t$, and $\overline{\alpha}_t$ can be found in the DDPM paper. For our use case, we can set $T = 300$ since our dataset is simple.
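A sketch of the schedule; the linear $\beta_t$ endpoints below follow common DDPM defaults and are an assumption:

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T)          # linear beta schedule (assumed)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)  # the abar_t values
```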
Because the variance of $x_t$ varies with $t$, we cannot simply feed $x_t$ into our UNet. We need to condition our input on $t$ by adding fully connected blocks to our UNet. Below is an updated diagram of our UNet architecture.
Conditional UNet Architecture |
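One possible implementation of these blocks is sketched below; how the output is injected into the UNet (here, as a per-channel modulation of intermediate feature maps) is an assumption about the architecture:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps the normalized timestep t/T to a per-channel vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, t):
        # (N, 1) -> (N, C, 1, 1) so it broadcasts over feature maps.
        return self.net(t)[:, :, None, None]
```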
To train the model, we pick a random image from the training set, a random $t$, and train the denoiser to predict the noise in $x_t$. It follows the algorithm below.
For training, we will train for $20$ epochs using a batch size of $128$. For the UNet, we will use a hidden dimension $D = 64$. We will also use the Adam optimizer with an initial learning rate of $1 \times 10^{-3}$, along with an exponential learning rate decay scheduler with $\gamma = 0.1^{\frac{1}{20}}$. A sketch of this loop is below, followed by the training loss curve for the time-conditioned UNet model.
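(`TimeConditionalUnet` is a placeholder name; `alphas_cumprod`, `T`, and `loader` are as defined above, with the batch size changed to $128$.)

```python
model = TimeConditionalUnet(in_channels=1, num_hiddens=64).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.1 ** (1 / 20))
abar = alphas_cumprod.cuda()

for epoch in range(20):
    for x, _ in loader:
        x = x.cuda()
        t = torch.randint(0, T, (x.shape[0],), device=x.device)
        eps = torch.randn_like(x)
        a = abar[t][:, None, None, None]
        xt = torch.sqrt(a) * x + torch.sqrt(1 - a) * eps  # forward process
        # Predict the noise, conditioned on the normalized timestep.
        loss = F.mse_loss(model(xt, t[:, None].float() / T), eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()
```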
We can also sample from our model, similar to previous sections. To sample, we use the following algorithm.
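A sketch of the sampler, following the DDPM update (the indexing conventions here are simplified):

```python
import math

@torch.no_grad()
def sample(model, n=40):
    x = torch.randn(n, 1, 28, 28).cuda()
    for t in range(T - 1, -1, -1):
        t_vec = torch.full((n, 1), t / T, device=x.device)
        eps = model(x, t_vec)  # predicted noise at this timestep
        a, ab, b = alphas[t].item(), alphas_cumprod[t].item(), betas[t].item()
        # Remove the predicted noise, then add fresh noise back in
        # (except at the final step).
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - b / math.sqrt(1 - ab) * eps) / math.sqrt(a) + math.sqrt(b) * z
    return x
```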
Below are the results from sampling the model after each epoch. Feel free to hover over the images to see a GIF of the sampling process!
Epoch 1 | Epoch 5 | Epoch 10 | Epoch 15 | Epoch 20 |
---|---|---|---|---|
To better control the image generation, we can condition our UNet on the digit class $\{0, \dots, 9\}$, represented as a one-hot vector. We will add two additional fully-connected blocks to feed in the class. Moreover, we will implement dropout: 10% of the time, we drop the class conditioning by setting the one-hot vector to $0$, so that our UNet still works even when it isn't conditioned on a class.
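A sketch of this class-conditioning dropout:

```python
def drop_labels(c_onehot, p_uncond=0.1):
    # Zero out the one-hot vectors for ~10% of samples so the model also
    # learns an unconditional estimate (needed for CFG at sampling time).
    keep = (torch.rand(c_onehot.shape[0], 1, device=c_onehot.device)
            > p_uncond).float()
    return c_onehot * keep
```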
To train, we will use the following algorithm. The algorithm is similar to the training algorithm from before, with the added step of computing the one-hot vector for each image.
We will use the same training hyperparameters as for the time-conditioned UNet. Below is the training loss curve.
To sample, we will use the same technique from part A of this project. We saw before that class-conditional results aren't good unless we use classifier-free guidance, so we will use CFG with $\gamma = 5$.
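Inside the sampling loop, the noise estimate becomes the CFG combination of a class-conditioned prediction and an unconditioned one. A sketch:

```python
# c is the one-hot class vector; a zero vector gives the unconditional estimate.
eps_c = model(x, t_vec, c)
eps_u = model(x, t_vec, torch.zeros_like(c))
eps = eps_u + 5.0 * (eps_c - eps_u)  # gamma = 5
```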
Here are some results sampled after each epoch. Again, feel free to hover over the images to see a GIF of the sampling process!
Epoch 1 | Epoch 5 | Epoch 10 | Epoch 15 | Epoch 20 |
---|---|---|---|---|
I really enjoyed learning about how diffusion models work. For project 5A, my favorite part was being able to generate my own images from scratch and trying different prompts to see what works best! For project 5B, my favorite part was implementing the model myself and seeing how the images go from pure noise to something clean at each and every timestep.