Project 5: Diffusion Models

1.1 Implementing the Forward Process

In this section, we implement the forward process of diffusion models using the formula:

$$ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon \quad \text{where}~ \epsilon \sim \mathcal{N}(0, \mathbf{I}) $$

Key variables:

  - \(x_0\): the clean image.
  - \(x_t\): the noisy image at timestep \(t\).
  - \(\bar\alpha_t\): the cumulative product of the noise schedule, close to 1 for small \(t\) and close to 0 for large \(t\).
  - \(\epsilon\): noise drawn from a standard Gaussian.

Steps:

  1. Add varying levels of noise to the Berkeley Campanile image.
  2. Generate noisy images for \(t=250, 500, 750\).
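The forward step above can be sketched in a few lines, assuming `alphas_cumprod` is a precomputed tensor of \(\bar\alpha_t\) values (the names here are illustrative):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # eps ~ N(0, I)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return xt, eps
```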
Berkeley Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750

1.2 Classical Denoising

In this section, Gaussian blur is applied to the noisy images generated by the forward process to see how well classical filtering removes the noise. Steps:

  1. Generate noisy images at \(t=250, 500, 750\) using the forward process.
  2. Apply Gaussian blur to each noisy image.
  3. Compare each blurred result with its noisy input.

Noisy vs. Gaussian Blur Denoising Campanile at \(t=250\)
Noisy vs. Gaussian Blur Denoising Campanile at \(t=500\)
Noisy vs. Gaussian Blur Denoising Campanile at \(t=750\)
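The classical baseline can be sketched with a depthwise Gaussian convolution in plain PyTorch (kernel size and sigma are illustrative choices, not the project's exact settings):

```python
import torch
import torch.nn.functional as F

def gaussian_blur(img, kernel_size=5, sigma=1.5):
    """Depthwise 2D Gaussian blur; img has shape (N, C, H, W)."""
    half = kernel_size // 2
    x = torch.arange(kernel_size, dtype=torch.float32) - half
    k1d = torch.exp(-x ** 2 / (2 * sigma ** 2))
    k1d = k1d / k1d.sum()                     # normalize the 1-D kernel
    k2d = torch.outer(k1d, k1d)               # separable 2-D kernel
    c = img.shape[1]
    kernel = k2d.repeat(c, 1, 1, 1)           # one kernel per channel
    return F.conv2d(img, kernel, padding=half, groups=c)
```

Blurring averages neighboring pixels, so it suppresses the high-frequency noise but also destroys image detail, which is why it performs poorly at large \(t\).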

1.3 One-Step Denoising

In this section, the goal is to denoise an image in a single step: a pretrained UNet predicts the noise \(\epsilon\), and the original image is recovered by inverting the forward-process formula:

$$ x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon $$

Steps:

  1. Generate noisy images using the forward() function.
  2. Use a UNet model to predict the noise \( \epsilon \).
  3. Reconstruct the original image \( x_0 \) using the formula:

$$ x_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t} \epsilon}{\sqrt{\bar\alpha_t}} $$
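The reconstruction formula translates directly to code; `eps_pred` stands in for the UNet's noise prediction:

```python
import torch

def one_step_denoise(xt, eps_pred, t, alphas_cumprod):
    """Estimate x0 by inverting the forward equation:
    x0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)."""
    abar = alphas_cumprod[t]
    return (xt - (1 - abar).sqrt() * eps_pred) / abar.sqrt()
```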

Noisy Campanile vs. One-Step Denoised Campanile at t=250
Noisy Campanile vs. One-Step Denoised Campanile at t=500
Noisy Campanile vs. One-Step Denoised Campanile at t=750

1.4 Iterative Denoising

In this section, the iterative denoising process is performed by gradually refining the noisy image using the formula:

$$ x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma $$

where \(t' < t\) is the next (less noisy) timestep, \(\alpha_t = \bar\alpha_t / \bar\alpha_{t'}\), \(\beta_t = 1 - \alpha_t\), \(x_0\) is the current clean-image estimate, and \(v_\sigma\) is random noise added back at each step.

Steps:

  1. Create a sequence of timesteps from \(t=990\) to \(t=0\) with a step size of 30.
  2. Iteratively denoise the image using the formula above.
  3. Compare the results of iterative denoising, one-step denoising, and Gaussian blur denoising.
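A sketch of a single update from \(t\) to \(t'\), with the stochastic term \(v_\sigma\) omitted for clarity:

```python
import torch

def iterative_denoise_step(xt, x0, t, tp, abar):
    """One update from timestep t to t' < t (noise term v_sigma omitted).
    Uses alpha_t = abar_t / abar_t' and beta_t = 1 - alpha_t."""
    alpha = abar[t] / abar[tp]
    beta = 1 - alpha
    return (abar[tp].sqrt() * beta / (1 - abar[t]) * x0
            + alpha.sqrt() * (1 - abar[tp]) / (1 - abar[t]) * xt)
```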
Noisy Campanile at t=90

1.5 Diffusion Model Sampling

In this section, we generate images from random noise by applying the iterative denoising process guided by a text prompt.

Steps:

  1. Start from an image of pure random noise.
  2. Run the full iterative denoising loop from the first timestep, conditioned on the text prompt.

Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
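Putting the pieces together, the full sampling loop can be sketched as follows; `model(x, t)` is a stand-in for the text-conditioned UNet's noise prediction:

```python
import torch

def sample(model, abar, timesteps, shape):
    """Generate an image from pure noise with strided iterative denoising.
    `model(x, t)` -> predicted noise eps (illustrative signature)."""
    x = torch.randn(shape)
    for i in range(len(timesteps) - 1):
        t, tp = timesteps[i], timesteps[i + 1]
        eps = model(x, t)
        # one-step estimate of the clean image
        x0 = (x - (1 - abar[t]).sqrt() * eps) / abar[t].sqrt()
        # move from t to the less-noisy timestep t'
        alpha = abar[t] / abar[tp]
        beta = 1 - alpha
        x = (abar[tp].sqrt() * beta / (1 - abar[t]) * x0
             + alpha.sqrt() * (1 - abar[tp]) / (1 - abar[t]) * x)
    return x
```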

1.6 Classifier-Free Guidance (CFG)

Classifier-Free Guidance (CFG) improves image quality by combining conditional and unconditional noise estimates, extrapolating past the conditional estimate when the guidance scale \(\gamma > 1\):

$$ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) $$

Steps:

  1. Run the UNet twice to obtain conditional and unconditional noise estimations.
  2. Combine the estimations using the CFG formula.
  3. Use the enhanced denoising process to generate higher-quality images.
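The combination step is a one-liner; \(\gamma = 7\) here is just an example of a typical guidance scale:

```python
import torch

def cfg(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: eps = eps_u + gamma * (eps_c - eps_u).
    gamma = 1 recovers the plain conditional estimate; gamma > 1 strengthens it."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```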
CFG Sample 1
CFG Sample 2
CFG Sample 3
CFG Sample 4
CFG Sample 5

1.7 Image-to-Image Translation

1.7.1 Editing Hand-Drawn and Web Images

The same noise-then-denoise procedure is applied to hand-drawn sketches and images from the web. Steps:

  1. Add noise to the original image using the forward function.
  2. Denoise the image while preserving key features.
  3. Test different noise levels (\(i_{start}=1, 3, 5, 7, 10, 20\)).
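The starting point of each edit can be sketched as below: noise the input to the timestep indexed by \(i_{start}\) in the strided schedule, then hand it to the iterative denoiser (function names are illustrative):

```python
import torch

def sdedit_init(x0, i_start, timesteps, abar):
    """Noise the input image to timesteps[i_start]; denoising starts there.
    Larger i_start -> less noise added -> the edit stays closer to the input."""
    t = timesteps[i_start]
    eps = torch.randn_like(x0)
    xt = abar[t].sqrt() * x0 + (1 - abar[t]).sqrt() * eps
    return xt, t
```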
Each test image edited at \(i_{start} = 1, 3, 5, 7, 10, 20\), shown alongside the original image (three image sets).

1.7.2 Inpainting

Steps:

  1. Initialize the noisy image and apply a mask.
  2. Replace the masked region with noise and preserve the unmasked region.
  3. Iteratively refine the masked region to generate a complete image.
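The key constraint, applied after every denoising update, can be sketched as (argument names are illustrative):

```python
import torch

def inpaint_step(x_denoised, x_orig, t, mask, abar):
    """After each denoising update, force the region outside the mask
    (mask == 0) back to the appropriately noised original image."""
    noised = abar[t].sqrt() * x_orig + (1 - abar[t]).sqrt() * torch.randn_like(x_orig)
    return mask * x_denoised + (1 - mask) * noised
```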
Original Image
Mask
To Replace
Inpainting Result

Original Image
Mask
To Replace
Inpainting Result

1.7.3 Text-Conditional Image-to-Image Translation

In this section, specific text prompts are used to guide the image generation process. The noise level controls how much of the original image's features are retained. Steps:

  1. Add noise to the original image using the forward function.
  2. Denoise the image using a text prompt to guide the generation.
  3. Test the effect of different noise levels (\(1, 3, 5, 7, 10, 20\)).
Text-conditional edits of each test image at noise levels 1, 3, 5, 7, 10, 20, shown alongside the original image (three image sets, including the Campanile).

1.8 Visual Anagrams

In this section, we create visual anagrams by averaging noise estimations from two different prompts, one for the image and one for its flipped version.

Key formulas:

$$ \epsilon_1 = \text{UNet}(x_t, t, p_1), \quad \epsilon_2 = \text{flip}\big(\text{UNet}(\text{flip}(x_t), t, p_2)\big), \quad \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} $$

  1. Start with a random noisy image.
  2. Apply prompts like "An Oil Painting of an Old Man" and "An Oil Painting of People around a Campfire."
  3. Combine noise estimations to iteratively refine the image.
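The combined noise estimate can be sketched as below; `unet(x, t, emb)` is an illustrative stand-in for the prompt-conditioned UNet:

```python
import torch

def anagram_eps(unet, xt, t, emb1, emb2):
    """Average the noise estimate for prompt 1 with the un-flipped noise
    estimate for prompt 2 computed on the vertically flipped image."""
    e1 = unet(xt, t, emb1)
    e2 = torch.flip(unet(torch.flip(xt, dims=[-2]), t, emb2), dims=[-2])
    return (e1 + e2) / 2
```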
Visual Anagram: "an oil painting of people around a campfire" / flipped: "an oil painting of an old man"
Visual Anagram: "a lithograph of waterfalls" / flipped: "a lithograph of a skull"
Visual Anagram: "a photo of a man" / flipped: "a photo of a dog"

1.9 Hybrid Images

In this section, we generate hybrid images by combining the low-frequency and high-frequency components of two images based on different prompts.

Key formulas:

$$ \epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2), \quad \text{where}~ f_{\text{highpass}}(x) = x - f_{\text{lowpass}}(x) $$

Steps:

  1. Generate low-frequency components using Gaussian blur on \( \epsilon_1 \).
  2. Generate high-frequency components by subtracting Gaussian blur from \( \epsilon_2 \).
  3. Combine both components and use the diffusion model to update the image.
  4. Repeat multiple iterations to refine the hybrid image.
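The combination of the two noise estimates can be sketched as follows; a box filter stands in here for the Gaussian blur actually used as the low-pass filter:

```python
import torch
import torch.nn.functional as F

def lowpass(x, k=9):
    """Box-filter stand-in for the Gaussian low-pass filter; x is (N, C, H, W)."""
    c = x.shape[1]
    kernel = torch.full((c, 1, k, k), 1.0 / (k * k))
    return F.conv2d(x, kernel, padding=k // 2, groups=c)

def hybrid_eps(e1, e2, k=9):
    """eps = f_lowpass(eps1) + f_highpass(eps2)."""
    return lowpass(e1, k) + (e2 - lowpass(e2, k))
```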
Hybrid Image: a skull and a waterfall
Hybrid Image: the Amalfi coast and a hipster barista
Hybrid Image: a snowy mountain village and people around a campfire