My Portfolio

Part A

Part 0: Setup

All the embeddings I generated: ['a high quality picture', 'an oil painting of a snowy mountain village', 'a photo of the amalfi coast', 'a photo of a man', 'a photo of a hipster barista', 'a photo of a puppy', 'a photo of a cat', 'a photo of dog fighting a cat', 'an oil painting of people around a campfire', 'an oil painting of an old man', 'a lithograph of waterfalls', 'a lithograph of a skull', 'a lithograph of book', 'a lithograph of a mountain', 'a lithograph of cat', 'a lithograph of a river', 'a man wearing a hat', 'a high quality photo', 'a rocket ship', 'a pencil', 'a photo of New York', 'a photo of mountains', 'a painting of sunflowers', 'a painting of village', '']

Here are the prompts I used for Part 0: prompts = ['a photo of a puppy', 'a photo of a cat', 'a photo of dog fighting a cat']

seed = 100

num_inference_steps = 6

num_inference_steps = 20

Comparing the two settings, more inference steps produce images with more color and detail. The two cat images show this clearly: num_inference_steps=6 produces a very blurry cat, while num_inference_steps=20 produces a finely detailed one. Complicated prompts such as 'a photo of dog fighting a cat' need more steps to render correctly. With num_inference_steps=6 there is almost no sign of fighting; with num_inference_steps=20 we can see a fight, but it is not between a dog and a cat.

Part 1: Sampling Loops

1.1 Implementing the Forward Process

Using the formula above, I implemented the function noisy_im = forward(im, t).
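As a rough illustration of what forward does, here is a minimal NumPy sketch. It assumes the standard diffusion forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear beta schedule and the alphas_cumprod array are stand-ins I made up for this example; in the project these values come from the pretrained model's scheduler.

```python
import numpy as np

# Hypothetical noise schedule (assumption): a linear beta schedule over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)


def forward(im, t, rng=None):
    """Return a noisy version of `im` at timestep `t`."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(im.shape)  # eps ~ N(0, I)
    abar = alphas_cumprod[t]
    # Scale the clean image down and add scaled Gaussian noise.
    return np.sqrt(abar) * im + np.sqrt(1.0 - abar) * eps


im = np.zeros((3, 64, 64))  # placeholder image
noisy_im = forward(im, 750, rng=np.random.default_rng(0))
```

Larger t means a smaller abar_t, so the image contributes less and the noise contributes more.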

1.2 Classical Denoising

1.3 One-Step Denoising

1.4 Iterative Denoising

On the ith denoising step we are at t = strided_timesteps[i], and want to get to t = strided_timesteps[i+1] (from more noisy to less noisy).

Using the formula above, we can write the function iterative_denoise(im_noisy, i_start, prompt_embeds, timesteps, display=True).
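The update inside one iteration can be sketched as below. This assumes the standard DDPM posterior-mean update between two strided timesteps and leaves out the added-variance term. The names estimate_x0 and denoise_step are hypothetical helpers for this example, not functions from my notebook.

```python
import numpy as np


def estimate_x0(x_t, eps, abar_t):
    """Estimate the clean image from the noisy image and the predicted noise."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)


def denoise_step(x_t, x0_est, abar_t, abar_prev):
    """Move from timestep t to the less noisy timestep t' (variance term omitted).

    abar_t    : cumulative alpha at the current (noisier) timestep t
    abar_prev : cumulative alpha at the next (cleaner) timestep t'
    """
    alpha = abar_t / abar_prev
    beta = 1.0 - alpha
    coef_x0 = np.sqrt(abar_prev) * beta / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha) * (1.0 - abar_prev) / (1.0 - abar_t)
    # Interpolate between the clean estimate and the current noisy image.
    return coef_x0 * x0_est + coef_xt * x_t
```

The full loop would call the model at each strided timestep to get the noise estimate, then apply these two helpers in turn.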

1.5 Diffusion Model Sampling

To sample from scratch, we call iterative_denoise with i_start = 0 and pass pure random noise as im_noisy.

1.6 Classifier-Free Guidance (CFG)

Combining the formula above with the iterative_denoise function from section 1.4, we can write the function iterative_denoise_cfg.
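The guidance step itself is small. Here is a sketch, assuming the usual CFG combination eps = eps_u + gamma * (eps_c - eps_u); the default scale of 7.0 is only an illustrative value.

```python
import numpy as np


def cfg_noise(eps_uncond, eps_cond, scale=7.0):
    """Combine unconditional and conditional noise estimates.

    scale = 0 gives the unconditional estimate, scale = 1 gives the
    conditional one, and scale > 1 pushes further toward the prompt.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Inside iterative_denoise_cfg, this combined estimate would replace the single noise estimate used in the plain iterative denoising loop.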

1.7 Image-to-image Translation

Run the forward process to get a noisy image, then run iterative_denoise_cfg from starting indices i_start in [1, 3, 5, 7, 10, 20], with the conditional text prompt "a high quality photo".

1.7.1 Editing Hand-Drawn and Web Images

Web Image

Hand-Drawn

1.7.2 Inpainting

Combining the formula above with iterative_denoise_cfg, we can write the inpaint function.
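The extra step that inpainting adds to each iteration can be sketched as follows. I am assuming the convention that the mask is 1 where new content is generated and 0 where the original image is kept; keep_known_pixels is a hypothetical helper name.

```python
import numpy as np


def keep_known_pixels(x_t, noisy_orig, mask):
    """Force pixels outside the mask back to the (noised) original image.

    x_t        : current image in the denoising loop
    noisy_orig : original image noised to the same timestep (via forward)
    mask       : 1 where new content is generated, 0 where the original is kept
    """
    return mask * x_t + (1 - mask) * noisy_orig
```

Applying this after every denoising step means only the masked region is actually generated, while the rest follows the original image at the matching noise level.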

1.7.3 Text-Conditional Image-to-image Translation

Prompt = a rocket ship

Prompt = a photo of mountains

Prompt = a photo of New York

1.8 Visual Anagrams

Combining the algorithm above with iterative_denoise_cfg gives the visual_anagrams function.
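A sketch of the noise estimate used for anagrams, assuming the usual recipe: predict noise for the upright image with one prompt, predict noise for the vertically flipped image with the other prompt, flip that second estimate back, and average. Here predict_noise is a hypothetical callable standing in for the diffusion model, and images are assumed to have shape (C, H, W).

```python
import numpy as np


def anagram_noise(x_t, predict_noise, prompt_a, prompt_b):
    """Average the noise estimate for the upright and the flipped image."""
    eps_a = predict_noise(x_t, prompt_a)
    # Flip vertically (height axis), predict, then flip the estimate back.
    flipped = np.flip(x_t, axis=-2)
    eps_b = np.flip(predict_noise(flipped, prompt_b), axis=-2)
    return (eps_a + eps_b) / 2.0
```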

1.9 Hybrid Images

Combining the algorithm above with iterative_denoise_cfg gives the make_hybrids function.

I used a Gaussian blur as the low-pass filter, and subtracted the low-pass result from the original to get the high-pass filter.
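The filtering can be sketched in plain NumPy as below. The kernel size and sigma are illustrative values, not the ones I used, and gaussian_kernel, lowpass, and hybrid_noise are hypothetical helper names. The two inputs are the noise estimates for the two prompts.

```python
import numpy as np


def gaussian_kernel(size=9, sigma=2.0):
    """1D Gaussian kernel normalized to sum to 1 (size must be odd)."""
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def lowpass(img, kernel):
    """Separable Gaussian blur over the last two axes, with edge padding."""
    pad = len(kernel) // 2
    widths = [(0, 0)] * (img.ndim - 2) + [(pad, pad), (pad, pad)]
    out = np.pad(img, widths, mode="edge")
    blur = lambda v: np.convolve(v, kernel, mode="valid")
    out = np.apply_along_axis(blur, -1, out)  # blur along width
    out = np.apply_along_axis(blur, -2, out)  # blur along height
    return out


def hybrid_noise(eps_low, eps_high, kernel):
    """Low frequencies from one noise estimate, high frequencies from the other."""
    high = eps_high - lowpass(eps_high, kernel)  # high-pass = original - low-pass
    return lowpass(eps_low, kernel) + high
```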

Part B

Part 1: Training a Single-Step Denoising UNet

1.1 Implementing the UNet

Here is the structure of the simple UNet:

1.2 Using the UNet to Train a Denoiser

Using the formula above, we can apply a different noise level to each image.
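A minimal sketch of the noising step, assuming the formula is z = x + sigma * eps with eps drawn from a standard normal. The function name add_noise is hypothetical.

```python
import numpy as np


def add_noise(x, sigma, rng=None):
    """Return z = x + sigma * eps, where eps ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    return x + sigma * rng.standard_normal(x.shape)


x = np.zeros((28, 28))  # placeholder for a digit image
noisy = [add_noise(x, s, rng=np.random.default_rng(0))
         for s in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]]
```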

Visualization of the noising process with sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]

1.2.1 Training

epoch 1

epoch 5

1.2.2 Out-of-Distribution Testing

sigma = 0

sigma = 0.2

sigma = 0.4

sigma = 0.5

sigma = 0.6

sigma = 0.8

sigma = 1.0

The denoiser performs relatively well when the noise level is at or below about 0.5. It starts to struggle when the noise level reaches 0.8 or 1.0.

1.2.3 Denoising Pure Noise

epoch 1

epoch 5

The general pattern when denoising pure noise is that the model always produces a generic shape that could be any digit: a cloud-like blob in the middle of the image.

Q: What relationship, if any, do these outputs have with the training images (e.g., digits 0–9)? A: These outputs (a cloud of pixels) seem to encompass all possible training images.

Q: Why might this be happening? A: I think one reason is that the model was trained on all 10 digits, and since the input is pure noise it carries no information pointing toward any particular digit. The model therefore produces an output that partially matches all 10 digits, so it always predicts a cloud of pixels. With no time conditioning or class conditioning, it settles on an "average point" over all possible outputs.
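The "average point" intuition can be checked numerically under one assumption: that the denoiser is trained with a squared-error loss. In that case, the single output that minimizes the loss over a set of targets is their pixel-wise mean. The random arrays below are stand-ins for digit images, not real data.

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.random((100, 28, 28))  # stand-in for training images


def mse(pred):
    """Mean squared error of one fixed prediction against all targets."""
    return ((targets - pred) ** 2).mean()


mean_img = targets.mean(axis=0)
loss_of_mean = mse(mean_img)
# Best loss achievable by outputting any single training image instead.
best_single = min(mse(t) for t in targets)
```

Here loss_of_mean comes out lower than best_single, which is consistent with the model preferring a blurry average over committing to any one digit.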

Part 2: Training a Flow Matching Model

2.1 Adding Time Conditioning to UNet

Use the structure above to build the time-conditioned UNet.

2.2 Training the UNet

We train the time-conditioned model with the following algorithm:
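The data side of one training step can be sketched as below. This assumes the common flow matching setup where x_0 is Gaussian noise, x_1 is a clean image, x_t = (1 - t) * x_0 + t * x_1, and the regression target is the velocity x_1 - x_0. The function name flow_matching_pair is hypothetical.

```python
import numpy as np


def flow_matching_pair(x1, rng):
    """Build (x_t, t, target) for a batch of clean images x1 of shape (B, C, H, W)."""
    x0 = rng.standard_normal(x1.shape)       # pure noise
    t = rng.random((x1.shape[0], 1, 1, 1))   # one t in [0, 1) per image
    x_t = (1.0 - t) * x0 + t * x1            # point on the straight path
    target = x1 - x0                         # velocity along that path
    return x_t, t, target


batch = np.ones((4, 1, 28, 28))  # placeholder batch
x_t, t, target = flow_matching_pair(batch, np.random.default_rng(0))
```

The UNet would then be trained to predict target from (x_t, t) with a squared-error loss.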

2.3 Sampling from the UNet

We sample from the time-conditioned model with the following algorithm:
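A sketch of the sampling loop, assuming simple Euler integration of the learned velocity field from t = 0 (noise) to t = 1 (image). The velocity argument is a hypothetical callable standing in for the trained UNet, and num_steps=50 is an illustrative value.

```python
import numpy as np


def sample(velocity, shape, num_steps=50, rng=None):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)  # start from pure noise
    for i in range(num_steps):
        t = i / num_steps
        x = x + velocity(x, t) / num_steps
    return x


# Toy velocity field that points straight at a fixed target image.
toy_target = np.full((1, 28, 28), 0.5)
toy_velocity = lambda x, t: (toy_target - x) / (1.0 - t)
out = sample(toy_velocity, toy_target.shape, num_steps=10,
             rng=np.random.default_rng(0))
```

With the toy field the loop lands on toy_target; with the trained UNet it would land on a generated digit.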

epoch 1

epoch 5

epoch 10

2.4 Adding Class-Conditioning to UNet

To add class conditioning, we add two more FCBlocks that take the class-conditioning vector c as input. c is a one-hot vector rather than a single scalar.

Because we still want the UNet to work without class conditioning (recall the classifier-free guidance implemented in Part A), we apply dropout: 10% of the time we drop the class-conditioning vector c by setting it to 0.
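The one-hot encoding and the 10% dropout can be sketched together as follows. The function name class_condition and its parameter names are hypothetical.

```python
import numpy as np


def class_condition(labels, num_classes=10, p_uncond=0.1, rng=None):
    """One-hot encode labels, zeroing each row with probability p_uncond."""
    rng = np.random.default_rng() if rng is None else rng
    one_hot = np.eye(num_classes)[labels]        # shape (B, num_classes)
    keep = rng.random(len(labels)) >= p_uncond   # False means drop the class
    return one_hot * keep[:, None]


labels = np.array([3, 1, 4, 1, 5])
c = class_condition(labels, rng=np.random.default_rng(0))
```

Rows that are all zeros correspond to the unconditional case, which is what classifier-free guidance needs at sampling time.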