Description
1. Denoising diffusion models for generation
The first conceptual step in setting up a diffusion model is to choose the easy-to-sample distribution that we want to have at the end of the forward diffusion. For this, we choose m i.i.d. fair coin tosses (here, a fair coin has a 50% chance of being +1 and a 50% chance of being −1) arranged into a vector.
Next, we need to choose a way to incrementally degrade the images. Let Y0 = x, whatever image sample we want to start with. At diffusion stage t, we generate Yt from Yt−1 by independently flipping each pixel of Yt−1 with probability δ, where δ is a small positive number.
It turns out that this process of rare pixel flips can be reinterpreted for easier analysis. For the j-th pixel at diffusion stage t, the process can alternatively be viewed as first flipping an independent coin Rt[j] with probability 2δ of coming up +1 and then, if Rt[j] = +1, replacing Yt−1[j] with a freshly drawn independent fair coin Ft[j] that is equally likely to be −1 or +1. If Rt[j] ≠ +1, we leave that pixel alone, i.e. Yt[j] = Yt−1[j].
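The two views are distributionally identical per pixel: replacing with a fresh fair coin with probability 2δ flips the pixel with net probability 2δ · (1/2) = δ. A quick Monte Carlo sketch (the value of δ and sample count are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
delta, n = 0.1, 200_000

# View 1: flip the pixel with probability delta.
y = np.ones(n)
flip = rng.random(n) < delta
y1 = np.where(flip, -y, y)

# View 2: with probability 2*delta, replace the pixel with a fresh
# fair coin; otherwise leave it alone.
y = np.ones(n)
replace = rng.random(n) < 2 * delta
fresh = rng.choice([-1.0, 1.0], size=n)
y2 = np.where(replace, fresh, y)

# Both views flip the pixel with the same probability delta.
print(np.mean(y1 == -1), np.mean(y2 == -1))
```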
(a) We need to verify that if we diffuse for sufficiently many stages T, the resulting distribution is close to that of m i.i.d. fair coins. Show that the probability that pixel j has been replaced at some point by an independent fair coin by time T goes to 1 as T → ∞.
(HINT: It might be helpful to look at the probability that this has not happened…)
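Following the hint, the probability that pixel j has never been replaced after T stages is (1 − 2δ)^T, which decays to 0. A small numeric sketch (parameter values are illustrative):

```python
import numpy as np

delta, T = 0.05, 200
# Each stage, pixel j is replaced with probability 2*delta, so
# P(pixel j never replaced by stage T) = (1 - 2*delta)**T -> 0.
p_never = (1 - 2 * delta) ** T
print(p_never)  # vanishingly small

# Monte Carlo check over many independent pixels.
rng = np.random.default_rng(1)
never = np.all(rng.random((T, 20_000)) >= 2 * delta, axis=0)
print(never.mean())
```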
(b) To efficiently do diffusion training, we need a way to be able to quickly sample a realization of Yt starting from Y0 = xi. Give a procedure to sample a realization of Yt given Y0 without having to generate Y1,Y2,…,Yt−1.
(HINT: This should involve flipping at most two (potentially biased) coins for each pixel.)
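One possible procedure matching the hint can be sketched as follows (the function name and parameter values are illustrative, not part of the assignment):

```python
import numpy as np

def sample_yt(y0, t, delta, rng):
    # With probability p = 1 - (1 - 2*delta)**t the pixel has been
    # replaced at least once by time t, in which case its value is a
    # fresh fair coin; otherwise it still equals Y0.
    p = 1 - (1 - 2 * delta) ** t
    replaced = rng.random(y0.shape) < p          # first (biased) coin
    fresh = rng.choice([-1, 1], size=y0.shape)   # second (fair) coin
    return np.where(replaced, fresh, y0)

rng = np.random.default_rng(2)
y0 = np.ones(200_000, dtype=int)
yt = sample_yt(y0, t=10, delta=0.05, rng=rng)
# P(Yt[j] = +1) should equal (1 + (1 - 2*delta)**t) / 2 here.
print((yt == 1).mean())
```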
(c) For the reverse diffusion process that will be used during image generation and is being learned during training, our goal is to approximate P(Yt−1|Yt) with a neural net that has learnable parameters θ. Suppose I give you a neural net whose input is a binary image Yt and whose output is m real numbers that could each in principle be from −∞ to +∞ (for example, these could be the outputs of a linear layer). Which of the following nonlinear activation functions would be most appropriate to convert them into a probability that we could use to sample whether the pixel in question should be a +1?
⃝ Sigmoid
⃝ ReLU max(0,x)
(d) The goal of training is to approximate a probability distribution for random denoising. However, we do not actually have access to P(Yt−1|Yt) and decide to use P(Yt−1|Yt,Y0) instead as a proxy.
What is P(Yt−1[j] = +1|Yt = y,Y0 = x)?
(HINT: What is the distribution for Yt−1[j] given Y0[j]?)
(e) Let Qt(Yt) denote the (conditional) probability distribution (on whether each pixel is +1) output by our neural net with the nonlinearity applied. For training the denoising diffusion model q(Yt−1|Yt), we use the SGD loss DKL(P(Yt−1|Yt, Y0 = xi) || Qt(Yt)), where xi is the randomly drawn training image, t is a random time drawn from 1 to T, and Yt is a randomly sampled realization of the forward diffusion at time t starting from the image xi at time 0. This is a loss on the vector of probabilities output by Qt(Yt), and it can be written as a sum over the m entries of that vector.
Given what you know about KL Divergence, what does this loss penalize most strongly? What does this loss look like at t = 1 in particular?
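For reference, the loss being summed over pixels is a KL divergence between Bernoulli distributions; a sketch (the helper name is illustrative):

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    # D_KL(Bern(p) || Bern(q)), summed over the m pixel entries.
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

p = np.array([0.99, 0.5, 0.01])  # target probabilities per pixel
# Mild mismatch vs. putting large mass on an outcome the target
# considers nearly impossible:
print(bernoulli_kl(p, np.array([0.9, 0.5, 0.1])))
print(bernoulli_kl(p, np.array([0.99, 0.5, 0.9])))
```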
2. Diffusion Models
In the previous question we considered sampling from a discrete distribution. Let’s now see how iteratively adding Gaussian noise to a data point leads to a noisy sequence, and how the reverse process refines noise to generate realistic samples.
The classes of generative models we’ve considered so far (VAEs, GANs) typically introduce some sort of bottleneck (latent representation z) that captures the essence of the high-dimensional sample space (x). An alternate view of representing probability distributions p(x) is to reason about the score function, i.e. the gradient of the log probability density function, ∇x log p(x).
Given a data point sampled from a real data distribution x0 ∼ q(x), let us define a forward diffusion process that iteratively adds a small amount of Gaussian noise to the sample in T steps, producing a sequence of noisy samples x1, …, xT.
q(xt|xt−1) = N(xt; √(1 − βt) xt−1, βt I),   q(x1:T|x0) = ∏_{t=1}^{T} q(xt|xt−1)   (5)
The data sample x0 gradually loses its distinguishable features as the step t becomes larger. Eventually, as T → ∞, the distribution of xT approaches an isotropic Gaussian. (You can assume x0 is Gaussian.)
The generative model is therefore the reverse diffusion process: we sample noise from an isotropic Gaussian and iteratively refine it toward a realistic sample by reasoning about q(xt−1|xt).
(a) Anytime Sampling from Intermediate Distributions
Given x0 and the stochastic process in eq. (5), show that there exists a closed form distribution for sampling directly at the tth time-step of the form
q(xt|x0) = N(xt; √ᾱt x0, (1 − ᾱt) I),  where αs = 1 − βs and ᾱt = ∏_{s=1}^{t} αs.
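The closed form can be sanity-checked numerically by comparing the iterative process of eq. (5) against direct sampling (the β schedule, starting value, and step below are arbitrary illustrations, with x0 a fixed scalar for simplicity):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50
betas = np.linspace(1e-3, 0.05, T)  # arbitrary illustrative schedule
alphas = 1 - betas
alpha_bar = np.cumprod(alphas)

x0, n = 1.5, 200_000
t = 30  # 1-indexed diffusion step

# Iterative forward process: apply eq. (5) t times.
x = np.full(n, x0)
for s in range(t):
    x = np.sqrt(1 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(n)

# Closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
mean_cf = np.sqrt(alpha_bar[t - 1]) * x0
var_cf = 1 - alpha_bar[t - 1]
print(x.mean(), mean_cf)
print(x.var(), var_cf)
```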
(b) Reversing the Diffusion Process
Reversing the diffusion process (going from noise back to data) would allow us to sample from the real data distribution. In particular, we want to draw samples from q(xt−1|xt). Show that, given x0, the reverse conditional probability distribution is tractable and given by
q(xt−1|xt, x0) = N(xt−1; μ̃(xt, x0), β̂t I)
(Hint: Use Bayes’ rule on eq. (5), assuming that x0 is drawn from Gaussian q(x).)
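As a scalar-case numerical sanity check, one can condition the jointly Gaussian pair (xt−1, xt) given x0 directly and compare against the standard DDPM closed-form posterior (the formulas below are the known expressions the derivation arrives at; the β schedule and values are arbitrary illustrations):

```python
import numpy as np

T = 50
betas = np.linspace(1e-3, 0.05, T)  # arbitrary illustrative schedule
alphas = 1 - betas
abar = np.cumprod(alphas)

t = 30          # 1-indexed; abar[t-1] is abar_t, abar[t-2] is abar_{t-1}
c, xt = 1.2, 0.4

# Joint Gaussian of (x_{t-1}, x_t) given x_0 = c, then condition on x_t.
m1, v1 = np.sqrt(abar[t - 2]) * c, 1 - abar[t - 2]
m2, v2 = np.sqrt(abar[t - 1]) * c, 1 - abar[t - 1]
cov = np.sqrt(alphas[t - 1]) * v1
mu_cond = m1 + cov / v2 * (xt - m2)
var_cond = v1 - cov**2 / v2

# Standard DDPM closed form for comparison.
beta_hat = (1 - abar[t - 2]) / (1 - abar[t - 1]) * betas[t - 1]
mu_tilde = (np.sqrt(alphas[t - 1]) * (1 - abar[t - 2]) * xt
            + np.sqrt(abar[t - 2]) * betas[t - 1] * c) / (1 - abar[t - 1])
print(mu_cond, mu_tilde)
print(var_cond, beta_hat)
```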
3. TinyML – Early Exit
As models get deeper and deeper, we spend a lot of compute on inference, passing each batch through the entirety of a deep model.
Early exit comes from the idea that making predictions is computationally easier for some inputs than for others. These “easier” inputs do not need to be processed through the entire model before a prediction can be made with reasonable confidence, so they exit early, while examples that are more difficult or more variable in structure are processed through more layers before a reasonably confident prediction can be made.
In short, we give samples the option to be classified early, saving the extra compute that would have been spent on full inference.
Early exit thus saves compute and decreases inference latency, all while maintaining a sufficient standard of accuracy.
(a) We consider a toy model of early exit: a series of cascading probability distributions.
We sample from each distribution in sequence and add the result to a running partial sum of all previously sampled values. Denote the sample from the ith distribution as Xi; each Xi is drawn independently from a Gaussian whose variance decreases in i. Denote the kth partial sum Yk = X1 + ⋯ + Xk and the final sum Y = limk→∞ Yk. Leave your answers in terms of Φ, the CDF of the standard normal.
i. Calculate P(Y ≤ 0|Yk = M)
That is, if the value of our partial sum after the first k samples is M, what is the probability that the final sum will be less than 0? The kth partial sum can be seen as the value of the feature map at the kth layer of the neural network. Each sequential layer provides less new information, since each sequential distribution has a smaller variance.
ii. Calculate P(Y ≤ 0|Y1 = 5). Speculate why, if we have only sampled the first distribution but got a 5, we can be pretty sure that the final value will not be less than 0.
iii. Calculate P(Y ≤ 0|Y40 = 0.00001). Speculate why, even though we are so close to 0, after k = 40 we are very sure that the final value will not be less than 0.
iv. How does this relate to early exit?
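As a sketch of the computation in part i, assume for illustration that Xi ∼ N(0, (1/2)^(i−1)); the problem’s actual variances may differ, but the conditioning logic is the same:

```python
import math

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_final_below_zero(k, M):
    # Given Y_k = M, the remaining sum Y - Y_k is zero-mean Gaussian
    # with variance sum_{i>k} (1/2)**(i-1) = 2**(1-k) under the
    # assumed illustrative variances, so
    # P(Y <= 0 | Y_k = M) = Phi(-M / tail_std).
    tail_std = math.sqrt(2.0 ** (1 - k))
    return phi(-M / tail_std)

print(p_final_below_zero(1, 5))         # part ii: already very unlikely
print(p_final_below_zero(40, 0.00001))  # part iii: tail variance is tiny
```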
(b) Please complete the hw12_early_exit.ipynb notebook on early exit, then answer the following questions. Training should take less than 10 minutes per model. Please select a GPU runtime on Colab. You should work on the analytic portion while training.
i. How does the baseline ResNet perform on the validation set?
ii. What is the validation accuracy of the early exit model?
iii. How often does the model exit early? How confident is it when it exits early? How confident is it when the sample passes through the entire model?
iv. What is the MAC ratio between the baseline model and the early exit model? Can you find a threshold at which this ratio changes sharply?
v. What is the validation accuracy of the smaller ResNet model?
vi. Find the minimum threshold at which the early exit accuracy is better than the small net accuracy.
Please report the findings of your hyperparameter search.
4. Reinforcement Learning from Human Feedback
As the next chapter of our “Transformer for Summarization” series, we will delve into the application of reinforcement learning from human feedback (RLHF) for natural language processing tasks, as introduced in the InstructGPT paper (https://arxiv.org/pdf/2203.02155.pdf). Building on the foundations laid in our previous assignments, we will implement the RLHF algorithm to tackle the news summarization task.
First, we’ll explore the application of policy gradients for training sequence generation models. At every generation step, a sequence generator produces a probability distribution over the next token, given the prior tokens (and the source sequence, if it is a sequence-to-sequence model):
Pθ(yi|y1,…,yi−1)
Considering the sequence generation model as an RL agent’s policy network, we can represent it as follows:
State Prior tokens y1,…,yi−1, along with the source sentence x in a sequence-to-sequence context.
Action Generating the next token yi.
Action space The entire token vocabulary.
Transition By producing token yi, the state transitions from y1,…,yi−1 to y1,…,yi.
Agent The policy network Pθ, which outputs a probability distribution over the vocabulary (action space) at each step.
Upon generating a sequence y = [y1,…,yn], it is evaluated (we will discuss evaluation methods later) to obtain a reward value r(y) (or r(x,y) in sequence-to-sequence generation).
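As a toy sketch of treating the generator as a policy, the REINFORCE surrogate loss −r(y) Σi log Pθ(yi|y<i) can be computed as follows (the toy policy and reward here are illustrative stand-ins, not the assignment’s models):

```python
import numpy as np

rng = np.random.default_rng(4)
vocab_size, seq_len = 5, 4

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy "policy network": fixed logits per position, standing in for a
# real transformer conditioned on the prior tokens.
logits = rng.standard_normal((seq_len, vocab_size))

# Sample a sequence y and accumulate log P_theta(y_i | y_<i).
log_prob, y = 0.0, []
for i in range(seq_len):
    p = softmax(logits[i])
    token = rng.choice(vocab_size, p=p)
    y.append(token)
    log_prob += np.log(p[token])

reward = 1.0 if y[0] == y[-1] else -1.0  # arbitrary toy reward r(y)
# REINFORCE surrogate loss: differentiating it w.r.t. theta yields
# -r(y) * grad log P_theta(y).
loss = -reward * log_prob
print(y, reward, loss)
```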
(a) Prove that the policy gradient ∇θL(θ) for the sequence-to-sequence generation task is given by:
∇θL(θ) = −E_{y∼Pθ(·|x)} [ r(x, y) ∑_{i=1}^{n} ∇θ log Pθ(yi|y1, …, yi−1, x) ]
(b) In the last part, we established that the loss function
L(θ) = −E_{y∼Pθ(·|x)} [ r(x, y) ]
can be differentiated to obtain the policy gradients for sequence-to-sequence generation.
What is the relationship between the policy gradient loss and the cross-entropy loss in supervised sequence-to-sequence training?
The remainder of this assignment involves coding. Please follow the instructions in this notebook to learn how to implement RLHF for summarization. Once you have completed the coding portion, download submission_log.json and submit it to “Homework 12 (Code) (RLHF)” in Gradescope.
5. Homework Process and Study Group
We also want to understand what resources you find helpful and how much time homework is taking, so we can change things in the future if possible.
(a) What sources (if any) did you use as you worked through the homework?
(b) If you worked with someone on this homework, who did you work with?
List names and student IDs. (In the case of a homework party, you can also just describe the group.)
(c) Roughly how many total hours did you work on this homework? Write it down here where you’ll need to remember it for the self-grade form.
Contributors:
• Anant Sahai.
• Kumar Krishna Agrawal.
• Liam Tan.
• Linyuan Gong.



