Description
1. Coding Question: Summarization (Part II)
Please follow the instructions in this notebook. You will learn how to efficiently enable a Transformer encoder-decoder model to generate sequences. Then, you will fine-tune another Transformer encoder-decoder model, initialized from the pretrained language model T5, on the same task.
Note: this notebook takes 45 minutes to 1 hour to run after you have implemented everything correctly. Please start early.
• Download submission_log.json and submit it to “Homework 11 (Code) (Summarization)” in Gradescope.
• Answer the following questions in your submission of the written assignment:
(a) What are the ROUGE-1, ROUGE-2, and ROUGE-L scores of the model trained from scratch on the summarization task?
(b) In the T5 paper, which optimizer is employed for training T5, and what are the peak learning rates for pretraining and finetuning?
(c) What are the ROUGE-1, ROUGE-2, and ROUGE-L scores of the finetuned model, and how do these scores compare to those of the model trained from scratch?
(d) Provide appropriate prompts for the summarization task, apply the prompts to the example documents, and feed them into an off-the-shelf large language model (e.g., ChatGPT). Specify the model used and share the generated outputs. Compare the outputs with your finetuned model’s outputs qualitatively and identify the main reason for any differences.
2. Comparing Distributions
Divergence metrics provide a principled measure of the difference between a pair of distributions (P, Q). One such example is the Kullback-Leibler divergence, defined as
DKL(P||Q) = Ex∼P[log(P(x)/Q(x))]
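As a quick numerical illustration of this definition and its asymmetry, the sketch below evaluates both directions of the KL divergence for two hypothetical discrete distributions (the specific values are made up for illustration):

```python
import numpy as np

# Two hypothetical discrete distributions over the same 3-outcome support.
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.1, 0.6, 0.3])

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    return float(np.sum(p * np.log(p / q)))

forward = kl(P, Q)  # D_KL(P || Q)
reverse = kl(Q, P)  # D_KL(Q || P)
# The two directions generally differ, so D_KL is not a true distance.
```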
(a) Technically, DKL is not a true distance, since it is asymmetric: in general, DKL(P||Q) ≠ DKL(Q||P).
Give an example of univariate distributions P and Q where DKL(P||Q) ≠ ∞ but DKL(Q||P) = ∞.
(b) From the following plots, identify which of (A, B) corresponds to minimizing the forward vs. reverse KL. Give brief reasoning. Here, only the mean and standard deviation of Q are allowed to vary during the minimization.
3. Continual Learning (Optional)
Run this notebook and answer the questions. We will explore some strategies to mitigate catastrophic forgetting when our neural network model sequentially learns new tasks. Let’s try and compare three classic methods: 1) naive, 2) Elastic Weight Consolidation (EWC), and 3) rehearsal.
(a) Naive approach
i. What do you observe? How much does the network forget from the previous tasks? Why do you think this happens?
ii. (Open-ended question) We are using a CNN. Does an MLP perform better or worse than the CNN? Try it out and report your results.
(b) Elastic Weight Consolidation
i. The hyperparameter λ is underexplored in this assignment. Try different values of λ and report your results.
ii. What is the role of λ? What happens if λ is too small or too large? Explain the results with plasticity and stability of the network.
(c) Rehearsal
i. What would be the pros and cons of rehearsal?
4. Variational AutoEncoders
For this problem we will be using PyTorch to implement the variational autoencoder (VAE) and learn a probabilistic model of the MNIST dataset of handwritten digits. Formally, we observe a sequence of binary pixels x ∈ {0,1}d and let z ∈ Rk denote a set of latent variables. Our goal is to learn a latent variable model pθ(x) of the high-dimensional data distribution pdata(x).
The VAE is a latent variable model with a specific parameterization pθ(x) = ∫ pθ(x,z)dz = ∫ p(z)pθ(x|z)dz. Specifically, the VAE is defined by the following generative process:
p(z) = N(z|0,I) (sample noise from a standard Gaussian)
pθ(x|z) = Bern(x|fθ(z)) (decode the noise to generate a sample resembling the real distribution)
That is, we assume that the latent variables z are sampled from a unit Gaussian distribution N(z|0,I). The latent z are then passed through a neural network decoder fθ(·) to obtain the parameters of the d Bernoulli random variables that model the pixels in each image.
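A minimal sketch of such a decoder, assuming a toy latent dimension k = 10 and d = 784 binarized MNIST pixels (the layer sizes and names here are illustrative, not the ones used in the notebook):

```python
import torch
import torch.nn as nn

k, d = 10, 784  # latent and pixel dimensions (784 = 28x28 MNIST); sizes are illustrative

# Decoder f_theta: maps a latent z in R^k to d Bernoulli parameters in (0, 1).
decoder = nn.Sequential(
    nn.Linear(k, 300),
    nn.ELU(),
    nn.Linear(300, d),
    nn.Sigmoid(),  # squashes logits into valid Bernoulli probabilities
)

z = torch.randn(5, k)       # z ~ N(0, I), batch of 5
probs = decoder(z)          # parameters of Bern(x | f_theta(z))
x = torch.bernoulli(probs)  # sample binary pixels
```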
To learn the parameterized distribution we would like to maximize the marginal likelihood pθ(x). However, computing pθ(x) = ∫ p(z)pθ(x|z)dz is generally intractable, since this requires integrating over all possible values of z ∈ Rk. Instead, we consider a variational approximation to the true posterior:
qϕ(z|x) = N(z | µϕ(x), diag(σϕ2(x)))
In particular, we pass each image x through a neural network that outputs the mean µϕ(x) and diagonal covariance diag(σϕ2(x)) of the multivariate Gaussian distribution that approximates the distribution over the latent variables z given x. The high-level intuition for training the parameters (θ,ϕ) requires considering two expressions:
• Decoding Latents: sample latents from qϕ(z|x) and maximize the likelihood of reconstructing samples x ∼ pdata
• Matching Prior: a Kullback-Leibler (KL) term that constrains qϕ(z|x) to be close to the prior p(z)
Putting these terms together gives us a lower bound on the true marginal log-likelihood, called the evidence lower bound (ELBO):
log pθ(x) ≥ ELBO(x;θ,ϕ) = Eqϕ(z|x)[log pθ(x|z)] − DKL(qϕ(z|x)||p(z))
where the first term on the right is the decoding-latents (reconstruction) term and the second is the matching-prior term.
In this notebook, implement the reparameterization trick in the function sample_gaussian. Specifically, your function should take in the mean m and variance v of the Gaussian and return a sample x ∼ N(m,diag(v)).
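A possible sketch of the reparameterized sampler, assuming m and v are tensors of shape (batch, k):

```python
import torch

def sample_gaussian(m, v):
    """Reparameterization trick: draw x ~ N(m, diag(v)) as m + sqrt(v) * eps.

    m: (batch, k) mean; v: (batch, k) elementwise variance (diagonal covariance).
    The randomness lives entirely in eps ~ N(0, I), so gradients can flow
    through m and v even though the output is a random sample.
    """
    eps = torch.randn_like(m)
    return m + torch.sqrt(v) * eps
```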
Then, implement the negative_elbo_bound loss function.
Note: We ask for the negative ELBO, as PyTorch optimizers minimize the loss function. Further, since we are computing the negative ELBO over a mini-batch of data, make sure to compute the average of the per-sample ELBO. Finally, note that the ELBO itself cannot be computed exactly, since computation of the reconstruction term is intractable. Instead, you should estimate the reconstruction term via Monte-Carlo sampling:
−Eqϕ(z|x)[logpθ(x|z)] ≈ −logpθ(x|z(1))
where z(1) ∼ qϕ(z|x) denotes a single sample from the learned posterior.
The negative_elbo_bound function should return three quantities: the average negative ELBO, the reconstruction loss, and the KL divergence.
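One possible shape of this loss is sketched below. The interface is hypothetical (the notebook's version takes only x and calls the encoder and decoder internally); the closed-form Gaussian KL and single-sample reconstruction follow the derivation above:

```python
import torch
import torch.nn.functional as F

def kl_normal(m, v):
    """Closed-form D_KL( N(m, diag(v)) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(v + m ** 2 - 1.0 - torch.log(v), dim=-1)

def negative_elbo_bound(x, m, v, logits):
    """Average negative ELBO over a mini-batch (hypothetical interface).

    x:      (batch, d) binary pixels
    m, v:   (batch, k) posterior mean / variance from the encoder
    logits: (batch, d) decoder logits for the Bernoulli pixel model,
            computed from a single posterior sample z^(1) ~ q(z|x)
    Returns (nelbo, rec, kl), each averaged over the batch.
    """
    # Monte-Carlo reconstruction term with one sample: -log p(x | z^(1))
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
    kl = kl_normal(m, v)
    nelbo = (rec + kl).mean()
    return nelbo, rec.mean(), kl.mean()
```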
Answer these questions:
(a) Test your implementation by training VAE with
python experiment.py --model vae
Once the run is complete (10000 iterations), report the average values of the following:
• negative ELBO
• KL-Divergence term
• reconstruction loss
(b) Visualize 200 digits (generate a single image tiled in a grid of 10 × 20 digits) sampled from pθ(x)
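The tiling itself can be done with a couple of reshapes. In this sketch the 200 decoded digits are random stand-ins, since the real ones would come from your trained decoder via sigmoid outputs on z ∼ N(0,I):

```python
import torch

# Stand-in for 200 decoded digits of shape (200, 28, 28). In the notebook these
# would be decoder outputs (Bernoulli means) for z ~ N(0, I); random here.
samples = torch.rand(200, 28, 28)

# Tile into a single 10 x 20 grid image of shape (10*28, 20*28):
# index (row, col, i, j) -> pixel (row*28 + i, col*28 + j).
grid = samples.reshape(10, 20, 28, 28).permute(0, 2, 1, 3).reshape(10 * 28, 20 * 28)
```

The resulting `grid` tensor can be saved or shown directly, e.g. with `matplotlib.pyplot.imshow(grid, cmap="gray")`.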
5. Generative Adversarial Networks (Optional)
Unlike VAEs, which explicitly model the data distribution with likelihood-based training, Generative Adversarial Networks (GANs) belong to the family of implicit generative models.
To model high-dimensional data distributions pdata(x) (with x ∈ Rn), define
• a generator Gθ : Rk → Rn
• a discriminator Dϕ : Rn → (0,1)
To obtain samples from the generator, we first sample a k-dimensional random vector z ∼ N(0,I) and return Gθ(z) ∈ Rn. The discriminator is effectively a classifier that judges how realistic the fake images Gθ(z) are compared to real samples from the data distribution x ∼ pdata(x). Because its output is intended to be interpreted as a probability, the last layer of the discriminator is frequently the sigmoid function σ, such that σ(x) ∈ (0,1). Therefore, for logits hϕ(x), the discriminator output is Dϕ(x) = σ(hϕ(x)).
For training GANs, we define learning objectives Ldiscriminator(ϕ;θ) and Lgenerator(θ;ϕ) that are optimized iteratively in two stages with gradient descent. In particular, we take a gradient step to minimize
Ldiscriminator(ϕ;θ) = −Ex∼pdata[log Dϕ(x)] − Ez∼N(0,I)[log(1 − Dϕ(Gθ(z)))]
where the first term scores real data and the second term scores generated data.
Training a GAN can be viewed as solving the following minimax optimization problem, for generator Gθ and discriminator Dϕ:
minθ maxϕ V(Gθ,Dϕ) = Ex∼pdata[log Dϕ(x)] + Ez∼N(0,I)[log(1 − Dϕ(Gθ(z)))]
(a) Vanishing Gradient with Minimax Objective
Rewriting the above loss in terms of the discriminator logits and the sigmoid, we have
Lgenerator(θ;ϕ) = Ez∼N(0,I)[log(1 − σ(hϕ(Gθ(z))))]
Show that the gradient of this loss with respect to θ vanishes when the discriminator output Dϕ(Gθ(z)) ≈ 0. Why is this problematic for training the generator when the discriminator is well-trained at identifying fake samples?
(b) GANs as Divergence Minimization
To build intuition about the training objective, consider the distribution pθ(x) corresponding to:
x = Gθ(z) where z ∼ N(0,I)
• Optimal Discriminator
The discriminator minimizes the loss Ldiscriminator(ϕ;θ) defined above.
For a fixed generator θ, show that the discriminator loss is minimized when
D∗ϕ(x) = pdata(x) / (pdata(x) + pθ(x)).
• Generator Loss
For a fixed generator θ and corresponding optimal discriminator D∗, show that the minimax objective V(G,D∗) satisfies
V (G,D∗) = −log4 + 2DJSD(pdata||pθ)
where DJSD(p||q) is the Jensen-Shannon divergence.
Note: a divergence measures the distance between two distributions p, q. In particular, for distributions p, q with common support X, typically used divergence metrics include
DKL(p||q) = Ex∼p[log(p(x)/q(x))] (Kullback-Leibler divergence)
DJSD(p||q) = (1/2)DKL(p||m) + (1/2)DKL(q||m), where m = (p + q)/2 (Jensen-Shannon divergence)
(c) Training GANs on MNIST
To mitigate vanishing gradients during training, Goodfellow et al. (2014) propose the non-saturating loss
Lnsgenerator(θ;ϕ) = −Ez∼N(0,I)[log Dϕ(Gθ(z))]
For the mini-batch approximation, we use Monte-Carlo estimates of the learning objectives, such that
Ldiscriminator(ϕ;θ) ≈ −(1/m) Σi [log Dϕ(x(i)) + log(1 − Dϕ(Gθ(z(i))))]
Lnsgenerator(θ;ϕ) ≈ −(1/m) Σi log Dϕ(Gθ(z(i)))
for batch size m, and batches of real data x(i) ∼ pdata(x) and noise z(i) ∼ N(0,I). Following these details, implement training for GANs with the above learning objectives by filling in the relevant snippets in gan.py. Test your implementation by running
python experiment.py --model gan
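The alternating updates above can be sketched as follows; the toy generator/discriminator architectures and sizes are illustrative stand-ins for the networks in gan.py:

```python
import torch
import torch.nn as nn

k, n, m = 8, 32, 16  # latent dim, data dim, batch size (illustrative)
G = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, n))
D = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, 1))  # outputs logits h_phi
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()  # applies the sigmoid internally: D_phi = sigma(h_phi)

x_real = torch.randn(m, n)  # stand-in for a batch of real data
z = torch.randn(m, k)       # z^(i) ~ N(0, I)

# Discriminator step: real samples labeled 1, fakes labeled 0;
# detach() keeps this step from updating the generator.
d_loss = bce(D(x_real), torch.ones(m, 1)) + bce(D(G(z).detach()), torch.zeros(m, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Non-saturating generator step: maximize log D(G(z)) by labeling fakes as 1.
g_loss = bce(D(G(z)), torch.ones(m, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```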
Visualize 200 digits (generate a single image tiled in a grid of 10 × 20 digits) sampled from pθ(x)
6. 0th Order Optimization – Policy Gradient
We will now discuss 0th order optimization, also known as policy gradient in a reinforcement learning (RL) context. Although this method is primarily used in RL, we will adapt it to do 0th order optimization on a neural network.
kth order optimization means that the optimization uses a kth order derivative (d^k f / dw^k) of the objective. So we can see that gradient descent is a first order optimization method, while Newton’s method is a second order optimization method.
Policy gradient is a 0th order optimization method, which means that you use no derivatives for the optimization. It is used in contexts in which the loss is a black-box function, so propagating a gradient through it is impossible.
At a high level, policy gradient approximates the gradient and then performs gradient descent using this approximated gradient.
(a) Prove that
∇w pw(τ) = pw(τ) ∇w log pw(τ)
(b) Let’s say we have a neural network f(x) which takes in an input x and uses its weights w to output 2 logits (P = [P(y = 0), P(y = 1)]).
Let pw(x,y) be the joint distribution of the input and output data according to our model. Hence pw(x,y) = p(x)pw(y|x), where p(x) is the ground-truth distribution of x, while pw(y|x) = f(x)[y] is what our model predicts.
Similarly, we have a black-box loss function L(x,f(x)) which outputs a loss. For example, if we wanted to learn to classify y = 1 if x > 5 and y = 0 otherwise, L(4,(0.9,0.1)) would be small while L(4,(0.1,0.9)) would be very high. As we already discussed, since this loss is black-box, we can’t take the derivative through it.
We want to optimize the following objective function
w∗ = argminwJ(w)
where
J(w) = E(x,f(x))∼pw(x,y)[L(x,f(x))].
To do this optimization we want to approximate ∇wJ(w) so that we could use an optimization method like gradient descent to find w∗
Prove that ∇wJ(w) can be approximated as
∇wJ(w) ≈ (1/N) Σi L(x(i),f(x(i))) ∇w log pw(y(i)|x(i))
for samples (x(i),y(i)) ∼ pw(x,y).
Hints:
• Try creating a τ = (x,f(x))
• E[X] = ∫ x P(X = x)dx
• Use the result from part (a), i.e., ∇w pw(τ) = pw(τ)∇w log pw(τ)
• pw(x,y) = p(x)pw(y|x)
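A REINFORCE-style sketch of the resulting estimator, with a hypothetical tiny network and a hand-made black-box 0/1 loss (all names and sizes are illustrative):

```python
import torch

torch.manual_seed(0)
w = torch.nn.Linear(1, 2)  # toy f(x): outputs logits for y in {0, 1}

def blackbox_loss(x, y):
    # We may only evaluate this; no gradients flow through it.
    return float(y != (x.item() > 5.0))  # 0/1 loss for "classify y=1 iff x > 5"

xs = torch.tensor([[2.0], [7.0], [4.0], [9.0]])
probs = torch.softmax(w(xs), dim=-1)            # p_w(y | x)
ys = torch.multinomial(probs, 1).squeeze(-1)    # sample labels y ~ p_w(y | x)

# grad_w J ~= (1/N) sum_i L(x_i, y_i) * grad_w log p_w(y_i | x_i):
# build a surrogate whose autograd gradient equals this estimate.
log_p = torch.log(probs[torch.arange(len(xs)), ys])
losses = torch.tensor([blackbox_loss(x, y) for x, y in zip(xs, ys)])
surrogate = (losses * log_p).mean()
surrogate.backward()  # w.weight.grad now holds the gradient estimate
```

Note that `backward()` only differentiates through `log_p`; the black-box loss values enter as plain constants, which is exactly why this works without a gradient through the loss.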
(c) The following two parts are based on this notebook, where you need to implement policy gradient according to your derivation before answering these questions.
Include the screenshot of the accuracy plot here. With a correct implementation, you should observe a test accuracy of approximately 75% after the final iteration.
(d) Compare the policy gradient and supervised learning approaches for this classification task, focusing on their convergence speed, stability, and final performance. Explain any observed differences.
7. Homework Process and Study Group
We also want to understand what resources you find helpful and how much time homework is taking, so we can change things in the future if possible.
(a) What sources (if any) did you use as you worked through the homework?
(b) If you worked with someone on this homework, who did you work with?
List names and student IDs. (If you attended homework party, you can also just describe the group.)
(c) Roughly how many total hours did you work on this homework? Write it down here where you’ll need to remember it for the self-grade form.
Contributors:
• Linyuan Gong.
• Kumar Krishna Agrawal.
• Suhong Moon.
• Dhruv Shah.
• Anant Sahai.
• Aditya Grover.
• Stefano Ermon.
• Yashish Mohnot.