Description
Problem 1: Variational inference.
Standard VI minimizes KL(q(z) ‖ p(z | x)), the Kullback-Leibler divergence from the variational approximation q(z) to the true posterior p(z | x). In this problem we will develop some intuition for this optimization problem. For further reference, see Chapter 10 of Pattern Recognition and Machine Learning by Bishop.
- Let Q = {q(z) : q(z) = ∏_{d=1}^D N(z_d | m_d, v_d²)} denote the set of Gaussian densities on z ∈ R^D with diagonal covariance matrices. Solve for
q* = arg min_{q ∈ Q} KL(q(z) ‖ N(z | µ, Σ)),
where Σ is an arbitrary covariance matrix.
- Now solve for q* ∈ Q that minimizes the KL in the opposite direction,
q* = arg min_{q ∈ Q} KL(N(z | µ, Σ) ‖ q(z)).
Your answer here.
- Plot the contour lines of your solutions to parts (a) and (b) for the case where
µ = [0; 0],  Σ = [1, 0.9; 0.9, 1].
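As a sanity check for the plot, the two diagonal-Gaussian solutions can be computed numerically. The sketch below assumes the standard results (precision matching for KL(q ‖ p), moment matching for KL(p ‖ q)) and evaluates the densities on a grid, which matplotlib's plt.contour(X, Y, Z) can then draw:

```python
import numpy as np

# Target: N(mu, Sigma) with strong correlation.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
Lambda = np.linalg.inv(Sigma)  # precision matrix

# Standard results for diagonal-Gaussian approximations (Bishop, Ch. 10):
# (a) arg min KL(q || p): match the mean and the diagonal of the precision.
var_reverse = 1.0 / np.diag(Lambda)
# (b) arg min KL(p || q): moment matching, i.e. the marginal variances.
var_forward = np.diag(Sigma)

# Evaluate all three densities on a grid for contour plotting.
xs = np.linspace(-3, 3, 200)
X, Y = np.meshgrid(xs, xs)
pts = np.stack([X.ravel(), Y.ravel()], axis=1)

def gauss_pdf(pts, mean, cov):
    """Density of a 2D Gaussian at each row of pts."""
    d = pts - mean
    prec = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("ni,ij,nj->n", d, prec, d)
    return np.exp(expo) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

Z_true = gauss_pdf(pts, mu, Sigma).reshape(X.shape)
Z_rev = gauss_pdf(pts, mu, np.diag(var_reverse)).reshape(X.shape)
Z_fwd = gauss_pdf(pts, mu, np.diag(var_forward)).reshape(X.shape)
```

Note that the reverse-KL solution is much narrower than the forward-KL one here, which is exactly the mode-seeking vs. mass-covering behavior the plot should show.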
Problem 2: Variational autoencoders (VAEs)
In class we derived VAEs as generative models p(x, z; θ) of observations x ∈ R^P and latent variables z ∈ R^D, with parameters θ. We used variational expectation-maximization to learn the parameters θ that maximize a lower bound on the marginal likelihood,
∑_{n=1}^N log p(x_n; θ) ≥ ∑_{n=1}^N E_{q(z_n | x_n, φ)}[log p(x_n, z_n; θ) − log q(z_n | x_n, φ)] ≜ L(θ, φ).
The difference between VAEs and regular variational expectation-maximization is that we constrained the variational distribution q(z | x, φ) to be a parametric function of the data; for example, we considered
q(z_n | x_n, φ) = N(z_n | µ(x_n; φ), diag([σ_1²(x_n; φ), …, σ_D²(x_n; φ)])),
where µ : R^P → R^D and σ_d² : R^P → R_+ are functions parameterized by φ that take in a datapoint x_n and output the means and variances of z_n, respectively. In practice, it is common to implement these functions with neural networks. Here we will study VAEs in some special cases. For further reference, see Kingma and Welling (2019), which is linked on the course website.
(a) Consider the linear Gaussian factor model,
p(x_n, z_n; θ) = N(z_n | 0, I) N(x_n | A z_n, V),
where A ∈ R^{P×D}, V ∈ R^{P×P} is a diagonal, positive definite matrix, and θ = (A, V). Solve for the true posterior p(z_n | x_n, θ).
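Whatever form your derivation takes, it can be checked numerically. The sketch below assumes the standard linear-Gaussian result in information form, S = (I + AᵀV⁻¹A)⁻¹ and m = S AᵀV⁻¹x, and cross-checks it against direct joint-Gaussian conditioning:

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 4, 2
A = rng.normal(size=(P, D))
V = np.diag(rng.uniform(0.5, 1.5, size=P))  # diagonal, positive definite
x = rng.normal(size=P)

# Information form of the posterior N(z | m, S).
Vinv = np.linalg.inv(V)
S = np.linalg.inv(np.eye(D) + A.T @ Vinv @ A)
m = S @ A.T @ Vinv @ x

# Cross-check via joint-Gaussian conditioning:
# Cov(x) = A A^T + V, Cov(z, x) = A^T.
K = A.T @ np.linalg.inv(A @ A.T + V)
m2 = K @ x
S2 = np.eye(D) - K @ A

assert np.allclose(m, m2) and np.allclose(S, S2)
```

The two forms agree by the matrix inversion lemma; the information form is the one that is cheap when D ≪ P.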
- Consider the variational family of Gaussian densities with diagonal covariance, as described above, and assume that µ(x; φ) and log σ_d²(x; φ) are linear functions of x. Does this family contain the true posterior? Find the member of this variational family that maximizes L(θ, φ) for fixed θ. (Hint: use your answer to Problem 1a.)
Your answer here.
- Now consider a simple nonlinear factor model,
p(x_n, z_n; θ) = N(z_n | 0, I) ∏_{p=1}^P N(x_{np} | e^{a_p^T z_n}, v_p),
parameterized by a_p ∈ R^D and v_p ∈ R_+. The posterior is no longer Gaussian, since the mean of x_{np} is a nonlinear function of the latent variable.[1]
Generate a synthetic dataset by sampling N = 1000 datapoints from a D = 1, P = 2 dimensional model with A = [1.2, 1]^T and v_p = 0.1 for p = 1, 2. Use the reparameterization trick and automatic differentiation to perform stochastic gradient descent on −L(θ, φ).
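A minimal numpy sketch of the two building blocks: simulating the dataset, and forming one reparameterized Monte Carlo estimate of L(θ, φ). The linear recognition network and its initialization here are illustrative assumptions; the full optimization would compute gradients of this estimate with an autodiff framework (e.g. PyTorch or JAX).

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, P = 1000, 1, 2
A = np.array([1.2, 1.0])   # true loadings a_p (D = 1)
v = np.array([0.1, 0.1])   # noise variances v_p

# Simulate: z_n ~ N(0, 1), x_np ~ N(exp(a_p z_n), v_p).
z = rng.normal(size=(N, 1))
x = np.exp(z * A) + rng.normal(size=(N, P)) * np.sqrt(v)

# Illustrative linear recognition network q(z|x) = N(mu(x), sigma^2(x))
# at an arbitrary initialization of phi.
W, b = np.zeros((1, P)), 0.0     # mu(x) = W x + b
log_sig2 = np.zeros(N)           # log sigma^2(x), initialized to 0

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1).
mu_q = x @ W.T + b                          # (N, 1)
sig = np.exp(0.5 * log_sig2)[:, None]
eps = rng.normal(size=(N, 1))
z_s = mu_q + sig * eps                      # differentiable in phi

# Single-sample ELBO estimate: E_q[log p(x, z) - log q(z | x)].
log_p = (-0.5 * z_s**2 - 0.5 * np.log(2 * np.pi)).sum(axis=1) \
      + (-0.5 * (x - np.exp(z_s * A))**2 / v
         - 0.5 * np.log(2 * np.pi * v)).sum(axis=1)
log_q = (-0.5 * eps**2 - 0.5 * np.log(2 * np.pi)).sum(axis=1) - 0.5 * log_sig2
elbo = (log_p - log_q).sum()
```

SGD on −L(θ, φ) then repeats this estimate on minibatches, backpropagating through z_s into both θ = (A, v) and φ = (W, b, …).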
Make the following plots:
- A scatter plot of your simulated data (with equal axis limits).
- A plot of L(θ, φ) as a function of SGD iteration.
- A plot of the model parameters (A_{11}, A_{21}, v_1, v_2) as a function of SGD iteration.
- The approximate Gaussian posterior with mean µ(x; φ) and variance σ_1²(x; φ) for x ∈ {(0, 0), (1, 1), (10, 7)} using the learned parameters φ.
- The true posterior at those points. (Since z is one dimensional, you can compute the true posterior with numerical integration.)
Comment on your results.
Your results here.
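For the true-posterior plots above, one way to do the one-dimensional numerical integration is to evaluate the unnormalized log joint on a grid and normalize with a Riemann sum; the grid range and resolution here are illustrative choices:

```python
import numpy as np

A = np.array([1.2, 1.0])   # true loadings
v = np.array([0.1, 0.1])   # noise variances

def true_posterior(x, zs):
    """p(z | x) on the grid zs, normalized by numerical integration."""
    # log p(z, x) = log N(z | 0, 1) + sum_p log N(x_p | exp(a_p z), v_p)
    log_joint = -0.5 * zs**2 - 0.5 * np.log(2 * np.pi)
    for p in range(len(x)):
        log_joint += (-0.5 * (x[p] - np.exp(A[p] * zs))**2 / v[p]
                      - 0.5 * np.log(2 * np.pi * v[p]))
    post = np.exp(log_joint - log_joint.max())  # stable exponentiation
    dz = zs[1] - zs[0]
    post /= post.sum() * dz                     # normalize to integrate to 1
    return post

zs = np.linspace(-4, 4, 2001)
for x in [(0.0, 0.0), (1.0, 1.0), (10.0, 7.0)]:
    post = true_posterior(np.array(x), zs)
    # plt.plot(zs, post) would overlay this on the Gaussian approximation.
```

Overlaying these curves on the learned q(z | x) makes the quality of the approximation at each test point easy to judge.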
Problem 3: Semi-Markov models
Consider a Markov model as described in class and in, for example, Chapter 13 of Pattern Recognition and Machine Learning by Bishop,
p(z_{1:T} | π, A) = p(z_1 | π) ∏_{t=2}^T p(z_t | z_{t−1}, A),
where z_t ∈ {1, …, K} denotes the “state,” and
p(z_1 = i) = π_i,
p(z_t = j | z_{t−1} = i, A) = A_{ij}.
We will study the distribution of state durations—the length of time spent in a state before transitioning. Let d ≥ 1 denote the number of time steps before a transition out of state z_1. That is, z_1 = i, …, z_d = i for some i, but z_{d+1} ≠ i.
- Show that p(d | z_1 = i, A) = Geom(d | p_i), the probability mass function of the geometric distribution. Solve for the parameter p_i as a function of the transition matrix A.
Your answer here.
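A quick Monte Carlo check of your derivation: simulate durations directly from the chain and compare their mean to that of the claimed geometric distribution. The sketch assumes the convention Geom(d | p) with P(d) = (1 − p)^{d−1} p on d = 1, 2, …; the particular transition matrix is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.25, 0.25, 0.5]])

def sample_duration(A, i, rng):
    """Start in state i; return the number of steps before leaving it."""
    d = 1
    while rng.choice(len(A), p=A[i]) == i:
        d += 1
    return d

durs = np.array([sample_duration(A, 0, rng) for _ in range(20_000)])
# If p(d | z_1 = i) = Geom(d | p_i), the empirical mean should match 1/p_i.
```

Under the convention above, the derived p_i should make 1/p_i agree with durs.mean() (for this A, the stay probability in state 0 is 0.7, so the mean duration is about 3.33).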
- We can equivalently represent z_{1:T} as a set of states and durations {(z̃_n, d_n)}_{n=1}^N, where z̃_n ∈ {1, …, K} \ {z̃_{n−1}} denotes the index of the n-th visited state and d_n ∈ N denotes the duration spent in that state before transition. There is a one-to-one mapping between states/durations and the original state sequence:
(z_1, …, z_T) = (z̃_1, …, z̃_1, z̃_2, …, z̃_2, …, z̃_N, …, z̃_N),
where z̃_1 appears d_1 times, z̃_2 appears d_2 times, and so on through z̃_N, which appears d_N times.
Show that the probability mass function of the states and durations is of the form
p({(z̃_n, d_n)}_{n=1}^N) = p(z̃_1 | π) [∏_{n=1}^{N−1} p(d_n | z̃_n, A) p(z̃_{n+1} | z̃_n, A)] p(d_N | z̃_N, A),
and derive each conditional probability mass function.
Your answer here.
(c) Semi-Markov models replace p(d_n | z̃_n) with a more flexible duration distribution. For example, consider the model,
p(dn | z˜n)= NB(dn | r, θz˜n),
where r ∈ N and θ_k ∈ [0, 1] for k = 1, …, K. Recall from Assignment 1 that the negative binomial distribution with integer r is equivalent to a sum of r geometric random variables. Use this equivalence to write the semi-Markov model with negative binomial durations as a Markov model on an extended set of states s_n ∈ {1, …, Kr}. Specifically, write the transition matrix for p(s_n | s_{n−1}) and the mapping from s_n to z_n.
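One common version of this construction can be sketched in code, assuming each of the r geometric stages advances with probability θ_k and the semi-Markov matrix A has a zero diagonal. The stage convention (advance vs. stay probability) depends on your NB parameterization from Assignment 1, so treat the θ_k role below as an assumption to adapt.

```python
import numpy as np

def extended_transition_matrix(A, theta, r):
    """Kr x Kr transition matrix for negative-binomial durations.

    Extended state s = k*r + m encodes latent state k in {0..K-1} and
    stage m in {0..r-1}; the observed state is recovered as z = s // r.
    Each stage lasts a geometric number of steps (stay w.p. 1 - theta_k,
    advance w.p. theta_k), so the total time in state k is a sum of r
    geometrics, i.e. negative binomial.
    """
    K = len(theta)
    P = np.zeros((K * r, K * r))
    for k in range(K):
        for m in range(r):
            s = k * r + m
            P[s, s] += 1 - theta[k]                 # remain in current stage
            if m < r - 1:
                P[s, s + 1] += theta[k]             # advance to next stage
            else:
                for j in range(K):                  # leave: enter stage 0 of j
                    P[s, j * r] += theta[k] * A[k, j]
    return P

# Illustrative example: K = 2 states, r = 3 stages each.
A_sm = np.array([[0.0, 1.0], [1.0, 0.0]])  # zero-diagonal semi-Markov matrix
theta = np.array([0.4, 0.6])
P_ext = extended_transition_matrix(A_sm, theta, 3)
```

Each row of P_ext sums to one whenever the rows of A do, and the mapping s → z = s // r collapses the extended chain back to the observed state sequence.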
[1] For this particular model, the expectations in L(θ, φ) can still be computed in closed form using the fact that E[e^z] = e^{µ + σ²/2} for z ∼ N(µ, σ²).



