Description
Problem 1: Variational inference.
Standard VI minimizes KL(q(z) ‖ p(z | x)), the Kullback-Leibler divergence from the variational approximation q(z) to the true posterior p(z | x). In this problem we will develop some intuition for this optimization problem. For further reference, see Chapter 10 of Pattern Recognition and Machine Learning by Bishop.
- Let Q = {q(z) : q(z) = ∏_{d=1}^D N(z_d | m_d, v_d²)} denote the set of Gaussian densities on z ∈ R^D with diagonal covariance matrices. Solve for
q* = arg min_{q ∈ Q} KL(q(z) ‖ N(z | µ, Σ)),
where Σ is an arbitrary covariance matrix.
- Now solve for q* ∈ Q that minimizes the KL in the opposite direction,
q* = arg min_{q ∈ Q} KL(N(z | µ, Σ) ‖ q(z)).
Your answer here.
- Plot the contour lines of your solutions to parts (a) and (b) for the case where
µ = [0; 0],  Σ = [1, 0.9; 0.9, 1].
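As a sanity check for the plot, the two diagonal-Gaussian solutions can be computed numerically. The sketch below assumes the standard results (precision matching for KL(q ‖ p), moment matching for KL(p ‖ q)) and evaluates the densities on a grid, which matplotlib's plt.contour(X, Y, Z) can then draw:

```python
import numpy as np

# Target: N(mu, Sigma) with strong correlation.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
Lambda = np.linalg.inv(Sigma)  # precision matrix

# Standard results for diagonal-Gaussian approximations (Bishop, Ch. 10):
# (a) arg min KL(q || p): match the mean and the diagonal of the precision.
var_reverse = 1.0 / np.diag(Lambda)
# (b) arg min KL(p || q): moment matching, i.e. the marginal variances.
var_forward = np.diag(Sigma)

# Evaluate all three densities on a grid for contour plotting.
xs = np.linspace(-3, 3, 200)
X, Y = np.meshgrid(xs, xs)
pts = np.stack([X.ravel(), Y.ravel()], axis=1)

def gauss_pdf(pts, mean, cov):
    """Density of a 2D Gaussian at each row of pts."""
    d = pts - mean
    prec = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("ni,ij,nj->n", d, prec, d)
    return np.exp(expo) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

Z_true = gauss_pdf(pts, mu, Sigma).reshape(X.shape)
Z_rev = gauss_pdf(pts, mu, np.diag(var_reverse)).reshape(X.shape)
Z_fwd = gauss_pdf(pts, mu, np.diag(var_forward)).reshape(X.shape)
```

Note that the reverse-KL solution is much narrower than the forward-KL one here, which is exactly the mode-seeking vs. mass-covering behavior the plot should show.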
Problem 2: Variational autoencoders (VAEs)
In class we derived VAEs as generative models p(x, z; θ) of observations x ∈ R^P and latent variables z ∈ R^D, with parameters θ. We used variational expectation-maximization to learn the parameters θ that maximize a lower bound on the marginal likelihood,
∑_{n=1}^N log p(x_n; θ) ≥ ∑_{n=1}^N E_{q(z_n | x_n, φ)}[log p(x_n, z_n; θ) − log q(z_n | x_n, φ)] ≜ L(θ, φ).
The difference between VAEs and regular variational expectation-maximization is that we constrained the variational distribution q(z | x, φ) to be a parametric function of the data; for example, we considered
q(z_n | x_n, φ) = N(z_n | µ(x_n; φ), diag([σ_1²(x_n; φ), …, σ_D²(x_n; φ)])),
where µ : R^P → R^D and σ_d² : R^P → R_+ are functions parameterized by φ that take in a datapoint x_n and output the means and variances of z_n, respectively. In practice, it is common to implement these functions with neural networks. Here we will study VAEs in some special cases. For further reference, see Kingma and Welling (2019), which is linked on the course website.
(a) Consider the linear Gaussian factor model,
p(x_n, z_n; θ) = N(z_n | 0, I) N(x_n | A z_n, V),
where A ∈ R^{P×D}, V ∈ R^{P×P} is a diagonal, positive definite matrix, and θ = (A, V). Solve for the true posterior p(z_n | x_n, θ).
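Whatever form your derivation takes, it can be checked numerically. The sketch below assumes the standard linear-Gaussian result in information form, S = (I + AᵀV⁻¹A)⁻¹ and m = S AᵀV⁻¹x, and cross-checks it against direct joint-Gaussian conditioning:

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 4, 2
A = rng.normal(size=(P, D))
V = np.diag(rng.uniform(0.5, 1.5, size=P))  # diagonal, positive definite
x = rng.normal(size=P)

# Information form of the posterior N(z | m, S).
Vinv = np.linalg.inv(V)
S = np.linalg.inv(np.eye(D) + A.T @ Vinv @ A)
m = S @ A.T @ Vinv @ x

# Cross-check via joint-Gaussian conditioning:
# Cov(x) = A A^T + V, Cov(z, x) = A^T.
K = A.T @ np.linalg.inv(A @ A.T + V)
m2 = K @ x
S2 = np.eye(D) - K @ A

assert np.allclose(m, m2) and np.allclose(S, S2)
```

The two forms agree by the matrix inversion lemma; the information form is the one that is cheap when D ≪ P.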
- Consider the variational family of Gaussian densities with diagonal covariance, as described above, and assume that µ(x; φ) and log σ_d²(x; φ) are linear functions of x. Does this family contain the true posterior? Find the member of this variational family that maximizes L(θ, φ) for fixed θ. (Hint: use your answer to Problem 1a.)
Your answer here.
- Now consider a simple nonlinear factor model,
p(x_n, z_n; θ) = N(z_n | 0, I) ∏_{p=1}^P N(x_{np} | e^{a_p^T z_n}, v_p),
parameterized by a_p ∈ R^D and v_p ∈ R_+. The posterior is no longer Gaussian, since the mean of x_{np} is a nonlinear function of the latent variable.[1]
Generate a synthetic dataset by sampling N = 1000 datapoints from a D = 1, P = 2 dimensional model with A = [1.2, 1]^T and v_p = 0.1 for p = 1, 2. Use the reparameterization trick and automatic differentiation to perform stochastic gradient descent on −L(θ, φ).
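A minimal numpy sketch of the two building blocks: simulating the dataset, and forming one reparameterized Monte Carlo estimate of L(θ, φ). The linear recognition network and its initialization here are illustrative assumptions; the full optimization would compute gradients of this estimate with an autodiff framework (e.g. PyTorch or JAX).

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, P = 1000, 1, 2
A = np.array([1.2, 1.0])   # true loadings a_p (D = 1)
v = np.array([0.1, 0.1])   # noise variances v_p

# Simulate: z_n ~ N(0, 1), x_np ~ N(exp(a_p z_n), v_p).
z = rng.normal(size=(N, 1))
x = np.exp(z * A) + rng.normal(size=(N, P)) * np.sqrt(v)

# Illustrative linear recognition network q(z|x) = N(mu(x), sigma^2(x))
# at an arbitrary initialization of phi.
W, b = np.zeros((1, P)), 0.0     # mu(x) = W x + b
log_sig2 = np.zeros(N)           # log sigma^2(x), initialized to 0

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1).
mu_q = x @ W.T + b                          # (N, 1)
sig = np.exp(0.5 * log_sig2)[:, None]
eps = rng.normal(size=(N, 1))
z_s = mu_q + sig * eps                      # differentiable in phi

# Single-sample ELBO estimate: E_q[log p(x, z) - log q(z | x)].
log_p = (-0.5 * z_s**2 - 0.5 * np.log(2 * np.pi)).sum(axis=1) \
      + (-0.5 * (x - np.exp(z_s * A))**2 / v
         - 0.5 * np.log(2 * np.pi * v)).sum(axis=1)
log_q = (-0.5 * eps**2 - 0.5 * np.log(2 * np.pi)).sum(axis=1) - 0.5 * log_sig2
elbo = (log_p - log_q).sum()
```

SGD on −L(θ, φ) then repeats this estimate on minibatches, backpropagating through z_s into both θ = (A, v) and φ = (W, b, …).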
Make the following plots:
- A scatter plot of your simulated data (with equal axis limits).
- A plot of L(θ, φ) as a function of SGD iteration.
- A plot of the model parameters (A_{11}, A_{21}, v_1, v_2) as a function of SGD iteration.
- The approximate Gaussian posterior with mean µ(x; φ) and variance σ_1²(x; φ) for x ∈ {(0, 0), (1, 1), (10, 7)} using the learned parameters φ.
- The true posterior at those points. (Since z is one dimensional, you can compute the true posterior with numerical integration.)
Comment on your results.
Your results here.
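For the true-posterior plots above, one way to do the one-dimensional numerical integration is to evaluate the unnormalized log joint on a grid and normalize with a Riemann sum; the grid range and resolution here are illustrative choices:

```python
import numpy as np

A = np.array([1.2, 1.0])   # true loadings
v = np.array([0.1, 0.1])   # noise variances

def true_posterior(x, zs):
    """p(z | x) on the grid zs, normalized by numerical integration."""
    # log p(z, x) = log N(z | 0, 1) + sum_p log N(x_p | exp(a_p z), v_p)
    log_joint = -0.5 * zs**2 - 0.5 * np.log(2 * np.pi)
    for p in range(len(x)):
        log_joint += (-0.5 * (x[p] - np.exp(A[p] * zs))**2 / v[p]
                      - 0.5 * np.log(2 * np.pi * v[p]))
    post = np.exp(log_joint - log_joint.max())  # stable exponentiation
    dz = zs[1] - zs[0]
    post /= post.sum() * dz                     # normalize to integrate to 1
    return post

zs = np.linspace(-4, 4, 2001)
for x in [(0.0, 0.0), (1.0, 1.0), (10.0, 7.0)]:
    post = true_posterior(np.array(x), zs)
    # plt.plot(zs, post) would overlay this on the Gaussian approximation.
```

Overlaying these curves on the learned q(z | x) makes the quality of the approximation at each test point easy to judge.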
Problem 3: Semi-Markov models
Consider a Markov model as described in class and in, for example, Chapter 13 of Pattern Recognition and Machine Learning by Bishop,
p(z_{1:T} | π, A) = p(z_1 | π) ∏_{t=2}^T p(z_t | z_{t−1}, A),
where z_t ∈ {1, …, K} denotes the “state,” and
p(z_1 = i) = π_i,
p(z_t = j | z_{t−1} = i, A) = A_{ij}.
We will study the distribution of state durations—the length of time spent in a state before transitioning. Let d ≥ 1 denote the number of time steps before a transition out of state z_1. That is, z_1 = i, …, z_d = i for some i, but z_{d+1} ≠ i.
- Show that p(d | z_1 = i, A) = Geom(d | p_i), the probability mass function of the geometric distribution. Solve for the parameter p_i as a function of the transition matrix A.
Your answer here.
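A quick Monte Carlo check of your derivation: simulate durations directly from the chain and compare their mean to that of the claimed geometric distribution. The sketch assumes the convention Geom(d | p) with P(d) = (1 − p)^{d−1} p on d = 1, 2, …; the particular transition matrix is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.25, 0.25, 0.5]])

def sample_duration(A, i, rng):
    """Start in state i; return the number of steps before leaving it."""
    d = 1
    while rng.choice(len(A), p=A[i]) == i:
        d += 1
    return d

durs = np.array([sample_duration(A, 0, rng) for _ in range(20_000)])
# If p(d | z_1 = i) = Geom(d | p_i), the empirical mean should match 1/p_i.
```

Under the convention above, the derived p_i should make 1/p_i agree with durs.mean() (for this A, the stay probability in state 0 is 0.7, so the mean duration is about 3.33).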
- We can equivalently represent z_{1:T} as a set of states and durations {(z̃_n, d_n)}_{n=1}^N, where z̃_n ∈ {1, …, K} \ {z̃_{n−1}} denotes the index of the n-th visited state and d_n ∈ N denotes the duration spent in that state before transition. There is a one-to-one mapping between states/durations and the original state sequence:
(z_1, …, z_T) = (z̃_1, …, z̃_1, z̃_2, …, z̃_2, …, z̃_N, …, z̃_N),
where z̃_1 appears d_1 times, z̃_2 appears d_2 times, and so on through z̃_N, which appears d_N times.
Show that the probability mass function of the states and durations is of the form
p({(z̃_n, d_n)}_{n=1}^N) = p(z̃_1 | π) [∏_{n=1}^{N−1} p(d_n | z̃_n, A) p(z̃_{n+1} | z̃_n, A)] p(d_N | z̃_N, A),
and derive each conditional probability mass function.
Your answer here.
(c) Semi-Markov models replace p(d_n | z̃_n) with a more flexible duration distribution. For example, consider the model,
p(dn | z˜n)= NB(dn | r, θz˜n),
where r ∈ N and θ_k ∈ [0, 1] for k = 1, …, K. Recall from Assignment 1 that the negative binomial distribution with integer r is equivalent to a sum of r geometric random variables. Use this equivalence to write the semi-Markov model with negative binomial durations as a Markov model on an extended set of states s_n ∈ {1, …, Kr}. Specifically, write the transition matrix for p(s_n | s_{n−1}) and the mapping from s_n to z_n.
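One common version of this construction can be sketched in code, assuming each of the r geometric stages advances with probability θ_k and the semi-Markov matrix A has a zero diagonal. The stage convention (advance vs. stay probability) depends on your NB parameterization from Assignment 1, so treat the θ_k role below as an assumption to adapt.

```python
import numpy as np

def extended_transition_matrix(A, theta, r):
    """Kr x Kr transition matrix for negative-binomial durations.

    Extended state s = k*r + m encodes latent state k in {0..K-1} and
    stage m in {0..r-1}; the observed state is recovered as z = s // r.
    Each stage lasts a geometric number of steps (stay w.p. 1 - theta_k,
    advance w.p. theta_k), so the total time in state k is a sum of r
    geometrics, i.e. negative binomial.
    """
    K = len(theta)
    P = np.zeros((K * r, K * r))
    for k in range(K):
        for m in range(r):
            s = k * r + m
            P[s, s] += 1 - theta[k]                 # remain in current stage
            if m < r - 1:
                P[s, s + 1] += theta[k]             # advance to next stage
            else:
                for j in range(K):                  # leave: enter stage 0 of j
                    P[s, j * r] += theta[k] * A[k, j]
    return P

# Illustrative example: K = 2 states, r = 3 stages each.
A_sm = np.array([[0.0, 1.0], [1.0, 0.0]])  # zero-diagonal semi-Markov matrix
theta = np.array([0.4, 0.6])
P_ext = extended_transition_matrix(A_sm, theta, 3)
```

Each row of P_ext sums to one whenever the rows of A do, and the mapping s → z = s // r collapses the extended chain back to the observed state sequence.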
[1] For this particular model, the expectations in L(θ, φ) can still be computed in closed form using the fact that E[e^z] = e^{µ + σ²/2} for z ∼ N(µ, σ²).



