Description
Theory Questions
- Singletons (Section 3.5, Ex. 2 in the course textbook). Let X be a discrete domain, and let HSingleton = {hz : z ∈ X} ∪ {h−}, where for each z ∈ X, hz is the function defined by hz(x) = 1 if x = z and hz(x) = 0 if x ≠ z. h− is simply the all-negative hypothesis, namely ∀x ∈ X, h−(x) = 0.
The realizability assumption here implies that the true hypothesis f labels negatively all examples in the domain, perhaps except one.
Describe an algorithm that implements the ERM rule for learning HSingleton in the realizable setup.
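For illustration, a minimal sketch of one way such an ERM rule could look in Python, assuming a realizable sample of (x, y) pairs (function and variable names are illustrative):

```python
def erm_singleton(sample):
    """ERM for HSingleton on a realizable sample of (x, y) pairs.

    Under realizability, all positive examples share a single value z,
    so hz is consistent if any positive example exists; otherwise h- is.
    """
    positives = {x for (x, y) in sample if y == 1}
    if positives:
        z = positives.pop()                    # realizability: at most one distinct z
        return lambda x: 1 if x == z else 0    # the singleton hypothesis hz
    return lambda x: 0                         # the all-negative hypothesis h-
```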
- PAC in Expectation. Consider learning in the realizable case. We say a hypothesis class H is PAC learnable in expectation if there exist a learning algorithm A and a function N : (0,1) → N such that for every a ∈ (0,1) and for any distribution P, given a sample set S with |S| ≥ N(a), it holds that

E[eP(A(S))] ≤ a,

where the expectation is over the draw of S.
Show that H is PAC learnable if and only if H is PAC learnable in expectation. (Hint: use Markov's inequality, and refer to the derivations between equations 3.13 and 3.14 in the lecture scribes on VC dimension.)
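For one direction of the equivalence, a worked application of Markov's inequality (a sketch; the choice a = εδ is what makes it go through):

```latex
% If H is PAC learnable in expectation with sample complexity N(a), set a = eps*delta.
% For |S| >= N(eps*delta), Markov's inequality gives:
\[
\Pr\big[ e_P(A(S)) \ge \epsilon \big]
  \le \frac{\mathbb{E}\big[ e_P(A(S)) \big]}{\epsilon}
  \le \frac{\epsilon\,\delta}{\epsilon}
  = \delta .
\]
```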
- Union Of Intervals. Determine the VC-dimension of the subsets of the real line formed by the union of k intervals (see question 1 of the programming assignment for a formal definition of H).
- Right-angle Triangles. Determine the VC-dimension of the hypothesis class H of axis-aligned right-angle triangles in the plane, with the right angle in the lower-left corner.
- Structural Risk Minimization. Let H be a countable hypothesis class, that is, H can be written as H = ∪i∈N {hi}. Let w : H → [0,1] be a function such that Σh∈H w(h) ≤ 1. We refer to w as a weight function over the hypotheses, which reflects the prior for each hypothesis.
(a) Show that with probability 1 − δ over the choice of S ∼ P (|S| = n), for every h ∈ H simultaneously:

|eP(h) − eS(h)| ≤ √( ln(2/(w(h)·δ)) / (2n) ).

Hint: use the uniform convergence property for each hypothesis class {hi} of size 1, with confidence parameter w(hi)·δ.
(b) Denote by ĥ the hypothesis returned by the SRM rule, ĥ = argminh∈H [eS(h) + √( ln(2/(w(h)·δ)) / (2n) )]. Show that with probability 1 − δ:

eP(ĥ) ≤ eP(hmin) + 2·√( ln(2/(w(hmin)·δ)) / (2n) ),

where hmin = argminh∈H eP(h).
This implies that when the hypothesis with optimal error has high weight, it will be learned from few samples.
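One way to organize the proof of (b), chaining the bound from (a) (a sketch; pen(h) abbreviates the confidence term):

```latex
% Sketch: write pen(h) = sqrt( ln(2/(w(h)*delta)) / (2n) ).
% On the event of (a), which holds with probability 1 - delta:
\begin{align*}
e_P(\hat h) &\le e_S(\hat h) + \mathrm{pen}(\hat h)         && \text{by (a)} \\
            &\le e_S(h_{\min}) + \mathrm{pen}(h_{\min})     && \text{$\hat h$ minimizes the penalized empirical risk} \\
            &\le e_P(h_{\min}) + 2\,\mathrm{pen}(h_{\min})  && \text{by (a) again.}
\end{align*}
```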
- Loss Minimization. Consider binary classification with the following loss function:
∆b(y, ŷ) =
    0 if ŷ = y
    0.5 if y = 1, ŷ = 0
    1 if y = 0, ŷ = 1
Find the optimal prediction rule for this loss, i.e., the predictor minimizing the expected loss E[∆b(y, h(x))], as a function of P[y = 1 | x].
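A worked comparison of the two conditional risks under this loss (a sketch of the computation, writing η(x) = P[y = 1 | x]):

```latex
% Conditional expected loss of each prediction at a point x with \eta = P[y=1|x]:
%   predicting \hat y = 0 incurs 0.5 on the event y = 1:  risk 0.5\,\eta
%   predicting \hat y = 1 incurs 1 on the event y = 0:    risk 1 - \eta
% Predicting 1 is therefore optimal whenever 1 - \eta \le 0.5\,\eta, i.e. \eta \ge 2/3:
\[
h^*(x) =
\begin{cases}
1 & \text{if } P[y = 1 \mid x] \ge 2/3 \\
0 & \text{otherwise}
\end{cases}
\]
```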
Programming Assignment
- Union Of Intervals. In this question, we will study the hypothesis class of a finite union of disjoint intervals, and the properties of the ERM algorithm for this class.
To review, let the sample space be X = [0,1] and assume we study a binary classification problem, i.e. Y = {0,1}. We will try to learn using a hypothesis class that consists of k intervals. More explicitly, let I = {[l1,u1],…,[lk,uk]} be k disjoint intervals, such that 0 ≤ l1 ≤ u1 ≤ l2 ≤ u2 ≤ … ≤ uk ≤ 1. For each such set of k disjoint intervals, define the corresponding hypothesis as

hI(x) = 1 if x ∈ [l1,u1] ∪ … ∪ [lk,uk]
hI(x) = 0 otherwise
Finally, define Hk as the hypothesis class that consists of all hypotheses that correspond to k disjoint intervals:
Hk = {hI|I = {[l1,u1],…,[lk,uk]}, 0 ≤ l1 ≤ u1 ≤ l2 ≤ u2 ≤ … ≤ uk ≤ 1}
We note that Hk ⊆ Hk+1, since we can always take one of the intervals to be of length 0 by setting its endpoints to be equal. We are given a sample of size m: S = ⟨x1,y1⟩,…,⟨xm,ym⟩. Assume that the points are sorted, so that 0 ≤ x1 < x2 < … < xm ≤ 1.
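For concreteness, a minimal Python sketch of how such a hypothesis classifies a point (representing each interval as an (l, u) pair is an assumption carried through the sketches below):

```python
def predict(intervals, x):
    """Evaluate hI(x): label 1 iff x lies in one of the k disjoint intervals.

    `intervals` is a list of (l, u) pairs with 0 <= l1 <= u1 <= ... <= uk <= 1.
    """
    return 1 if any(l <= x <= u for (l, u) in intervals) else 0
```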
(a) Assume that the true distribution P[x,y] = P[y|x] · P[x] is: x is distributed uniformly on the interval [0,1], and
P[y = 1|x] = 0.8 if x ∈ [0,0.2] ∪ [0.4,0.6] ∪ [0.8,1]
P[y = 1|x] = 0.1 if x ∈ [0.2,0.4] ∪ [0.6,0.8]
and P[y = 0|x] = 1 − P[y = 1|x].
Write a function that draws m pairs of (x,y) according to the distribution P (see the sketch after this list). Use it to draw a sample of size m = 100 and create a plot:
- Plot the points and their labels (keep the y-axis in the range [−0.1, 1.1] for clarity of presentation).
- Mark the lines x = 0.2,0.4,0.6,0.8 clearly on the plot.
- Run the find_best_interval function on your sample with k = 3, and plot the intervals clearly.
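A minimal sampling sketch, assuming NumPy (the name draw_sample and its return convention are illustrative):

```python
import numpy as np

def draw_sample(m, rng=None):
    """Draw m pairs (x, y): x ~ Uniform[0, 1], y ~ Bernoulli(P[y = 1 | x])."""
    rng = rng or np.random.default_rng()
    xs = np.sort(rng.uniform(0.0, 1.0, size=m))   # sorted, as the ERM code expects
    # P[y = 1 | x] is 0.8 on [0, 0.2] ∪ [0.4, 0.6] ∪ [0.8, 1] and 0.1 elsewhere
    high = (xs <= 0.2) | ((xs >= 0.4) & (xs <= 0.6)) | (xs >= 0.8)
    p1 = np.where(high, 0.8, 0.1)
    ys = (rng.uniform(size=m) < p1).astype(int)
    return xs, ys
```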
(b) Note that here we know the true distribution P, so for every given hypothesis h ∈ Hk we can calculate error(h) precisely. What is the hypothesis with the smallest error?
(c) Write a function that, given a list of intervals, calculates the true error of the corresponding hypothesis (a sketch follows this item). Then, for k = 3 and m = 10,15,20,…,100, perform the following experiment T = 100 times: (i) draw a sample of size m and run the ERM algorithm on it; (ii) calculate the empirical error of the returned hypothesis; (iii) calculate the true error of the returned hypothesis. Plot the average empirical and true errors, averaged across the T runs, as a function of m. Discuss the results: do the empirical and true errors decrease or increase with m? Why?
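One possible exact computation of the true error, intersecting the hypothesis intervals with the regions where P[y = 1 | x] is constant (a sketch; helper names are illustrative):

```python
HIGH = [(0.0, 0.2), (0.4, 0.6), (0.8, 1.0)]   # P[y = 1 | x] = 0.8 here
LOW = [(0.2, 0.4), (0.6, 0.8)]                # P[y = 1 | x] = 0.1 here

def overlap(intervals, a, b):
    """Total length of the intersection of `intervals` with [a, b]."""
    return sum(max(0.0, min(u, b) - max(l, a)) for (l, u) in intervals)

def true_error(intervals):
    """Exact eP(hI) under the distribution above (x ~ Uniform[0, 1])."""
    err = 0.0
    for (a, b), p1 in [(seg, 0.8) for seg in HIGH] + [(seg, 0.1) for seg in LOW]:
        inside = overlap(intervals, a, b)   # length where h predicts 1; errs w.p. 1 - p1
        outside = (b - a) - inside          # length where h predicts 0; errs w.p. p1
        err += inside * (1.0 - p1) + outside * p1
    return err
```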
(d) Draw a data set of m = 1500 samples. Find the best ERM hypothesis for k = 1,2,…,20, and plot the empirical and true errors as a function of k. How does the error behave? Define k∗ to be the k with the smallest empirical error for ERM. Does this mean the hypothesis with k∗ intervals is a good choice?
(e) Now we will use the principle of structural risk minimization (SRM) to search for a k that gives good test error. Let δ = 0.1:
- Use your results from question 3 in the theoretical part and the VC confidence bound to construct a penalty function (a sketch follows this list).
- Draw a data set of m = 1500 samples and run the experiment from (d) again, but now plot two additional lines as a function of k: (1) the penalty for the best ERM hypothesis and (2) the sum of the penalty and the empirical error.
- What is the best value of k in each case? Is it better than the one you chose in (d)?
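A sketch of what the penalty function might look like, assuming the VC confidence bound takes the usual √((d + ln(2/δ))/m) form and d is the VC dimension you derived for Hk in theory question 3 (both the bound form and the default d are assumptions to adapt):

```python
import numpy as np

def penalty(k, m, delta=0.1, vcdim=None):
    """Confidence penalty for H_k (a sketch; substitute the exact bound and
    VC dimension from the theory part -- the form used here is an assumption)."""
    d = vcdim if vcdim is not None else 2 * k   # plug in your answer from question 3
    return np.sqrt((d + np.log(2.0 / delta)) / m)
```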
(f) Here we will use holdout validation to search for a k that gives good test error. Draw a data set of m = 1500 samples and use 20% of it as a holdout set (a sketch of the selection loop follows). Choose the best hypothesis based on 3 experiments. Discuss how close this gets you to finding the hypothesis with the optimal true error.
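A sketch of the holdout selection loop, reusing the predict helper above and assuming find_best_interval(xs, ys, k) returns the ERM intervals (adapt to its actual signature):

```python
import numpy as np

def holdout_select(xs, ys, ks=range(1, 21), frac=0.2, rng=None):
    """Hold out a fraction of the data, fit ERM for each k on the rest,
    and pick the k with the lowest holdout error."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(xs))
    cut = int(frac * len(xs))
    hold, train = idx[:cut], idx[cut:]
    order = np.argsort(xs[train])                   # the ERM code expects sorted points
    xt, yt = xs[train][order], ys[train][order]
    best_k, best_err = None, float("inf")
    for k in ks:
        intervals = find_best_interval(xt, yt, k)   # provided with the assignment
        err = np.mean([predict(intervals, x) != y
                       for x, y in zip(xs[hold], ys[hold])])
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```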







