Description
Theory Questions
- Singletons (Section 3.5, Ex. 2 in the course textbook). Let X be a discrete domain, and let HSingleton = {hz : z ∈ X} ∪ {h−}, where for each z ∈ X, hz is the function defined by hz(x) = 1 if x = z and hz(x) = 0 if x ≠ z. h− is simply the all-negative hypothesis, namely ∀x ∈ X, h−(x) = 0.
The realizability assumption here implies that the true hypothesis f labels negatively all examples in the domain, perhaps except one.
Describe an algorithm that implements the ERM rule for learning HSingleton in the realizable setup.
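For illustration, a minimal sketch of one way such an ERM rule could look in Python, assuming a realizable sample of (x, y) pairs (function and variable names are illustrative):

```python
def erm_singleton(sample):
    """ERM for HSingleton on a realizable sample of (x, y) pairs.

    Under realizability, all positive examples share a single value z,
    so hz is consistent if any positive example exists; otherwise h- is.
    """
    positives = {x for (x, y) in sample if y == 1}
    if positives:
        z = positives.pop()                    # realizability: at most one distinct z
        return lambda x: 1 if x == z else 0    # the singleton hypothesis hz
    return lambda x: 0                         # the all-negative hypothesis h-
```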
- PAC in Expectation. Consider learning in the realizable case. We say a hypothesis class H is PAC learnable in expectation if there exist a learning algorithm A and a function N : (0,1) → N such that for every a ∈ (0,1) and for any distribution P, given a sample set S with |S| ≥ N(a), it holds that

E[eP(A(S))] ≤ a,

where the expectation is over the draw of S.
Show that H is PAC learnable if and only if H is PAC learnable in expectation. (Hint: use Markov's inequality, and refer to the derivations between equations 3.13 and 3.14 in the lecture scribes on VC dimension.)
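For one direction of the equivalence, a worked application of Markov's inequality (a sketch; the choice a = εδ is what makes it go through):

```latex
% If H is PAC learnable in expectation with sample complexity N(a), set a = eps*delta.
% For |S| >= N(eps*delta), Markov's inequality gives:
\[
\Pr\big[ e_P(A(S)) \ge \epsilon \big]
  \le \frac{\mathbb{E}\big[ e_P(A(S)) \big]}{\epsilon}
  \le \frac{\epsilon\,\delta}{\epsilon}
  = \delta .
\]
```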
- Union Of Intervals. Determine the VC-dimension of the subsets of the real line formed by the union of k intervals (see question 1 of the programming assignment for a formal definition of H).
- Right-angle Triangles. Determine the VC-dimension of the hypothesis class H of axis-aligned right-angle triangles in the plane, with the right angle in the lower-left corner.
- Structural Risk Minimization. Let H be a countable hypothesis class, that is, H can be written as H = ∪i∈N {hi}. Let w : H → [0,1] be a function such that Σh∈H w(h) ≤ 1. We refer to w as a weight function over the hypotheses, which reflects the prior for each hypothesis.
(a) Show that with probability 1 − δ over the choice of S ∼ P (|S| = n), for every h ∈ H simultaneously:

|eP(h) − eS(h)| ≤ √( ln(2/(w(h)·δ)) / (2n) ).

Hint: use the uniform convergence property for each hypothesis class {hi} of size 1, with confidence parameter w(hi)·δ.
(b) Denote by ĥ the hypothesis returned by the SRM rule, ĥ = argminh∈H [eS(h) + √( ln(2/(w(h)·δ)) / (2n) )]. Show that with probability 1 − δ:

eP(ĥ) ≤ eP(hmin) + 2·√( ln(2/(w(hmin)·δ)) / (2n) ),

where hmin = argminh∈H eP(h).
This implies that when the hypothesis with optimal error has high weight, it will be learned from few samples.
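One way to organize the proof of (b), chaining the bound from (a) (a sketch; pen(h) abbreviates the confidence term):

```latex
% Sketch: write pen(h) = sqrt( ln(2/(w(h)*delta)) / (2n) ).
% On the event of (a), which holds with probability 1 - delta:
\begin{align*}
e_P(\hat h) &\le e_S(\hat h) + \mathrm{pen}(\hat h)         && \text{by (a)} \\
            &\le e_S(h_{\min}) + \mathrm{pen}(h_{\min})     && \text{$\hat h$ minimizes the penalized empirical risk} \\
            &\le e_P(h_{\min}) + 2\,\mathrm{pen}(h_{\min})  && \text{by (a) again.}
\end{align*}
```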
- Loss Minimization. Consider binary classification with the following loss function:
∆b(y, ŷ) =
    0 if ŷ = y
    0.5 if y = 1, ŷ = 0
    1 if y = 0, ŷ = 1
Find the optimal prediction rule for this loss, i.e., the predictor minimizing the expected loss E[∆b(y, h(x))], as a function of P[y = 1 | x].
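A worked comparison of the two conditional risks under this loss (a sketch of the computation, writing η(x) = P[y = 1 | x]):

```latex
% Conditional expected loss of each prediction at a point x with \eta = P[y=1|x]:
%   predicting \hat y = 0 incurs 0.5 on the event y = 1:  risk 0.5\,\eta
%   predicting \hat y = 1 incurs 1 on the event y = 0:    risk 1 - \eta
% Predicting 1 is therefore optimal whenever 1 - \eta \le 0.5\,\eta, i.e. \eta \ge 2/3:
\[
h^*(x) =
\begin{cases}
1 & \text{if } P[y = 1 \mid x] \ge 2/3 \\
0 & \text{otherwise}
\end{cases}
\]
```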
Programming Assignment
- Union Of Intervals. In this question, we will study the hypothesis class of a finite union of disjoint intervals, and the properties of the ERM algorithm for this class.
To review, let the sample space be X = [0,1] and assume we study a binary classification problem, i.e. Y = {0,1}. We will try to learn using a hypothesis class that consists of k intervals. More explicitly, let I = {[l1,u1],…,[lk,uk]} be k disjoint intervals, such that 0 ≤ l1 ≤ u1 ≤ l2 ≤ u2 ≤ … ≤ uk ≤ 1. For each such set of k disjoint intervals, define the corresponding hypothesis as

hI(x) = 1 if x ∈ [l1,u1] ∪ … ∪ [lk,uk]
hI(x) = 0 otherwise
Finally, define Hk as the hypothesis class that consists of all hypotheses that correspond to k disjoint intervals:
Hk = {hI|I = {[l1,u1],…,[lk,uk]}, 0 ≤ l1 ≤ u1 ≤ l2 ≤ u2 ≤ … ≤ uk ≤ 1}
We note that Hk ⊆ Hk+1, since we can always take one of the intervals to be of length 0 by setting its endpoints to be equal. We are given a sample of size m: S = ⟨x1,y1⟩,…,⟨xm,ym⟩. Assume that the points are sorted, so that 0 ≤ x1 < x2 < … < xm ≤ 1.
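For concreteness, a minimal Python sketch of how such a hypothesis classifies a point (representing each interval as an (l, u) pair is an assumption carried through the sketches below):

```python
def predict(intervals, x):
    """Evaluate hI(x): label 1 iff x lies in one of the k disjoint intervals.

    `intervals` is a list of (l, u) pairs with 0 <= l1 <= u1 <= ... <= uk <= 1.
    """
    return 1 if any(l <= x <= u for (l, u) in intervals) else 0
```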
(a) Assume that the true distribution P[x,y] = P[y|x] · P[x] is: x is distributed uniformly on the interval [0,1], and
P[y = 1|x] = 0.8 if x ∈ [0,0.2] ∪ [0.4,0.6] ∪ [0.8,1]
P[y = 1|x] = 0.1 if x ∈ [0.2,0.4] ∪ [0.6,0.8]
and P[y = 0|x] = 1 − P[y = 1|x].
Write a function that draws m pairs of (x,y) according to the distribution P (see the sketch after this list). Use it to draw a sample of size m = 100 and create a plot:
- Plot the points and their labels (keep the y-axis in the range [−0.1, 1.1] for clarity of presentation).
- Mark the lines x = 0.2,0.4,0.6,0.8 clearly on the plot.
- Run the find_best_interval function on your sample with k = 3, and plot the intervals clearly.
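A minimal sampling sketch, assuming NumPy (the name draw_sample and its return convention are illustrative):

```python
import numpy as np

def draw_sample(m, rng=None):
    """Draw m pairs (x, y): x ~ Uniform[0, 1], y ~ Bernoulli(P[y = 1 | x])."""
    rng = rng or np.random.default_rng()
    xs = np.sort(rng.uniform(0.0, 1.0, size=m))   # sorted, as the ERM code expects
    # P[y = 1 | x] is 0.8 on [0, 0.2] ∪ [0.4, 0.6] ∪ [0.8, 1] and 0.1 elsewhere
    high = (xs <= 0.2) | ((xs >= 0.4) & (xs <= 0.6)) | (xs >= 0.8)
    p1 = np.where(high, 0.8, 0.1)
    ys = (rng.uniform(size=m) < p1).astype(int)
    return xs, ys
```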
(b) Note that here we know the true distribution P, so for every given hypothesis h ∈ Hk we can calculate error(h) precisely. What is the hypothesis with the smallest error?
(c) Write a function that, given a list of intervals, calculates the true error of the corresponding hypothesis (a sketch follows this item). Then, for k = 3 and m = 10,15,20,…,100, perform the following experiment T = 100 times: (i) draw a sample of size m and run the ERM algorithm on it; (ii) calculate the empirical error of the returned hypothesis; (iii) calculate the true error of the returned hypothesis. Plot the average empirical and true errors, averaged across the T runs, as a function of m. Discuss the results: do the empirical and true errors decrease or increase with m? Why?
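One possible exact computation of the true error, intersecting the hypothesis intervals with the regions where P[y = 1 | x] is constant (a sketch; helper names are illustrative):

```python
HIGH = [(0.0, 0.2), (0.4, 0.6), (0.8, 1.0)]   # P[y = 1 | x] = 0.8 here
LOW = [(0.2, 0.4), (0.6, 0.8)]                # P[y = 1 | x] = 0.1 here

def overlap(intervals, a, b):
    """Total length of the intersection of `intervals` with [a, b]."""
    return sum(max(0.0, min(u, b) - max(l, a)) for (l, u) in intervals)

def true_error(intervals):
    """Exact eP(hI) under the distribution above (x ~ Uniform[0, 1])."""
    err = 0.0
    for (a, b), p1 in [(seg, 0.8) for seg in HIGH] + [(seg, 0.1) for seg in LOW]:
        inside = overlap(intervals, a, b)   # length where h predicts 1; errs w.p. 1 - p1
        outside = (b - a) - inside          # length where h predicts 0; errs w.p. p1
        err += inside * (1.0 - p1) + outside * p1
    return err
```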
(d) Draw a data set of m = 1500 samples. Find the best ERM hypothesis for k = 1,2,…,20, and plot the empirical and true errors as a function of k. How does the error behave? Define k∗ to be the k with the smallest empirical error for ERM. Does this mean the hypothesis with k∗ intervals is a good choice?
(e) Now we will use the principle of structural risk minimization (SRM) to search for a k that gives good test error. Let δ = 0.1:
- Use your results from question 3 in the theoretical part and the VC confidence bound to construct a penalty function (a sketch follows this list).
- Draw a data set of m = 1500 samples and run the experiment from (d) again, but now plot two additional lines as a function of k: (1) the penalty for the best ERM hypothesis and (2) the sum of the penalty and the empirical error.
- What is the best value of k in each case? Is it better than the one you chose in (d)?
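A sketch of what the penalty function might look like, assuming the VC confidence bound takes the usual √((d + ln(2/δ))/m) form and d is the VC dimension you derived for Hk in theory question 3 (both the bound form and the default d are assumptions to adapt):

```python
import numpy as np

def penalty(k, m, delta=0.1, vcdim=None):
    """Confidence penalty for H_k (a sketch; substitute the exact bound and
    VC dimension from the theory part -- the form used here is an assumption)."""
    d = vcdim if vcdim is not None else 2 * k   # plug in your answer from question 3
    return np.sqrt((d + np.log(2.0 / delta)) / m)
```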
(f) Here we will use holdout validation to search for a k that gives good test error. Draw a data set of m = 1500 samples and use 20% of it as a holdout set (a sketch of the selection loop follows). Choose the best hypothesis based on 3 experiments. Discuss how close this gets you to finding the hypothesis with the optimal true error.
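A sketch of the holdout selection loop, reusing the predict helper above and assuming find_best_interval(xs, ys, k) returns the ERM intervals (adapt to its actual signature):

```python
import numpy as np

def holdout_select(xs, ys, ks=range(1, 21), frac=0.2, rng=None):
    """Hold out a fraction of the data, fit ERM for each k on the rest,
    and pick the k with the lowest holdout error."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(xs))
    cut = int(frac * len(xs))
    hold, train = idx[:cut], idx[cut:]
    order = np.argsort(xs[train])                   # the ERM code expects sorted points
    xt, yt = xs[train][order], ys[train][order]
    best_k, best_err = None, float("inf")
    for k in ks:
        intervals = find_best_interval(xt, yt, k)   # provided with the assignment
        err = np.mean([predict(intervals, x) != y
                       for x, y in zip(xs[hold], ys[hold])])
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```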







