Let $H_1, H_2, \dots$ be a sequence of hypothesis classes for binary classification.

Assume that there is a learning algorithm that implements the ERM rule in the realizable case such that the output hypothesis of the algorithm for each class $H_n$ only depends on $O(n)$ examples out of the training set. Furthermore, assume that such a hypothesis can be calculated given these $O(n)$ examples in time $O(n)$, and that the empirical risk of each such hypothesis can be evaluated in time $O(mn)$.

For example, if $H_n$ is the class of axis-aligned rectangles in $\mathbb{R}^n$, we saw that it is possible to find an ERM hypothesis in the realizable case that is defined by at most $2n$ examples.
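For concreteness, here is a minimal sketch of that rectangle learner. It is not code from the book: it assumes the sample is a NumPy array `X` of shape $(m, n)$ with labels `y` in $\{0, 1\}$ and at least one positive example, and `erm_rectangle` is a hypothetical name.

```python
import numpy as np

def erm_rectangle(X, y):
    # Tightest axis-aligned rectangle around the positive examples.
    # In the realizable case this is an ERM hypothesis: it contains every
    # positive point by construction, and it excludes every negative point
    # because it is contained in whichever rectangle labels the data.
    # Its bounds are determined by at most 2n of the m examples
    # (a coordinate-wise minimizer and maximizer per dimension).
    pos = X[y == 1]                            # assumes >= 1 positive example
    lo, hi = pos.min(axis=0), pos.max(axis=0)  # the at-most-2n defining values
    return lambda x: bool(np.all((lo <= x) & (x <= hi)))
```

Calling `erm_rectangle` on just the $2n$ defining examples recovers the same rectangle, which is the sense in which the hypothesis depends on only $O(n)$ examples.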

Prove that in such cases, it is possible to find an ERM hypothesis for $H_n$ in the unrealizable case in time $O(mn \cdot m^{O(n)})$.

Ben
  • I've added the "homework" tag to this problem because it looks like it is a homework problem from the book "Understanding Machine Learning" by Shalev-Shwartz & Ben-David. I'm guessing you would like a hint for this problem, rather than a fully worked answer? – John Doucette Oct 11 '19 at 02:13
  • @JohnDoucette Not exactly. I'm just reading the book. I came across the exercise, and I also have the solution to it, but I cannot understand that solution. Hope you can help me! – Ben Oct 11 '19 at 02:20

1 Answer


Here's a rough proof sketch that might be clearer, though probably less precise, than the solution you already have.

  1. We have a sequence of hypothesis classes, but the problem really only asks us to consider a single class $H_n$. A hypothesis class is a set of possible classification models that differ only in their parameter values.

  2. We are told that the algorithm for selecting a hypothesis from $H_n$ uses only $O(n)$ examples from the training set, and that, using those examples, it takes the algorithm $O(n)$ time to find a hypothesis minimizing the empirical risk over the class, provided that the minimal empirical risk is $0$ (i.e. the realizable case).

  3. Further, for any hypothesis the algorithm can output, we are told that the cost of determining its exact empirical risk is $O(mn)$, where $m$ is the total size of the training set, and $n$ is just the parameter associated with this particular hypothesis class $H_n$.

  4. Observe that, in the unrealizable case, we cannot use the algorithm directly to find the risk-minimizing hypothesis, because its $O(n)$ guarantee from step 2 holds only when the data is realizable. No other algorithm is mentioned, but we are also told (in step 2) that the algorithm can construct a hypothesis from $O(n)$ data points in $O(n)$ time, even if no hypothesis is consistent with those $O(n)$ points.

  5. Suppose we want to find the hypothesis with minimal empirical risk by brute force. We could enumerate every subset of the training set of the size $O(n)$ that the algorithm uses to construct a hypothesis. There are $O(m^{O(n)})$ such subsets, since the number of subsets of size $k$ is $\binom{m}{k} \le m^k$ (e.g. $m(m-1)/2$ possible pairs, $m(m-1)(m-2)/6$ possible triplets, etc.).

  6. Since there are $O(m^{O(n)})$ possible subsets of size $O(n)$, and the algorithm computes a candidate hypothesis from each in $O(n)$ time, it costs us $O(n \cdot m^{O(n)})$ to generate all the hypotheses the algorithm can produce for this dataset.

  7. Since it costs us $O(mn)$ time to compute the empirical risk of a hypothesis generated by this algorithm, we can minimize risk by computing the empirical risk of each of the $O(m^{O(n)})$ hypotheses from step 6, and then selecting the one with the lowest value (see the Python sketch after this list). This costs us $O(mn \cdot m^{O(n)})$ time. Our total runtime is then the sum of the cost of generating the hypotheses in step 6 and the cost of evaluating them here in step 7: $O(mn \cdot m^{O(n)}) + O(n \cdot m^{O(n)})$.

  8. Since $O(mn \cdot m^{O(n)}) + O(n \cdot m^{O(n)}) \in O(mn \cdot m^{O(n)})$, we have proven the desired bound on the runtime.
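Here is a minimal Python sketch of the whole search, under the assumptions of steps 2 and 3. The names `erm_bruteforce`, `fit`, and `risk` are hypothetical stand-ins for the subroutines the exercise assumes (the realizable-case learner and the empirical-risk evaluator); they are not from any actual library.

```python
from itertools import combinations

def erm_bruteforce(S, k, fit, risk):
    """Brute-force ERM for the unrealizable case.

    S    -- the training set, a list of m labelled examples
    k    -- the subset size, i.e. the O(n) examples the learner needs
    fit  -- assumed realizable-case learner: subset -> hypothesis, O(n) time
    risk -- assumed evaluator: empirical risk of h over all of S, O(mn) time
    """
    best_h, best_risk = None, float("inf")
    for size in range(k + 1):
        # O(m^k) subsets of size at most k in total (step 5)
        for subset in combinations(S, size):
            h = fit(subset)        # O(n) per subset (step 6)
            r = risk(h, S)         # O(mn) per hypothesis (step 7)
            if r < best_risk:
                best_h, best_risk = h, r
    return best_h
```

The loop body runs $O(m^{O(n)})$ times at $O(n) + O(mn)$ cost per iteration, giving the $O(mn \cdot m^{O(n)})$ total of step 8. Correctness rests on the idea (implicit in step 4) that an ERM hypothesis for the full sample can be produced by the algorithm from some $O(n)$-sized subset, so it appears among the candidates the loop generates.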

John Doucette