
Neural networks typically have a $\mathcal{VC}$ dimension that is proportional to their number of parameters and inputs. For example, see the papers Vapnik-Chervonenkis Dimension of Recurrent Neural Networks (1998) by Pascal Koiran and Eduardo D. Sontag and VC Dimension of Neural Networks (1998) by Eduardo D. Sontag for more details.
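
For concreteness, the standard definition I have in mind is the following: a class $\mathcal H$ of $\{0,1\}$-valued hypotheses *shatters* a finite set $\{x_1,\dots,x_k\}$ if every one of the $2^k$ possible labelings of it is realized by some $h\in\mathcal H$, and

$$\mathcal{VC}(\mathcal H)=\sup\left\{k\in\mathbb N : \text{there exist } x_1,\dots,x_k \text{ shattered by } \mathcal H\right\}.$$

(For real-valued networks, one applies this to the classifiers obtained by thresholding the output.)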

On the other hand, the universal approximation theorem (UAT) tells us that neural networks can approximate any continuous function. See Approximation by Superpositions of a Sigmoidal Function (1989) by G. Cybenko for more details.
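
Concretely, the version I am referring to (Theorem 2 of Cybenko's paper, paraphrased) says: if $\sigma$ is any continuous sigmoidal function, then finite sums of the form

$$G(x)=\sum_{j=1}^{N}\alpha_j\,\sigma\!\left(y_j^{T}x+\theta_j\right), \qquad N\in\mathbb N,\ \alpha_j,\theta_j\in\mathbb R,\ y_j\in\mathbb R^n,$$

are dense in $C(I_n)$, the space of continuous functions on the unit cube $I_n=[0,1]^n$, with respect to the supremum norm; that is, for every continuous $f$ and every $\varepsilon>0$ there is such a finite sum $G$ with $|G(x)-f(x)|<\varepsilon$ for all $x\in I_n$.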

Although I realize that the typical UAT only applies to continuous functions, the UAT and the results about the $\mathcal{VC}$ dimension of neural networks seem, at first sight, somewhat contradictory; the tension is only apparent, though, and it disappears once you look carefully at the definition of the $\mathcal{VC}$ dimension and at what the UAT actually implies.

So, how can neural networks approximate any continuous function while, at the same time, usually having a $\mathcal{VC}$ dimension that is only proportional to their number of parameters? What is the relationship between the two?

nbro
  • So the definition is that, for every $\epsilon > 0$, there exists a neural network of width $N$ which can approximate the function to accuracy $\epsilon$. It is pretty obvious that, if $\epsilon = 0$, the width will have to be infinite. Even convergent mathematical series require an infinite number of terms to converge to a function exactly. –  Apr 13 '20 at 00:56

1 Answer


VC Dimension of Neural Networks establishes VC bounds that depend on the number of weights, whereas the UAT refers to a class of neural networks in which the number of weights a particular network can have is not bounded, although each individual network has finitely many.
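
One way to make this explicit (the notation $S_N$ and the parameter count are my own additions): let $S_N$ be the subclass of the set $S$ defined below in which the number of hidden units is fixed to $N$. Then

$$S=\bigcup_{N\in\mathbb N} S_N, \qquad W_N = N(n+2) \ \text{parameters per network in } S_N,$$

and the cited VC bounds apply to each $S_N$ separately (with outputs thresholded to obtain $\{0,1\}$-valued classifiers), giving a finite bound that grows with $W_N$. They say nothing about $S$ itself: since the $S_N$ are nested, $\mathcal{VC}(S)=\sup_N \mathcal{VC}(S_N)$, and this supremum can be, and as argued below is, infinite.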

I think that we can show, from theorem 2 and the observations below theorem 3 in Approximation by Superpositions of a Sigmoidal Function, that the VC dimension of

$$S=\left\{\sum_{i=1}^N \alpha_i\sigma(y_i^T x + \theta_i) : N\in\mathbb N, \alpha_i, \theta_i \in\mathbb R, y_i\in\mathbb{R}^n \right\}$$

is infinite.
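
To make the objects concrete before the argument, here is a minimal sketch in Python (the names `sigma` and `member_of_S` are chosen just for this illustration) of what an element of $S$ looks like: a single hidden layer of arbitrary width $N$, whose output is a plain linear combination of sigmoid units.

```python
import numpy as np

def sigma(z):
    # logistic sigmoid; any continuous sigmoidal function works here
    return 1.0 / (1.0 + np.exp(-z))

def member_of_S(alpha, Y, theta):
    """Return the function x -> sum_i alpha_i * sigma(y_i^T x + theta_i).

    alpha: shape (N,), Y: shape (N, n) stacking the weight vectors y_i,
    theta: shape (N,). N is arbitrary, which is exactly what distinguishes
    S from the fixed-architecture classes that the VC bounds apply to.
    """
    def f(x):
        return alpha @ sigma(Y @ x + theta)
    return f

# Example member with N = 3 hidden units on inputs in R^2 (values made up).
f = member_of_S(np.array([1.0, -2.0, 0.5]),
                np.array([[1.0, 0.0], [0.5, -1.0], [2.0, 2.0]]),
                np.array([0.0, 1.0, -0.5]))
print(f(np.array([0.3, -0.7])))
```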

Let $\{(x_i, \ell_i)\}_{i=1}^k$ be a sample of arbitrary size $k\in\mathbb N$, with distinct points $x_i\in\mathbb R^n$ and labels $\ell_i\in\{0,1\}$ (I write $\ell_i$ for the labels to avoid a clash with the weight vectors $y_i$ in the definition of $S$), and let us see that there is a function in $S$ which classifies it correctly; since the labeling is arbitrary, this shows that $S$ shatters $\{x_i\}_{i=1}^k$.

We write $B(x, \varepsilon) := \{ y\in\mathbb{R}^n : d(x,y) < \varepsilon \}$ for the open ball of radius $\varepsilon$ centered at $x$.

First, let $\varepsilon > 0$ be small enough that $B(x_i, \varepsilon)\cap B(x_j, \varepsilon) = \emptyset$ whenever $i \ne j$ (this is possible because the $x_i$ are distinct).

Now define $D = \bigcup_{\ell_i=1} B(x_i, \varepsilon)$, the union of the balls around the positively labeled points. Define $f_{\varepsilon}(x)$ as in the observations below theorem 3 of Cybenko's paper, and use theorem 2 to find a function $G(x)$ in $S$ that correctly classifies all points at distance at least $\varepsilon$ from the boundary of $D$. In particular, $G$ classifies every point of the sample correctly: each positively labeled $x_i$ is the centre of one of the balls making up $D$, hence at distance $\varepsilon$ from its boundary, and each negatively labeled $x_j$ is at distance at least $\varepsilon$ from every such ball, because the balls are pairwise disjoint. Since $k$ and the labeling were arbitrary, $S$ shatters sets of every finite size, i.e., its VC dimension is infinite.
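
As a sanity check of the claim (not of Cybenko's proof, which is non-constructive), here is a toy numerical illustration in one dimension, under assumptions I am adding myself: I use the logistic sigmoid for $\sigma$ and build, for each positively labeled point, a narrow "bump" out of two steep sigmoids, which is again an element of $S$; the names `sigma`, `G`, `eps` and `a` are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    # logistic sigmoid (clipped for numerical stability at extreme arguments)
    z = np.clip(z, -60.0, 60.0)
    return 1.0 / (1.0 + np.exp(-z))

# An arbitrary 1-D sample of size k with an arbitrary binary labeling.
k = 10
x = np.sort(rng.uniform(-5.0, 5.0, size=k))   # distinct points (almost surely)
labels = rng.integers(0, 2, size=k)           # arbitrary labeling

# Choose eps so that the balls B(x_i, eps) are pairwise disjoint,
# and a steepness a large enough that sigma(a * eps) is essentially 1.
eps = 0.4 * np.min(np.diff(x))
a = 50.0 / eps

def G(t):
    # A member of S: a finite sum of terms alpha_i * sigma(y_i * t + theta_i).
    # Each positively labeled point contributes a narrow bump, i.e. the
    # difference of two steep sigmoids, which is ~1 on (x_i - eps, x_i + eps)
    # and ~0 elsewhere.
    out = np.zeros_like(t, dtype=float)
    for xi, li in zip(x, labels):
        if li == 1:
            out += sigma(a * (t - (xi - eps))) - sigma(a * (t - (xi + eps)))
    return out

pred = (G(x) > 0.5).astype(int)
print("labels:", labels)
print("pred:  ", pred)
assert np.array_equal(pred, labels)   # the arbitrary labeling is realized
```

The point is only that, because the width is allowed to grow with $k$, an arbitrary labeling of arbitrarily many distinct points can be realized, which is precisely what an infinite VC dimension requires.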

Dani
  • If you look at Cybenko's paper, you will see that the number of weights isn't really unbounded or infinite. The theorem holds for a finite but arbitrary $N$, where $N$ is the number of units in the hidden layer. The question is: in the limit, i.e. as $N \to \infty$, will the neural network have VC dimension $\infty$? Can you prove this from Cybenko's theorem? – nbro Apr 19 '20 at 12:41
  • Sorry, I can't see that it holds for a fixed $N$. In particular, the proof of theorem 1 requires that $S$ is a subspace of $\mathcal C(I_n)$, which I don't think holds if $N$ is fixed. However, defining $S$ to be the set of (finite but arbitrarily large) linear combinations of that form, it is trivially a subspace. – Dani Apr 20 '20 at 18:34
  • I didn't say "fixed", I said "arbitrary". – nbro Apr 20 '20 at 18:38
  • Maybe we aren't understanding each other. Originally, I meant that the class of functions for which the UAT applies is $\{ \sum_{i=1}^N \alpha_i \sigma(y_i^T x + \theta_i) : N\in \mathbb{N}, \alpha_i, \theta_i \in \mathbb{R}, y_i\in \mathbb{R}^n \}$. – Dani Apr 20 '20 at 18:45
  • Ok. But then why do you say that "the UAT refers to a class of neural networks with _unbounded_ number of weights."? – nbro Apr 20 '20 at 18:48
  • I meant that the number of weights that a function in the class can have is not bounded, i.e., it contains functions with an arbitrarily large number of weights. I will edit the answer to be clearer. – Dani Apr 20 '20 at 18:53
  • Ok, but your answer still doesn't really try to clarify the relationship between the two, apart from saying that in the case of UAT the number of hidden units can be arbitrarily big and that the VC dimension depends on the number of weights. I already knew this. I wanted to understand more. I wanted to see a formal proof that relates the two. Maybe I should have been more explicit, in fact. I realise now that my question wasn't clear enough. – nbro Apr 20 '20 at 19:48
  • So, let me try to understand what you're trying to show here. You have $S$, which is a set of functions or, more precisely, the hypothesis class of a neural network. You have some labeled dataset of arbitrary size. You want to show that we can find a function in $S$ that approximates the function represented by the labeled dataset (which you denote, at the end of your answer, as $G(x)$, like in Cybenko's paper). So, **1.** you define some _ball_ (question 1: intuitively, why do you do this?). – nbro Jan 22 '21 at 16:21
  • Then **2.** you let $\epsilon$ be such that the balls of different inputs do not intersect (question 2: why do you do this?). Then **3.** you define the unions of all balls, i.e. $D$ (question 3: why do you do this? I suppose this is related to Cybenko's proof, which I don't recall anymore, but it would be nice if you can provide more details here about these choices in the proof, so that this answer self-contained; ideally, it would be nice if you recalled Cybenko's proofs or theorems, before proceeding to your theorem and proof). – nbro Jan 22 '21 at 16:22
  • Then **4.** you say that we should define $f_\epsilon(x)$ in some way and then find $G(x)$ in $S$ as described in theorem 2. Then **5.** you **claim** that $G(x)$ can correctly classify all points $\epsilon$-away from the boundary of $D$, i.e. I suppose inside $D$, which you defined as the union of all balls. – nbro Jan 22 '21 at 16:26
  • However, how is this related to my original question "_how come that neural networks approximate any continuous function, but, at the same time, they usually have a VC dimension that is only proportional to their number of parameters?_"? How is your proof intuitively related to this? Is this because $k$ is arbitrary? I don't see how the VC dimension of neural networks is involved in your proof. I don't see how you're showing that the VC dimension of $S$ (the neural network) is infinite here. – nbro Jan 22 '21 at 16:30