7

Is it possible to estimate the capacity of a neural network model? If so, what are the techniques involved?

nbro
jaeger6

2 Answers

7

VC dimension

A rigorous measure of the capacity of a neural network is the VC dimension, which is intuitively a number or bound that quantifies the difficulty of learning from data.

The sample complexity, i.e. the number of training instances the model (or learner) must see in order to be reasonably confident in the accuracy of its predictions, is proportional to this number.
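
For intuition, the classical PAC-learning bound (quoted here in its standard form, not taken from Sontag's paper, so treat the constants as indicative) relates the number of samples $N$ to the VC dimension $d$, the desired accuracy $\epsilon$ and the confidence $\delta$:

$$ N = \mathcal{O}\left( \frac{1}{\epsilon} \left( d \log \frac{1}{\epsilon} + \log \frac{1}{\delta} \right) \right) $$

so a larger VC dimension directly translates into a larger number of required training instances.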

The paper VC Dimension of Neural Networks (1998) by Eduardo D. Sontag provides a good introduction to the VC dimension of neural networks (even though these concepts are quite abstract, so you may need to read the paper several times to fully grasp them). The information in this answer is largely based on that paper.

Shattering and VC dimension

In section 2, Concepts and VC Dimension, he describes the basic concepts behind the VC dimension (not only for neural networks), such as shattering (i.e. what it means for a collection of sets to shatter another set), a well-known concept in computational learning theory that is used to define the VC dimension (see definition 2). You therefore need to be familiar with shattering in order to understand the VC dimension and, hence, the capacity of a neural network (as measured by the VC dimension).
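
To make the notion of shattering concrete, here is a minimal brute-force sketch (my own illustration, not from the paper): given a finite set of points and a finite collection of binary classifiers, it checks whether every one of the $2^k$ possible labellings of the $k$ points is realized by at least one classifier.

```python
def shatters(classifiers, points):
    """Return True if the classifiers shatter the points, i.e. if every one
    of the 2^k possible binary labellings of the k points is realized by
    at least one classifier in the (finite) collection."""
    realized = {tuple(f(x) for x in points) for f in classifiers}
    return len(realized) == 2 ** len(points)

# Toy example: threshold classifiers x -> 1[x >= t] on the real line.
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.5, 2.0, 10.0)]
print(shatters(thresholds, [0.0]))        # True: both labellings of one point occur
print(shatters(thresholds, [0.0, 1.0]))   # False: the labelling (1, 0) is never realized
```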

VC dimension of functions and neural networks

He then provides an equivalent definition of the VC dimension, but for functions (equation 6). Given that neural networks represent functions, we can also define the VC dimension of a neural network: a specific combination of weights of a neural network represents a specific function, for which the VC dimension can be defined. To be more precise, a parametrized function (and hence a neural network) can be denoted as

$$ \beta : \mathbb{W} \times \mathbb{U} \rightarrow \mathbb{R} $$

where $\mathbb{W} = \mathbb{R}^p$ and $p$ is the number of weights (or parameters) of the neural network, $\mathbb{U}$ is the input space and $\mathbb{R}$ the output space. So, in this case, $\beta$ can also represent a neural network, with a certain parameter space $\mathbb{W}$, an input space $\mathbb{U}$ and an output space $\mathbb{R}$.

The vector $\mathbf{w} = (w_1, \dots, w_p) \in \mathbb{W}$ represents a specific combination of weights of the neural network, so it represents a specific function. The set of all functions for each choice of this weight vector can be denoted as

$$ \mathcal{F}_{\beta} = \{ \beta(\mathbf{w}, \cdot) \mid \mathbf{w} \in \mathbb{W} \} $$

The VC dimension (VCD) of $\beta$ can then be defined as

$$ \text{VCD}(\beta) := \text{VCD}(\mathcal{F}_{\beta}) $$

Therefore, the VC dimension of a neural network with a certain architecture is a measure of its capacity: it is defined as the VC dimension of the set of functions $\mathcal{F}_{\beta}$ that the architecture can represent.
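
As a concrete illustration of this notation (my own sketch, with made-up numbers), a perceptron with $m$ inputs and $p = m + 1$ parameters is a parametrized function $\beta(\mathbf{w}, \cdot)$, and fixing the weight vector $\mathbf{w}$ selects one member of the family $\mathcal{F}_{\beta}$:

```python
import numpy as np

def beta(w, u):
    """Parametrized perceptron beta(w, u): w = (w_1, ..., w_m, bias), u in R^m.
    Returns the thresholded affine value 1[w_1 u_1 + ... + w_m u_m + bias >= 0]."""
    return int(np.dot(w[:-1], u) + w[-1] >= 0)

w = np.array([1.0, -2.0, 0.5])   # one point of the parameter space W = R^3
f = lambda u: beta(w, u)         # a single function beta(w, .) in the family F_beta
print(f(np.array([3.0, 1.0])))   # 1, since 1*3 - 2*1 + 0.5 = 1.5 >= 0
```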

How to calculate the VC dimension?

Calculating the actual VC dimension of a neural network takes a bit more creativity, so I will just report the VC dimension of some neural networks. For more details, you should fully read the cited paper (more than once), as well as other papers and books (especially the ones described in this answer, which provide an introduction to computational learning theory).

VC dimension of a perceptron

The VC dimension of a perceptron is $m + 1$, where $m$ is the number of inputs. Given that a perceptron computes a thresholded affine function of its inputs, its VC dimension is also equal to its number of parameters ($m$ weights plus one bias). However, note that, even though the VC dimension of the perceptron is linear in the number of parameters and inputs, this doesn't mean the perceptron can learn any function. In fact, perceptrons can only represent linearly separable functions (linear decision boundaries). See section 3.1 of VC Dimension of Neural Networks for more details.
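
The following brute-force sketch (mine, not from the paper; it uses scipy's linear-programming routine as a linear-separability test) illustrates the $m + 1$ result for $m = 2$: three points in general position can be shattered by a perceptron, while the four XOR points cannot.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Feasibility check: is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?"""
    X = np.asarray(points, dtype=float)
    y = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    # Unknowns are (w_1, ..., w_m, b); each constraint is -y_i * (w . x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

def shattered_by_perceptron(points):
    """True if every dichotomy of the points is realized by some perceptron."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([0, 1], repeat=len(points)))

print(shattered_by_perceptron([(0, 0), (1, 0), (0, 1)]))          # True: m + 1 = 3 points
print(shattered_by_perceptron([(0, 0), (1, 1), (1, 0), (0, 1)]))  # False: XOR is not separable
```

(Strictly speaking, showing that no set of 4 points in the plane can be shattered requires a separate argument, e.g. Radon's theorem; the XOR example only shows that this particular set of 4 points is not shattered.)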

VC dimension of a single hidden layer neural network

Let $n$ be the number of hidden units, then the VC dimension of a single hidden layer neural network is less than or equal to $n+1$. See section 3.2 of VC Dimension of Neural Networks for more details.

VC dimension of multi-layer neural networks with binary activations

The VC dimension of multi-layer perceptrons (MLPs) with binary activations and $p$ weights (or parameters) is $\mathcal{O}(p \log p)$. See theorem 4 (and related sections) of the paper VC Dimension of Neural Networks for more details.

VC dimension of MLPs with real-valued activations

The VC dimension of MLPs with real-valued activations is no longer bounded by $\mathcal{O}(p \log p)$ and can be exponential in the number of parameters. See section 5.3 of VC Dimension of Neural Networks.

VC dimension of MLPs with linear activations

The VC dimension of MLPs with linear activations is $\mathcal{O}(p^2)$. See theorem 5 of the paper VC Dimension of Neural Networks.

Notes

The VC dimension is often expressed as a bound (e.g. with big-O notation), which may not be tight.

In any case, the VC dimension is useful because it provides guarantees. For example, if you use the VC dimension to derive an upper bound on the number of samples required to learn a certain task, then you have a precise mathematical formula guaranteeing that you will not need more samples than the bound indicates in order to achieve a small generalization error. In practice, you may need far fewer samples, because these bounds (and the VC dimension itself) may not be tight.
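
For reference, a common form of such a guarantee is the classical Vapnik-style bound (quoted from memory, so treat the exact constants as indicative): with probability at least $1 - \delta$ over a training set of $N$ i.i.d. samples, every function $f$ in a class of VC dimension $d$ satisfies

$$ R(f) \leq \hat{R}(f) + \sqrt{\frac{d \left( \ln \frac{2N}{d} + 1 \right) + \ln \frac{4}{\delta}}{N}}, $$

where $R(f)$ is the true (generalization) error and $\hat{R}(f)$ is the training error.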

Further reading

There is a more recent paper (published in 2017 in the Proceedings of Machine Learning Research) that proves new and tighter upper and lower bounds on the VC dimension of deep neural networks with the ReLU activation function: Nearly-tight VC-dimension bounds for piecewise linear neural networks. So, you should probably read this paper first.

The paper On Characterizing the Capacity of Neural Networks using Algebraic Topology may also be useful and interesting. See also section 6, Algebraic Techniques, of the paper I have been citing: VC Dimension of Neural Networks.

The capacity of a neural network is clearly related to the number of functions it can represent, so it is closely related to the universal approximation theorems for neural networks. See Where can I find the proof of the universal approximation theorem?.

nbro
  • The value of n+1 for a single hidden layer neural network holds only if the input weights and biases are constant. The model then becomes an affine function of the output weights, so its VC dimension is n+1 (including the bias term c0). – Maharshi Roy Jul 25 '21 at 07:15
0

Most methods for measuring the complexity of neural networks are fairly crude. One common measure of complexity is the VC dimension, a discussion of which can be found here and here. For example, neural networks have a VC dimension that is too large to give a useful upper bound on the number of training samples needed: the upper bound provided by VC analysis is much higher than the number of samples from which we have observed neural networks to generalize well.

Another common measure of capacity is the number of parameters. The paper "Understanding deep learning requires rethinking generalization", published at ICLR and cited over 1400 times, shows that networks with more parameters than data points often have the capacity to memorize the data. The paper provides compelling evidence that the traditional approaches to generalization from statistical learning theory (VC dimension, Rademacher complexity) are unable to fully explain the apparent capacity of neural networks. In general, neural networks seem to have a large capacity, given their apparently good performance on certain tasks.
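
If you want to use the parameter count as a rough proxy for capacity, it is straightforward to compute; here is a minimal sketch (my own, with made-up layer sizes) for a fully connected network:

```python
def num_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network, where
    layer_sizes lists the width of every layer, input layer included."""
    return sum(n_in * n_out + n_out              # weight matrix plus bias vector
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(num_parameters([784, 256, 10]))  # 203530 parameters, often far more than the number of training samples
```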

Beyond these ideas, the universal approximation theorem tells us that neural networks can approximate any continuous function (on a compact domain) arbitrarily well, which strongly suggests that the class of neural networks has a very large capacity.

Anon