9

I know it's not an exact science. But would you say that generally for more complicated tasks, deeper nets are required?

nbro
Gilad Deutsch
  • The gist is that you can't just pick any numbers for width and depth and expect optimal performance. Picking the right size network is somewhat trial-and-error, and somewhat of an art. – MaxW Apr 28 '20 at 23:13

4 Answers

14

Deeper models can have advantages (in certain cases)

Most people will answer "yes" to your question; see, for example, Why are neural networks becoming deeper, but not wider? and Why do deep neural networks work well?.

In fact, there are cases where deep neural networks have certain advantages compared to shallow ones. For example, see the following papers

What about the width?

The following papers may be relevant

Bigger models have bigger capacity but also have disadvantages

Vladimir Vapnik (co-inventor of VC theory and SVMs, and one of the most influential contributors to learning theory), who is not a fan of neural networks, will probably tell you that you should look for the smallest model (set of functions) that is consistent with your data (i.e. an admissible set of functions).

For example, watch this podcast Vladimir Vapnik: Statistical Learning | Artificial Intelligence (AI) Podcast (2018), where he says this. His new learning theory framework based on statistical invariants and predicates can be found in the paper Rethinking statistical learning theory: learning using statistical invariants (2019). You should also read "Learning Has Just Started" – an interview with Prof. Vladimir Vapnik (2014).

Bigger models have a bigger capacity (i.e. a bigger VC dimension), which means that you are more likely to overfit the training data, i.e., the model may not really be able to generalize to unseen data. So, in order not to overfit, models with more parameters (and thus more capacity) also require more data. You should also ask yourself why people use regularisation techniques.
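
To make the capacity/overfitting trade-off concrete, here is a small illustrative sketch (not from the answer; it uses NumPy with synthetic data and arbitrary constants): a higher-capacity polynomial fits a small training set almost perfectly but does worse on held-out data than a lower-capacity one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Noisy observations of a simple underlying function.
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.3 * rng.normal(size=n)

x_train, y_train = sample(15)      # small training set
x_test, y_test = sample(1000)      # held-out data

for degree in (3, 14):             # low vs. high capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The degree-14 fit nearly interpolates the 15 training points,
    # but typically has a much larger error on the test set.
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```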

In practice, models that achieve state-of-the-art performance can be very deep, but they are also computationally inefficient to train and they require huge amounts of training data (either manually labeled or automatically generated).

Moreover, there are many other technical complications with deeper neural networks, for example, problems such as the vanishing (and exploding) gradient problem.

Complex tasks may not require bigger models

Some people will tell you that you require deep models because, empirically, some deep models have achieved state-of-the-art results, but that's probably because we haven't found cleverer and more efficient ways of solving these problems.

Therefore, I would not say that "complex tasks" (whatever the definition is) necessarily require deeper or, in general, bigger models. While designing our models, it may be a good idea to always keep in mind principles like Occam's razor!

A side note

As a side note, I think that more people should focus on the mathematical aspects of machine learning, i.e. computational and statistical learning theory. There are too many practitioners who don't really understand the underlying learning theory, and too few theorists, and progress could soon stagnate because of a lack of understanding of the underlying mathematical concepts.

To give you a more concrete idea of the current mentality of the deep learning community, in this lesson, a person like Ilya Sutskever, who is considered an "important and leading" researcher in deep learning, talks about NP-complete problems as if he doesn't really know what he's talking about. NP-complete problems aren't just "hard problems". NP-completeness has a very specific definition in computational complexity theory!

nbro
  • I'm having trouble following this answer. Are you saying that the answer might be "yes" in practice, but in theory we might find ways to solve more complex problems with shallower nets? – The Guy with The Hat Apr 28 '20 at 05:39
  • @TheGuywithTheHat Roughly, yes. Current methods that have solved "complex tasks" aren't particularly clever and don't scale well. If you have to use huge amounts of data or spend so much time and resources to train a net, or if training pollutes the environment so much, you're probably not doing the cleverest thing. We still don't understand many things about NNs. With a more solid understanding of learning theory, we will be able to come up with cleverer solutions. – nbro Apr 28 '20 at 12:24
  • Actually there is evidence that increasing the number of parameters can, paradoxically, reduce the overfitting error. One paper I looked at presented a theory that the smallest model that can be trained to fit the data represents a local maximum in generalization error: you either want a smaller model (to reduce VC dimension) or a larger model (for reasons that are not entirely clear; perhaps an overparametrized model ends up behaving a bit like an ensemble when trained with current methods). But mostly this chain of reasoning is concerned with the width rather than the depth of models. – Charles Staats Apr 28 '20 at 20:43
  • @CharlesStaats Can you provide the title of and link to the paper? – nbro Apr 28 '20 at 20:45
  • [DEEP DOUBLE DESCENT: WHERE BIGGER MODELS AND MORE DATA HURT](https://arxiv.org/pdf/1912.02292.pdf) – Charles Staats Apr 28 '20 at 23:43
2

Deeper networks have more learning capacity, in the sense that they can fit more complex data. But, at the same time, they are also more prone to overfitting the training data and may therefore fail to generalize to the test set.

Apart from overfitting, exploding/vanishing gradients are another problem that hampers convergence. This can be addressed by normalizing the initialization and normalizing the intermediate layers; you can then train with backpropagation and stochastic gradient descent (SGD).
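
As an illustration, here is a minimal sketch of those two mitigations in Keras (assuming TensorFlow 2.x; the depth, width and learning rate are arbitrary placeholders): a variance-preserving He initialization plus batch normalization between layers, trained with plain SGD.

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_deep_mlp(depth=20, width=128, n_inputs=100, n_classes=10):
    """Deep fully-connected net with normalized init and normalized intermediate layers."""
    inputs = tf.keras.Input(shape=(n_inputs,))
    x = inputs
    for _ in range(depth):
        # He initialization keeps activation variance roughly constant across layers.
        x = layers.Dense(width, kernel_initializer="he_normal", use_bias=False)(x)
        # Batch normalization re-normalizes the intermediate layers.
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = make_deep_mlp()
# Plain backpropagation with SGD, as described above.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```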

When deeper networks are able to converge, another problem, 'degradation', has been observed: the accuracy saturates and then starts to degrade. This is not caused by overfitting; in fact, adding more layers here leads to higher training error. A possible fix is to use ResNets (residual networks), which have been shown to reduce this degradation.
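
To show the idea, here is a minimal sketch of a residual (skip-connection) block, again assuming TensorFlow/Keras and using dense layers with placeholder sizes rather than the convolutional blocks of the original ResNet paper: the identity shortcut means each block only has to learn a residual correction, which is what helps against degradation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, width):
    """Computes relu(x + F(x)), so the block only has to learn the residual F."""
    shortcut = x
    y = layers.Dense(width, activation="relu")(x)
    y = layers.Dense(width)(y)           # no activation before the addition
    y = layers.Add()([y, shortcut])      # identity skip connection
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(64,))
x = layers.Dense(64, activation="relu")(inputs)
for _ in range(10):                      # stacking many blocks stays trainable
    x = residual_block(x, 64)
outputs = layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```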

2

My experience, from a tactical standpoint, is to start out with a small, simple model first. Train the model and observe the training accuracy, validation loss and validation accuracy. My observation is that, to be a good model, your training accuracy should reach at least 95%. If it does not, try to optimize some of the hyper-parameters. If the training accuracy still does not improve, you may try to incrementally add more complexity to the model. As you add more complexity, the risk of overfitting and of vanishing or exploding gradients becomes higher.

You can detect overfitting by monitoring the validation loss: if the model's accuracy keeps improving but the validation loss starts to rise in later epochs, you are overfitting. At that point, you will have to take remedial action in your model, such as adding dropout layers and using regularizers. The Keras documentation is here.
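
As a rough sketch of those remedies in Keras (the layer sizes, dropout rate and regularization strength below are arbitrary placeholders), you could combine dropout layers, an L2 weight regularizer, and early stopping on the validation loss:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # penalize large weights
    layers.Dropout(0.3),                                     # randomly drop units
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop training once the validation loss starts to rise (the overfitting signal
# described above), keeping the weights from the best epoch.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2, epochs=200,
#           callbacks=[early_stop])  # x_train/y_train: your own data
```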

As pointed out in the answer by nbro, the theory addressing this issue is complex. I highly recommend the excellent tutorial on this subject which can be found on YouTube here.

nbro
Gerry P
1

Speaking very generally, I would say that, with the current state of machine learning, a "more complicated" task requires more trainable parameters. You can increase the parameter count either by increasing the width or by increasing the depth. Again speaking very generally, I would say that, in practice, people have found more success by increasing depth than by increasing width.
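
For intuition, here is a small sketch (assuming TensorFlow/Keras, an arbitrary 100-dimensional input and 10 classes) showing that both knobs, width and depth, raise the trainable-parameter count relative to a small baseline:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mlp(hidden_widths, n_inputs=100, n_classes=10):
    """Fully-connected net with one hidden layer per entry of hidden_widths."""
    inputs = tf.keras.Input(shape=(n_inputs,))
    x = inputs
    for w in hidden_widths:
        x = layers.Dense(w, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

for name, model in [("base",   mlp([64])),       # 1 hidden layer, 64 units
                    ("wider",  mlp([256])),      # same depth, more width
                    ("deeper", mlp([64] * 4))]:  # same width, more depth
    print(name, model.count_params())
```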

However, this depends a lot on what you mean by "more complicated". I would argue that generating something is a fundamentally more complicated problem than merely identifying something. However, a GAN that generates a 4-pixel image will probably be far shallower than the shallowest ImageNet network.

One could also make an argument that the definition of complexity of a deep learning task is "more layers needed == more complicated", in which case it's obvious that by definition, a more complicated task requires a deeper net.