
I have seen that, usually, the dropout layer is used differently in training and evaluation modes, i.e. it is recommended during training but not during evaluation/testing.

Dropout removes a few nodes at random so that the model does not end up with co-adaptation. But, logically, if you are using a layer during training and not during evaluation/testing, shouldn't the results be inconsistent? How/why do we achieve the same/similar results even though we are skipping a layer altogether?

nbro
prat__

1 Answer


How/why do we achieve the same/similar results though we are skipping a layer altogether

Dropout is not a layer, even though deep learning libraries implement it as a layer module for convenience.

Why do we achieve the same results? We don't, and that's exactly why dropout is applied only during training and not during testing. The fact that results change is also the core idea of dropout as a regularization technique. When we train a model we want it to be robust, i.e. similar input data should lead to similar predictions, but due to overfitting this is almost never the case. Overfitting comes from several sources; the one addressed by dropout is some weights of a model becoming too large while others become too small.
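To make the train/test asymmetry concrete, here is a minimal sketch of the "inverted dropout" scheme most libraries use (a generic illustration, not any specific library's implementation): during training each value is zeroed with probability $p$ and survivors are scaled by $1/(1-p)$, so the *expected* activation matches evaluation mode, where the input passes through untouched.

```python
import random

def dropout(values, p, training):
    """Inverted dropout: during training, zero each value with probability p
    and scale the survivors by 1/(1-p) so the expected activation matches
    evaluation mode. During evaluation, dropout is a no-op (identity)."""
    if not training:
        return list(values)  # evaluation/testing: nothing is dropped
    return [0.0 if random.random() < p else v / (1.0 - p) for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, p=0.5, training=True))   # some entries zeroed, survivors scaled by 2
print(dropout(activations, p=0.5, training=False))  # passed through unchanged
```

Because of the $1/(1-p)$ scaling, the training-time output agrees with the deterministic test-time output *on average*, even though any single training pass differs.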

Let's say our model is a simple equation like:

$w_1 x_1 + w_2 x_2 + b = y$

Where $x_1, x_2$ are two features and $w_1, w_2$ are the associated weights. It might be that the model starts overfitting: for example, $w_1$ might become too large and $w_2$ too small, in which case the model learns to focus only on feature $x_1$, ignoring $x_2$. By randomly "dropping" $w_1$, we force the model to also treat $x_2$ as a valuable feature, preventing $w_2$ from becoming too small. Because of the randomness of dropout, the weights converge to a solution that is good for both of them, not just one, so when applying both weights the prediction will ideally be more robust than when using only part of them. Of course, in real use cases models never converge to a perfect minimum where part of the weights leads to exactly the same predictions as the whole model, so in the test phase dropout is disabled to guarantee the same prediction for the same instance every time.
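A quick numeric sketch of this toy model (the weight values are hypothetical, chosen just to exaggerate the overfitting scenario): when the model leans almost entirely on $w_1$, dropping that weight produces a large error, and that error is precisely the training signal pushing $w_2$ to carry its share.

```python
def predict(w1, w2, b, x1, x2, drop_w1=False, drop_w2=False):
    # Dropping a weight zeroes its term, mimicking dropout on that unit.
    t1 = 0.0 if drop_w1 else w1 * x1
    t2 = 0.0 if drop_w2 else w2 * x2
    return t1 + t2 + b

# Overfit scenario: the model relies almost entirely on x1.
w1, w2, b = 5.0, 0.01, 0.0
x1, x2, target = 1.0, 1.0, 5.0

full = predict(w1, w2, b, x1, x2)                    # close to the target
dropped = predict(w1, w2, b, x1, x2, drop_w1=True)   # far from the target
print(target - full, target - dropped)  # the second error is large, so gradient descent must grow w2
```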

There is, though, a nice example of dropout being used at test time in Deep Active Learning. Keeping dropout active in the test phase can be leveraged to perform Monte Carlo sampling of different probability scores for a single instance. The sampled probabilities can then be used to compute statistics like the standard deviation, which serves as an approximation of the model's (epistemic) uncertainty about that particular instance.
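A minimal sketch of that Monte Carlo idea (a made-up two-unit "network" with hypothetical weights, not a real architecture): dropout is left active at prediction time, so repeated forward passes on the same input give different scores, and the spread of those scores serves as the uncertainty estimate.

```python
import random
import statistics

def stochastic_predict(x, p=0.5):
    """Toy forward pass with dropout left ACTIVE at test time (MC dropout).
    Each hidden unit is dropped with probability p and survivors are
    rescaled, so repeated calls on the same x return different scores."""
    h1 = 0.0 if random.random() < p else (0.8 * x) / (1.0 - p)
    h2 = 0.0 if random.random() < p else (0.4 * x) / (1.0 - p)
    return h1 + h2

random.seed(42)
samples = [stochastic_predict(1.0) for _ in range(100)]
mean = statistics.mean(samples)     # Monte Carlo estimate of the prediction
spread = statistics.stdev(samples)  # larger spread -> more model uncertainty on this instance
print(mean, spread)
```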

Edoardo Guerriero