
In Chapter 9, section 9.1.6, Raul Rojas describes how committees of networks can reduce the prediction error: $N$ networks of the same architecture are trained on the same data and their outputs are averaged.

If $f_i$ are the functions approximated by the $N$ neural nets, then:

$$ Q=\left|\frac{1}{N}(1,1, \ldots, 1) \mathbf{E}\right|^{2}=\frac{1}{N^{2}}(1,1, \ldots, 1) \mathbf{E} \mathbf{E}^{\mathrm{T}}(1,1, \ldots, 1)^{\mathrm{T}}\tag{9.4}\label{9.4} $$

is the quadratic error of the average of the networks, where

$$ \mathbf{E}=\left(\begin{array}{cccc} e_{1}^{1} & e_{2}^{1} & \cdots & e_{m}^{1} \\ \vdots & \vdots & \ddots & \vdots \\ e_{1}^{N} & e_{2}^{N} & \cdots & e_{m}^{N} \end{array}\right), $$

and the $i$-th row of $\mathbf{E}$ is the error vector of the $i$-th network over the whole training set, i.e. $e_{j}^{i} = f_i(\mathbf{x}^{j}) - t_j$ for each of the input-output pairs $\left(\mathbf{x}^{1}, t_{1}\right), \ldots,\left(\mathbf{x}^{m}, t_{m}\right)$ used in training.
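As I understand the argument in the book, this is why uncorrelated errors matter: the off-diagonal entries of $\mathbf{E} \mathbf{E}^{\mathrm{T}}$ are the dot products $\mathbf{e}^{i} \cdot \mathbf{e}^{j}$ of different networks' error vectors, so if the errors are uncorrelated these cross terms (approximately) vanish and only the diagonal survives:

$$ Q \approx \frac{1}{N^{2}} \sum_{i=1}^{N}\left\|\mathbf{e}^{i}\right\|^{2} = \frac{1}{N} \cdot \frac{1}{N} \sum_{i=1}^{N}\left\|\mathbf{e}^{i}\right\|^{2}, $$

i.e. the committee's quadratic error is the mean individual error reduced by a factor of $N$.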

Is there a way to ensure that the errors of one neural network are uncorrelated with the errors of the others?

Raul Rojas says that the residual errors are uncorrelated only when $N$ is not too large (i.e. $N < 4$). Why is that?

EmmanuelMess
  • Maybe these notes [https://people.math.umass.edu/~johnpb/s706/notes_mixed.pdf](https://people.math.umass.edu/~johnpb/s706/notes_mixed.pdf) could be useful. – nbro Jan 20 '21 at 13:51
  • I think that the tag that you actually want to use is [tag:ensemble-learning] and not [tag:committees-of-networks], which I've never heard of, but it's possible it exists or has been used to refer to this approach. – nbro Jan 20 '21 at 16:12
  • @nbro Section 9.1.6 is dedicated to it. Ensembles seem to mix different networks or models; committees are for identical networks trained with the same data. – EmmanuelMess Jan 20 '21 at 16:31
  • @nbro Thanks for the pdf, but I don't understand how I should use that information to ensure uncorrelated errors. Is there an algorithm name I can look up (one that uses the unconditional/marginal model or the nonlinear mixed model, since those are the nonlinear methods)? – EmmanuelMess Jan 20 '21 at 16:35
  • 2
    I don't think that [ensemble learning](http://www.scholarpedia.org/article/Ensemble_learning) is restricted to different networks. Nowadays, it's typically used to indicate that you train multiple models and then combine them somehow, but I'm also not an expert in ensemble learning, to be honest. In any case, I've just looked at that chapter 9.1.6 again and he cites [399], which is a paper entitled ["When Networks Disagree: Ensemble Methods for Hybrid Neural Networks"](https://apps.dtic.mil/sti/pdfs/ADA260045.pdf). – nbro Jan 20 '21 at 16:36
  • The book you're reading is not very new (although still a good book, as far as I'm concerned), so, at the time it was written, maybe "ensemble learning" was not yet a widely used standard term. – nbro Jan 20 '21 at 16:36
  • 1
    Regarding the link above, I don't really know if it could help you. It talks about the correlation of errors in linear regression, so I thought it could be useful, at least, to understand what R. Rojas meant by "correlation". You could also interpret neural networks as performing non-linear regression, so that's another reason why I provided the link to that pdf, which, to be honest, I have only skimmed through. – nbro Jan 20 '21 at 16:39
  • It's a theoretical question about a practical approach. The payoff from this technique would tend to depend in part on the data, the network architecture, the features extracted, etc.; in other words, the answer would mostly come from experiments in particular situations. Basically, there is some randomness in training, so different nets may "find" or "emphasize" different actual features of the data. – vzn Jan 20 '21 at 20:04
  • @vzn Should I try to make the starting weights far apart from network to network? Something like randomly nudging the weights so that they differ from the other networks' weights? – EmmanuelMess Jan 20 '21 at 20:48
  • Yes, try starting with different weights and possibly train the nets on different samples of the same data, i.e. try data segregation ideas. Also, Rojas was saying that it can work for as low as N=2 or N=3, but that doesn't mean you have to keep N in that range; the point is that N varies for each situation, so just test to find what works well / is practical. You can ping me in [chat] or this chat room for further discussion: https://chat.stackexchange.com/rooms/9446/theory-salon – vzn Jan 20 '21 at 21:38
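
Not from the book, just a minimal sketch of the recipe suggested in the comments above (different random initializations plus bootstrap resamples of the same data), assuming scikit-learn's `MLPRegressor` as the base network; it also prints the correlation matrix of the residuals, so you can see how uncorrelated the errors actually come out:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
t = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)  # toy 1-D regression target

N = 3                                               # committee size
nets, residuals = [], []
for i in range(N):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap resample of the same data
    net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000,
                       random_state=i)              # different random init per network
    net.fit(X[idx], t[idx])
    nets.append(net)
    residuals.append(net.predict(X) - t)            # e^i: this net's errors on all m pairs

E = np.vstack(residuals)                            # the N x m matrix E of eq. (9.4)
committee = np.mean([net.predict(X) for net in nets], axis=0)

print("mean individual quadratic error:", np.mean(np.sum(E ** 2, axis=1)))
print("committee quadratic error      :", np.sum((committee - t) ** 2))
print("residual correlation matrix:\n", np.corrcoef(E))
```

With (approximately) uncorrelated residuals, the committee error should approach the mean individual error divided by $N$; in practice the residuals tend to be positively correlated, so the gain is smaller.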

0 Answers