I once heard that the problem of approximating an unknown function can be modeled as a communication problem. How is this possible?
1 Answer
Information-theoretic view of Bayesian learning
Yes, this is indeed possible. More precisely, there is an information-theoretic view of Bayesian learning in neural networks, which can be framed as a communication problem and which covers both maximum a posteriori (MAP) estimation and full Bayesian learning [1], i.e. finding the posterior over the weights of the neural network. Neural networks that maintain a probability distribution over their weights are known as Bayesian neural networks and, in terms of theory, they are closely related to the famous variational auto-encoders.
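In standard notation (my own symbols, not taken from the papers cited below), full Bayesian learning means computing the whole posterior over the weights $w$ given the data $D$ via Bayes' rule, while MAP estimation keeps only the mode of that posterior:
$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}, \qquad w_{\text{MAP}} = \arg\max_w \, p(D \mid w)\, p(w).$$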
The oldest relevant paper (that I am aware of) that interprets Bayesian learning in neural networks as a communication problem is the 1993 COLT paper by Hinton and Van Camp, Keeping the neural networks simple by minimizing the description length of the weights, which introduces variational Bayesian neural networks (sometimes called ensemble learning in papers from the 1990s), i.e. variational inference (VI) applied to neural networks (yes, the same VI used in VAEs). Hinton (yes, the famous Hinton who won the Turing award) and Van Camp write in this paper
We can think in terms of a sender who can see both the input vector and the correct output and a receiver who can only see the input vector. The sender first fits a neural network, of pre-arranged architecture, to the complete set of training cases, then sends the weights to the receiver. For each training case, the sender also sends the discrepancy between the net's output and the correct output. By adding this discrepancy to the output of the net, the receiver can generate exactly the correct output.
You should read this seminal paper if you want to understand all the details.
Another relevant paper is Practical Variational Inference for Neural Networks (2011, NeurIPS) by Graves, who cites the 1993 paper right at the beginning. Essentially, as the title suggests, Graves tries to make VI in neural networks practical.
There are other relevant papers that provide this information-theoretic view of Bayesian learning, such as Variational learning and bits-back coding: an information-theoretic view to Bayesian learning (2004, IEEE Transactions on Neural Networks), but most current papers on Bayesian neural networks, such as Weight Uncertainty in Neural Networks (2015, PMLR), do not (at most, they mention that this interpretation exists, without going into the details).
Minimum description length
To give you a few more details, the information-theoretic view of Bayesian learning in these papers is that of the minimum description length (MDL) principle: Bayesian learning (i.e. the application of Bayes' rule to find the posterior over the parameters of the model) is equivalent to finding a model that gives the shortest description of the data (hence the name MDL), where a description is some code/encoding of the data. In the case of neural networks, this encoding is contained in their weights.
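To make this concrete (this is my sketch of the sender/receiver story above, in my own notation rather than that of any specific paper), the total message length that the sender has to transmit splits into the bits needed to describe the weights and the bits needed to describe the errors of the network on the training cases:
$$L(D) = \underbrace{L(w)}_{\text{cost of the weights}} + \underbrace{L(D \mid w)}_{\text{cost of the errors}}, \qquad L(D \mid w) = -\log_2 p(D \mid w),$$
so a simpler network (cheaper weights) is preferred as long as it does not make the errors much more expensive to encode.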
Given that you want to find the simplest code, this is a direct application of Occam's razor: if you have multiple hypotheses/functions that describe your data (or are consistent with your observations), then choose the simplest one. Occam's razor underlies many other mathematical/ML theories and frameworks, for example, AIXI, a framework for artificial general intelligence developed by Marcus Hutter. Jürgen Schmidhuber is also a big fan of Occam's razor and of compression as a means to act intelligently (see e.g. the speed prior). If you are familiar with deep learning, a light bulb should turn on in your brain now: yes, regularization techniques to avoid over-fitting and improve generalization can also be viewed as an application of Occam's razor principle (see the small derivation below).
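As a familiar example (standard textbook material, not something specific to the papers above): MAP estimation with a zero-mean Gaussian prior over the weights is exactly training with L2 regularization (weight decay), because the prior makes "complex" weight vectors expensive:
$$w_{\text{MAP}} = \arg\max_w \, \big(\log p(D \mid w) + \log p(w)\big) = \arg\min_w \, \big(-\log p(D \mid w) + \lambda \lVert w \rVert_2^2\big),$$
where $\lambda$ absorbs the variance of the Gaussian prior.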
Bits-back coding
How do we find the simplest weights? Bits-back coding, used by the 1993 paper and described in the 2004 and 2011 papers, essentially says that the expected description length is the expected cost of encoding the data given the weights plus the Kullback-Leibler divergence (aka relative entropy: say what?!) between the distribution over the weights that we actually use and some prior (the coding distribution) [1]. Minimizing this total description length is equivalent to minimizing the KL divergence between our distribution and the true posterior, which is zero when the two coincide; but the true posterior is unknown (so how could we compute that KL divergence directly?), so we instead optimize a proxy objective that does not involve it, the Evidence Lower BOund (ELBO), whose negation is the variational free energy. This generally leads to a non-optimal coding (i.e. the approximate posterior you find may not be optimal given the data).
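Spelling this out in symbols (standard variational-inference notation, not tied to any one of the papers above), the expected bits-back description length is the negative ELBO, and it differs from the KL divergence to the true posterior only by a constant that does not depend on $q$:
$$\underbrace{\mathbb{E}_{q(w)}\big[-\log p(D \mid w)\big] + \mathrm{KL}\big(q(w)\,\|\,p(w)\big)}_{-\text{ELBO, i.e. the variational free energy}} = -\log p(D) + \mathrm{KL}\big(q(w)\,\|\,p(w \mid D)\big),$$
so minimizing the description length over $q$ is the same as pushing $q$ towards the true posterior $p(w \mid D)$, and the minimum is attained exactly when they coincide.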
Conclusions
Using MAP estimation or performing (approximate) Bayesian learning in a neural network (which finds one function or a probability distribution over functions, respectively) can be interpreted as finding the MDL, i.e. an optimal or near-optimal encoding of the data that needs to be communicated from a sender to a receiver.
Side notes
Information theory was pioneered by Claude Shannon in his 1948 seminal paper A Mathematical Theory of Communication.
Claude Shannon was also one of the participants at the Dartmouth workshop, which officially started the field of artificial intelligence, so he is one of the fathers of AI, and his impact on the field is definitely huge (although most people are not aware of it; hopefully, this answer will change that).
Further reading
Apart from the papers that I cited above, you may also be interested in Information Theory and its Relation to Machine Learning (2015) by Hu.