3

I'm reading about how conditional probability / Bayes' theorem is used in Naive Bayes in Intro to Statistical Learning, but it doesn't seem as "groundbreaking" as it is described to be.

If I'm not mistaken, doesn't every single ML classifier use conditional probability/Bayes in its underlying assumptions, not just Naive Bayes? We are always trying to find the most likely class/label given a set of features, and we can only deduce that using Bayes' rule, since we are (usually) solving for P(class|features) via P(features|class)?
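For reference, the rule I'm referring to is Bayes' theorem applied to classification:

$$p(\text{class} \mid \text{features}) = \frac{p(\text{features} \mid \text{class}) \, p(\text{class})}{p(\text{features})}$$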

nbro
Katsu

2 Answers

3

Conditional probability and Bayes rule are related, but they are not the same thing: you can predict conditional probabilities without using Bayes rule.

So no, not all machine learning classifiers use Bayes rule. Standard neural networks do not use Bayes rule at all, and neither do SVMs or linear classifiers.
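As a minimal sketch of that difference (using scikit-learn; the data and models here are purely illustrative and not something from the question): logistic regression estimates p(class | features) directly by fitting a discriminative objective, while Gaussian Naive Bayes models p(features | class) and p(class) and combines them with Bayes rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy two-class data (purely illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Discriminative: models p(class | features) directly, no Bayes rule involved
logreg = LogisticRegression().fit(X, y)

# Generative: models p(features | class) and p(class), then applies Bayes rule
gnb = GaussianNB().fit(X, y)

x_new = np.array([[1.0, 1.0]])
print(logreg.predict_proba(x_new))  # conditional probabilities, estimated directly
print(gnb.predict_proba(x_new))     # posterior = likelihood x prior / evidence
```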

A contrasting example is Bayesian neural networks (BNNs), which place a probability distribution over the weights and do use Bayes rule during learning and inference; these are not the same as standard neural networks.

As a reference for this statement, I leave the following quote from Section 3.1 of the paper Uncertainty Quantification for Deep Neural Networks: An Empirical Comparison and Usage Guidelines:

BNNs are neural networks with probabilistic weights, instead of scalar weights as in PPNN, and are represented as probability density functions. To train a BNN, first, a prior distribution p(θ) over weights θ has to be defined. Then, given some data D, the posterior distribution p(θ|D), i.e., the trained BNN is inferred using Bayes rule:
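The equation referenced at the end of that quote is the standard Bayes-rule posterior over the weights (reconstructed here; the paper's exact notation may differ):

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$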

Dr. Snoopy
  • In practice, BNNs (that I am most familiar with) do not directly use Bayes rule to learn the posterior. They use variational inference (i.e. they formulate the inference problem as an optimization problem). So, it's not correct to say that the Bayes rule is used during inference and training of (all) BNNs. – nbro Dec 24 '22 at 10:41
  • @nbro What you are saying is basically: there is one kind of BNN that uses VI, which means all BNNs use VI, which is incorrect. Even with VI you use the predictive posterior distribution to make predictions, which implicitly uses Bayes rule. – Dr. Snoopy Dec 24 '22 at 13:53
  • I'm not saying there's one kind of BNN and that all BNNs use VI. I said "BNNs (that I am most familiar with)". Maybe the parentheses were not necessary. In other words, I am saying that some BNNs do not use Bayes rule. Your last paragraph seems to suggest that all BNNs use the Bayes rule, which is incorrect. With VI, you do not really use any Bayes rule. You're solving an optimization problem, which is equivalent (up to a constant) to solving the inference problem using the Bayes rule by finding the integrals (with Monte Carlo methods or in closed form). – nbro Dec 24 '22 at 15:18
  • @nbro No, that is not correct; even with VI there are priors and posteriors. What you learn is the transformation from one to the other. It's an approximation, but it is still Bayesian. – Dr. Snoopy Dec 24 '22 at 15:20
  • Yes, there are priors and posteriors, but that doesn't make it the Bayes rule. You do not apply the Bayes rule. It's Bayesian, yes, because you have priors and posteriors, but you do not apply the Bayes rule $p(y \mid x) = \frac{p(x \mid y) p(y)}{p(x)}$. – nbro Dec 24 '22 at 15:21
  • @nbro It's in the formulation; you might not see it, but it's still there, and they are still Bayesian NNs. Maybe ask yourself: how do you obtain the posterior, given the prior and the model, without any use of Bayes rule? – Dr. Snoopy Dec 24 '22 at 15:23
  • Note: I'm not saying they are not Bayesian. They are! We put priors on the weights and find the posteriors. That's what usually people refer to as "Bayesian". So, yes, I agree with you. However, we do not (directly) apply the Bayes rule to find the posterior. – nbro Dec 24 '22 at 15:24
  • @nbro Priors, evidence, and posteriors, related by Bayes rule, are what it actually means to be Bayesian. – Dr. Snoopy Dec 24 '22 at 15:25
  • @nbro What you are implying is that you can be Bayesian without using Bayes rule and I do not think that makes sense. – Dr. Snoopy Dec 24 '22 at 15:28
  • If you think that doesn't make sense, then you must think that variational BNNs are not Bayesian, because they do not directly multiply the likelihood times the prior divided by the evidence to get the posterior (Bayes rule). Instead, they solve an optimization problem that is equivalent (up to a constant) to minimizing the KL between a variational distribution and the posterior. In practice, when you maximize the ELBO, you maximize the evidence of your data. I don't see how maximizing the ELBO is equal to applying the Bayes rule, and it's not. The Bayes rule does not even have a variational distribution. – nbro Dec 24 '22 at 15:33
  • @nbro Consider where the ELBO comes from; you are just looking at the final derivation and not starting from the beginning, from Bayes rule. – Dr. Snoopy Dec 24 '22 at 15:34
  • The ELBO comes from an approximation of the problem: the KL divergence between the variational distribution and the posterior, which you cannot minimize directly because you don't know the posterior, which is what you want to find in the first place. That's why people invented the ELBO (the identity is sketched after this comment thread). The original optimization problem is just a distance between a variational distribution (e.g. Gaussians) and the unknown posterior. How is this equal to the Bayes rule? – nbro Dec 24 '22 at 15:36
  • I think the most one may say is that, because we're trying to find a posterior, given some prior and likelihood, we're trying to do something that is somehow equivalent to the Bayes rule. – nbro Dec 24 '22 at 15:44
  • @nbro No, again incorrect: you seem to forget that the posterior, prior, likelihood, and evidence are all terms in Bayes rule. And you brought up VI; my answer does not mention VI. There is also the theoretical concept of a Bayesian Neural Network, which by definition uses Bayes rule to propagate inputs to outputs. – Dr. Snoopy Dec 25 '22 at 13:37
  • What is incorrect? I did my master's thesis on Bayesian neural networks, I did an extensive review of the literature (not just VI), I derived many equations (including the ELBO), and I am very familiar with VAEs, as can be confirmed by the fact that I am the top answerer on this site on the topic. Moreover, I didn't say that all BNNs don't use the Bayes rule. In fact, I said that **some BNNs** (read above) don't use the Bayes rule (directly), which is true. You can go on forever defending your thesis, but ok, at this point, this conversation is useless. – nbro Dec 25 '22 at 13:42
  • Yes, it is useless. I teach this stuff, I write papers, and it's my main research topic at the post-PhD level. And you only fixated on the ELBO, without mentioning the predictive posterior distribution, which is literally Bayes rule in integral form and is used to make predictions with BNNs. – Dr. Snoopy Dec 25 '22 at 13:51
  • Yes, I know, I read your profile. You're teaching it incorrectly in my view and you don't fully understand everything. So, to conclude: 1) The Bayes rule is: $p(y \mid x) = \frac{p(x \mid y) p(y)}{p(x)}$, 2) If you know it, variational BNNs usually optimize the ELBO, so they do not **directly** apply the Bayes rule: optimizing the ELBO is not the same thing as applying the Bayes rule. 3) Variational BNNs make use of a variational distribution, which is not even a notion in Bayes rule. 4) I am not saying that all BNNs are variational, so I am not saying that all BNNs don't use the Bayes rule. – nbro Dec 25 '22 at 13:55
  • 5) I am not saying your answer is really incorrect, but just a bit misleading. 6) I agree that one can consider variational BNNs Bayesian because you learn a posterior from a prior and likelihood. However, variational BNNs don't use the Bayes rule directly (I'm just saying what is completely obvious from the definition). – nbro Dec 25 '22 at 13:55
  • 7) You keep on saying that I am incorrect, while you don't fully explain why. You keep saying only because there are priors, likelihoods and posteriors that automatically means we're using the Bayes rule, but to use the Bayes rule, you need to multiply the likelihood by the prior (divided by the evidence). How do you explain the existence of a variational distribution? Where is the variational distribution in the Bayes rule? Nowhere. – nbro Dec 25 '22 at 14:00
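For readers following this thread, here is a sketch of the identity both commenters are referring to (standard variational-inference algebra, added for clarity; it is not a claim about any particular BNN implementation). For a variational distribution $q(\theta)$,

$$\mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid D)\big) = \log p(D) - \underbrace{\Big(\mathbb{E}_{q(\theta)}\big[\log p(D \mid \theta)\big] - \mathrm{KL}\big(q(\theta) \,\|\, p(\theta)\big)\Big)}_{\text{ELBO}(q)}$$

so maximizing the ELBO is equivalent to minimizing the KL divergence to the Bayes-rule posterior $p(\theta \mid D) = p(D \mid \theta)\, p(\theta) / p(D)$, even though that posterior is never computed explicitly.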
2

Probability is one way to solve classification problems, but there are others. For example, in clustering and in the k-nearest-neighbor approach we analyze the position of the current data point relative to its neighboring points to classify it (sketched below). Also, in a decision tree classifier, information gain is the core concept used to classify.
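As a minimal sketch of the k-nearest-neighbor idea (plain NumPy, purely illustrative), the label is chosen by majority vote among the closest training points, with no likelihoods, priors, or Bayes rule involved:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels: no probabilities or Bayes rule needed
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two small clusters labeled 0 and 1 (purely illustrative)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.15, 0.1])))  # -> 0
print(knn_predict(X_train, y_train, np.array([2.05, 2.0])))  # -> 1
```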

oseekero