
BACKGROUND: Ensemble classifiers are said to reduce bias by taking an "average" of predictions of several base classifiers that comprise the ensemble. However, I am uncertain if this necessarily means that they can increase accuracy. My intuition tells me that the ensemble classifier should perform no better and possibly even worse than the best base classifier in the ensemble. This seems especially true for bagging approaches which use strong classifiers anyway. When you have a "star performer", it just doesn't seem to make intuitive sense to "dilute" its performance with subpar performers.

Nonetheless, from my novice-level reading, it seems that ensembles can be as good or possibly even better than all of the individual base classifiers, but I'm still not clear why.

QUESTION: How can an ensemble be more accurate than the best base classifier in that ensemble?

Snehal Patel

1 Answer


Let's consider binary classification.

Imagine you have an ensemble made up of $K$ models. Assume each model has exactly $51\%$ accuracy, and that the models' errors are uncorrelated with one another. The ensemble simply takes a majority vote.

For $K = 11$, what is the probability that the majority vote is correct on a given example? Note: this can be modeled as flipping an unfair coin $K$ times and calculating the probability that the favored side makes up the majority of the flips.

$P(\text{correct} \geq 6) = \sum_{k=6}^{11}\binom{11}{k}(0.51)^k(0.49)^{11-k} \approx 0.53$

What about when $K = 101$?

$P(\text{correct} \geq 51) = \sum_{k=51}^{101}\binom{101}{k}(0.51)^k(0.49)^{101-k} \approx 0.58$
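Those two sums are easy to check numerically. Here is a minimal Python sketch (the function name `p_majority_correct` is just for illustration; it assumes Python 3.8+ for `math.comb`):

```python
from math import comb

def p_majority_correct(K: int, p: float = 0.51) -> float:
    """Probability that at least (K+1)/2 of K independent models,
    each correct with probability p, vote for the right label (K odd)."""
    threshold = K // 2 + 1  # smallest majority: 6 of 11, 51 of 101, ...
    return sum(comb(K, k) * p**k * (1 - p)**(K - k)
               for k in range(threshold, K + 1))

print(round(p_majority_correct(11), 2))   # -> 0.53
print(round(p_majority_correct(101), 2))  # -> 0.58
```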

If you do the math, you will find that as $K$ approaches $\infty$ (keeping $K$ odd so there are no ties), the probability of the ensemble being correct asymptotically approaches $100\%$.

$\lim_{K\rightarrow\infty} P\left(\text{correct} \geq \frac{K+1}{2}\right) = \lim_{K\rightarrow\infty}\sum_{k=\frac{K+1}{2}}^{K}\binom{K}{k}(0.51)^k(0.49)^{K-k} = 1$
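To see the limit numerically rather than take it on faith, the same tail probability can be evaluated for growing $K$; the sketch below assumes `scipy` is available and uses its binomial survival function (`binom.sf(m, K, p)` gives $P(X > m)$):

```python
from scipy.stats import binom

# P(X >= (K+1)/2) for X ~ Binomial(K, 0.51), i.e. the majority vote is correct.
for K in (11, 101, 1001, 10001):
    print(K, binom.sf((K + 1) // 2 - 1, K, 0.51))
# Roughly: 0.53, 0.58, 0.74, 0.98 -- creeping toward 1 as K grows.
```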

Now, we made 2 key assumptions:

  1. Errors are uncorrelated between models
  2. Each model has more than $50\%$ accuracy

In practice these assumptions often do not hold, especially the first one. But when they do hold, this simple probability analysis shows that the ensemble will be more accurate than any individual model.
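To illustrate how much the first assumption matters, here is a rough Monte Carlo sketch (using `numpy`; the sample size, seed, and $K = 101$ are arbitrary choices) comparing independent errors with fully correlated ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, K, p = 100_000, 101, 0.51

# Uncorrelated case: each of the K models is independently correct with prob. p.
independent_correct = rng.random((n_samples, K)) < p
acc_independent = (independent_correct.sum(axis=1) > K // 2).mean()

# Fully correlated case: all K models copy one shared 51%-accurate prediction,
# so the majority vote is exactly as accurate as a single model.
shared = rng.random((n_samples, 1)) < p
correlated_correct = np.repeat(shared, K, axis=1)
acc_correlated = (correlated_correct.sum(axis=1) > K // 2).mean()

print(acc_independent)  # roughly 0.58, matching the binomial sum above
print(acc_correlated)   # roughly 0.51, no better than one model
```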

chessprogrammer
  • Note that your models having more than 50% accuracy is a very weak assumption because any model that does worse than random can be used to construct a model that does better than random. – Oscar Smith Nov 28 '22 at 21:34
  • @OscarSmith, can you explain how that is possible, maybe along with a concrete example? – Snehal Patel Nov 28 '22 at 22:15
  • You take the model and make a new model that predicts the opposite. (For the $n$-way version it's a bit harder, but you can make a model that chooses a random class other than the one your original model predicted.) – Oscar Smith Nov 28 '22 at 22:29
  • Sorry, I may have missed something here. Is it implied somewhere in the first comment (Note that your models having more than 50% accuracy is a very weak assumption because any model that does worse than random can be used to construct a model that does better than random.) that it is referring to a multiclass situation? – Snehal Patel Nov 28 '22 at 23:52
  • The generalization of the 50% accuracy condition (which is better than random) is the weak assumption. – Oscar Smith Dec 04 '22 at 19:46