How does the Dempster-Shafer theory differ from Bayesian reasoning? How do these two methods handle uncertainty and compute posterior distributions?
1 Answer
Dempster-Shafer Theory and Bayesian Networks are both techniques that rose to prominence within AI in the 1970s and 1980s, as AI started to seriously grapple with uncertainty in the world and move beyond the sterile, toy environments that most early systems worked in.
In the 1970s, and perhaps even earlier, it became apparent that direct applications of probability theory to AI were not going to work out because of the curse of dimensionality: as more variables needed to be considered in a given problem, the amount of storage space and processing time required grew exponentially. This led to a search for new methods of handling uncertainty within AI.
Bayesian Networks and Bayesian Learning remained firmly rooted in probabilistic reasoning, but allowed subjective priors to be assigned, incorporating expert knowledge, and allowed problems to be factored into graphical structures, which avoids the curse of dimensionality in most cases.
Dempster-Shafer Theory went a step further, permitting belief assignments that are not well-formed probability distributions as a way to capture uncertainty. So, for example, the beliefs assigned to all possible events are not required to add up to 1, because there might be events we don't know about. While on the surface this might seem reasonable, most modern AI researchers view it as a deeply flawed approach. Much of this view stems from Cheeseman's criticism of DS and other non-probabilistic methods; Judea Pearl was another harsh and influential critic of DS Theory.
The basic difference in the fusion of new information is that in Bayesian Networks, after observing new evidence $$E$$, we apply Bayes' rule:
$$ P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)} $$
to yield a posterior for every hypothesis.
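Here is a minimal sketch of that update over a discrete hypothesis space. The hypotheses and numbers are hypothetical, chosen only to show the mechanics:

```python
def bayes_update(prior, likelihood):
    """Return P(H | E) given P(H) and P(E | H) for each hypothesis H."""
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())   # P(E), the normalizing constant
    return {h: p / evidence for h, p in unnormalized.items()}

prior = {"H1": 0.7, "H2": 0.3}              # P(H), a subjective prior
likelihood = {"H1": 0.2, "H2": 0.9}         # P(E | H), from some sensor model
print(bayes_update(prior, likelihood))      # {'H1': 0.341..., 'H2': 0.658...}
```

Note that every posterior value is a true probability: the results are non-negative and sum to 1 by construction.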
In DS theory, we look for overlap between the worlds suggested by the new evidence and those suggested by the old data. This can lead to nonsensical results.
Here's an example:
Our prior belief is that our robot is located at position (0,1) with probability 0.95, and at position (0,2) with probability 0.05.
A new signal appears. The signal indicates that the robot is at position (0,0) with probability 0.95, and at position (0,2) with probability 0.05.
Under Bayes' rule, we consider the probability that this signal would be generated under each of our original hypotheses, and the probability of observing the signal at all, as shown in the equation above. Under DS theory, we could do the same thing.
However, DS theory provides a second way to interpret the signal: as a second prior distribution rather than as evidence. We can then combine this second prior with the first to compute a sort of joint prior:
$$ P(H_{A,B}) \propto P(H_A)\, P(H_B) $$
That is, the "probability" of a hypothesis after the fusion (it's not always a true probability, which is one of the criticisms) is proportional to the product of its "probabilities" under each of the separate priors.
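Here's a rough sketch of that combination step in Python. The `combine` helper and its restriction to singleton hypotheses are a simplification for illustration; the full Dempster rule of combination operates on sets of hypotheses and intersects their focal elements:

```python
def combine(m1, m2):
    """Combine two belief assignments over singleton hypotheses:
    multiply per-hypothesis beliefs, then renormalize away the conflict."""
    hypotheses = set(m1) | set(m2)
    joint = {h: m1.get(h, 0.0) * m2.get(h, 0.0) for h in hypotheses}
    conflict = 1.0 - sum(joint.values())    # belief assigned to incompatible pairs
    if conflict >= 1.0:
        raise ValueError("Total conflict: the sources share no common hypothesis.")
    return {h: b / (1.0 - conflict) for h, b in joint.items()}
```

The renormalization by `1 - conflict` silently discards all of the belief the two sources placed on incompatible hypotheses, which is exactly where the trouble below comes from.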
In the example above, this gives a wacky result: the "joint prior" says the robot is at (0,2) with probability 1.0. The products for (0,1) and (0,0) are both zero, because each has zero support under one of the two priors, so the only surviving hypothesis is (0,2), with 0.05 × 0.05 = 0.0025, which renormalizes to 1.0. All of the belief collapses onto the one position both sources considered nearly impossible (this is essentially Zadeh's well-known counterexample to Dempster's rule of combination). This and other problems are why this mode of information combination has mostly been abandoned. There are more examples on the Wikipedia page for DS theory.
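Running the sketch above on the robot example (the `m_prior` and `m_sensor` names are just illustrative) reproduces the collapse:

```python
m_prior  = {(0, 1): 0.95, (0, 2): 0.05}   # original belief
m_sensor = {(0, 0): 0.95, (0, 2): 0.05}   # new signal, treated as a second prior
print(combine(m_prior, m_sensor))
# {(0, 1): 0.0, (0, 2): 1.0, (0, 0): 0.0}
# All belief lands on (0, 2), the only hypothesis with nonzero support
# under both sources, even though each source rated it at only 0.05.
```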
I think there's a more detailed discussion of this in Part IV of Russell & Norvig, at the end of one of the chapters on uncertainty.
