For questions about AI theory that relies on knowledge of a probability distribution over one or more variables. Such a distribution may be discrete (for example, bucketed into quartiles, octiles, or percentiles) or continuous, given by some closed-form (algebraic) expression. Probability distributions are central to planning, natural language processing, and many other AI objectives.
Questions tagged [probability-distribution]
78 questions
8 votes · 1 answer
What are the main benefits of using Bayesian networks?
I have some trouble understanding the benefits of Bayesian networks.
Am I correct that the key benefit of the network is that one does not need to use the chain rule of probability in order to calculate joint distributions?
So, using the chain…

asked by Sebastian Dine
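For the Bayesian-network question above, a small worked example (my illustration, not from the question: three binary variables with a hypothetical structure $A \to B$ and $A \to C$) of how the network's conditional independencies shrink the joint distribution:

$$P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid A),$$

which needs only $1 + 2 + 2 = 5$ parameters instead of the $2^3 - 1 = 7$ required for an unconstrained joint; the chain rule still applies, but the graph tells you which conditioning variables can be dropped.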
7 votes · 2 answers
Why is KL divergence used so often in Machine Learning?
The KL divergence is quite easy to compute in closed form for simple distributions (such as Gaussians), but it has some not-very-nice properties. For example, it is not symmetric (thus it is not a metric) and it does not respect the triangular…

asked by Federico Taschin
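For the KL-divergence question above, a minimal sketch in plain NumPy (the function name `kl_gauss` is mine; the formula is the standard closed form for univariate Gaussians) illustrating the asymmetry mentioned in the excerpt:

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Swapping the arguments gives a different value, so KL is not symmetric
# and therefore not a metric.
print(kl_gauss(0.0, 1.0, 1.0, 2.0))  # KL(p || q), roughly 0.44
print(kl_gauss(1.0, 2.0, 0.0, 1.0))  # KL(q || p), roughly 1.31
```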
7 votes · 1 answer
What loss function to use when labels are probabilities?
What loss function is most appropriate when training a model with target values that are probabilities? For example, I have a 3-output model. I want to train it with a feature vector $x=[x_1, x_2, \dots, x_N]$ and a target $y=[0.2, 0.3, 0.5]$.
It…

asked by Thomas Johnson
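For the probability-target question above, one common answer is the cross-entropy loss, which remains well defined for soft labels; a minimal NumPy sketch (the predicted vector is a made-up example):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """H(y_true, y_pred) = -sum_i y_true[i] * log(y_pred[i]), valid for soft targets."""
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))

y = np.array([0.2, 0.3, 0.5])    # target probabilities from the question
p = np.array([0.25, 0.25, 0.5])  # hypothetical softmax output of the 3-output model
print(cross_entropy(y, p))       # minimized over p exactly when p equals y
```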
5 votes · 1 answer
Many of the best probabilistic models represent probability distributions only implicitly
I am currently studying Deep Learning by Goodfellow, Bengio, and Courville. In section 5.1.2, The Performance Measure, $P$, the authors say the following:
The choice of performance measure may seem straightforward and objective, but it is often…

asked by The Pointer
5 votes · 1 answer
Why is the Jensen-Shannon divergence preferred over the KL divergence in measuring the performance of a generative network?
I have read articles on why the Jensen-Shannon divergence is preferred over the Kullback-Leibler divergence for measuring how well a distribution mapping is learned in a generative network, because the JS divergence better measures distribution similarity…

asked by ashenoy
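For the JS-vs-KL question above, a minimal sketch with discrete distributions (function names are mine) showing the two properties usually cited: JS is symmetric and bounded by $\log 2$, while KL is neither:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.9, 0.1])
q = np.array([0.1, 0.9])
print(kl(p, q), kl(q, p))  # asymmetric
print(js(p, q), js(q, p))  # symmetric, never exceeds log(2) ~ 0.693
```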
4 votes · 1 answer
Why do we sample vectors from a standard normal distribution for the generator?
I am new to GANs. I noticed that everybody generates a random vector (usually 100 dimensional) from a standard normal distribution $N(0, 1)$. My question is: why? Why don't they sample these vectors from a uniform distribution $U(0, 1)$? Does the…

asked by dato nefaridze
4 votes · 1 answer
In deep learning, do we learn a continuous distribution based on the training dataset?
At some level, perhaps not always end-to-end, deep learning always learns a function: essentially a mapping from a domain to a range. In most cases, both the domain and the range are multivariate.
So, when a model learns a…

asked by ashenoy
4 votes · 1 answer
How are the parameters of the Bernoulli distribution learned?
In the paper Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask, they learn a mask for the network by setting the mask parameters as $M_i = \text{Bern}(\sigma(v_i))$, where $M$ is the parameter mask ($f(x; \theta, M) = f(x; M \odot \theta)$),…

asked by mshlis
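For the supermask question above, a minimal NumPy sketch (not the paper's code; the values of $v$ are made up) of how a binary mask can be sampled from learnable scores:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
v = rng.normal(size=5)                      # learnable per-weight scores v_i
p = sigmoid(v)                              # Bernoulli parameters sigma(v_i)
mask = (rng.random(5) < p).astype(float)    # M_i ~ Bern(sigma(v_i))
print(p, mask)
# Sampling is not differentiable, so in practice gradients are passed back
# to v through a relaxation or a straight-through-style estimator.
```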
3 votes · 0 answers
Relation between SDE diffusion and DDPM/DDIM
Background & Definitions
In DDPM, the diffusion backward step is described as follows (where $z\sim \mathcal{N}(0,I)$ and $x_{T}\sim \mathcal{N}(0,I)$):
and in DDIM we have
while in the SDE formulation (from the Fokker-Planck equation) the step…

asked by snatchysquid
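For background on the question above, the DDPM reverse step is usually written in the following standard form (Ho et al., 2020); this is offered only as a reference point, since the exact expressions are elided in the excerpt:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I).$$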
3 votes · 1 answer
How can I make an MNIST digit recognizer that rejects out-of-distribution data?
I've built an MNIST digit-recognition neural network.
When you feed it images that are completely unlike its training data, it still tries to classify them as digits, and sometimes it confidently classifies nonsense data as a specific digit.
I am…

asked by river
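For the out-of-distribution question above, a minimal baseline sketch (threshold and logits are made up; real OOD detection usually needs more than this) that rejects inputs whose maximum softmax probability is low:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def classify_or_reject(logits, threshold=0.9):
    """Return the predicted digit, or None when the network is not confident."""
    probs = softmax(logits)
    return int(np.argmax(probs)) if probs.max() >= threshold else None

print(classify_or_reject(np.array([0.1, 8.0, 0.2] + [0.0] * 7)))  # peaked -> predicts 1
print(classify_or_reject(np.array([1.0, 1.1, 0.9] + [1.0] * 7)))  # diffuse -> None
```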
3 votes · 1 answer
How can a probability density value be used for the likelihood calculation?
Consider our parametric model $p_\theta$ of an underlying probability distribution $p_{data}$.
Now, the likelihood of an observation $x$ is generally defined as $L(\theta|x) = p_{\theta}(x)$.
The purpose of the likelihood is to quantify how good…

asked by hanugm
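For the likelihood question above, a minimal SciPy sketch (a Gaussian model with hypothetical data) showing that the density value, read as a function of the parameter for fixed observations, is exactly the quantity being maximized:

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.9, 2.1, 2.0, 1.8])  # observed data (made-up values)

def log_likelihood(mu):
    """Log-likelihood of the sample under the model N(mu, 1)."""
    return norm.logpdf(x, loc=mu, scale=1.0).sum()

for mu in [0.0, 1.0, 1.95, 3.0]:
    print(mu, log_likelihood(mu))   # largest near the sample mean (about 1.95)
```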
3 votes · 1 answer
Is this referring to the true underlying distribution, or the distribution of our sample?
I am currently studying the paper Learning and Evaluating Classifiers under Sample Selection Bias by Bianca Zadrozny. In the introduction, the author says the following:
One of the most common assumptions in the design of learning algorithms is…

asked by The Pointer
3 votes · 2 answers
When should one prefer the total variation divergence over the KL divergence in RL?
In RL, both the KL divergence ($D_{KL}$) and the total variation divergence ($D_{TV}$) are used to measure the distance between two policies. I'm most familiar with using $D_{KL}$ as an early-stopping metric during policy updates to ensure the new policy doesn't…

asked by mugoh
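For the KL-vs-TV question above, a minimal NumPy sketch comparing the two on a pair of hypothetical discrete action distributions; TV is always bounded by 1, while KL can explode when one policy puts almost no mass on actions the other policy takes:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

def tv(p, q):
    """Total variation distance: half the L1 distance between the distributions."""
    return 0.5 * np.sum(np.abs(p - q))

old_policy = np.array([0.98, 0.01, 0.01])
new_policy = np.array([0.50, 0.25, 0.25])
print(kl(old_policy, new_policy), tv(old_policy, new_policy))
```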
3 votes · 1 answer
How does $\mathbb{E}$ suddenly change to $\mathbb{E}_{\pi'}$ in this equation?
In Sutton & Barto's book, on page 63 (81 of the PDF):
$$\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s,A_t=\pi'(s)] = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_{t} = s]$$
How does $\mathbb{E}$ suddenly change to…

asked by ZERO NULLS
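For the notation question above, a short intermediate step (my reading of the convention, not a quote from the book): because $\pi'$ picks the action deterministically, conditioning on $A_t = \pi'(s)$ is the same as following $\pi'$ for that one step, so

$$\mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\big] = \sum_{s',\,r} p(s', r \mid s, \pi'(s))\big[r + \gamma v_\pi(s')\big] = \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big].$$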
3 votes · 1 answer
What is the difference between model and data distributions?
Is there any difference between the model distribution and data distribution, or are they the same?

asked by Bhuwan Bhatt