For questions related to the gradient, a way of packing together all the partial derivative information of a function
Questions tagged [gradient]
44 questions
10
votes
1 answer
What is the relationship between gradient accumulation and batch size?
I am currently training some models using gradient accumulation, since the batches do not fit in GPU memory. Because I am using gradient accumulation, I had to tweak the training configuration a bit. There are two parameters that I tweaked: the…

JVGD
- 1,088
- 1
- 6
- 14
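For readers unfamiliar with the mechanics behind this question, here is a minimal PyTorch-style sketch of gradient accumulation; the toy model, data, and accumulation_steps value are illustrative assumptions, not taken from the question:

```python
import torch
from torch import nn

# Toy setup; the model, data, and hyperparameters below are placeholders.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

accumulation_steps = 4            # effective batch size = 8 * 4 = 32
optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    # Scale the loss so the summed gradients match one large batch of 32.
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()               # gradients accumulate in the .grad buffers
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()          # one parameter update per 4 micro-batches
        optimizer.zero_grad()
```

Because the loss is divided by accumulation_steps, the accumulated gradient approximates the gradient of one large batch, which is why the learning rate is usually tied to the effective (large) batch size rather than the micro-batch size.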
6
votes
1 answer
How is the gradient of the loss function in DQN derived?
In the original DQN paper, page 1, the loss function of the DQN is
$$
L_{i}(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(D)} [(r+\gamma \max_{a'} Q(s',a';\theta_{i}^{-}) - Q(s,a;\theta_{i}))^2]
$$
whose gradient is presented (on page…

Dimitris Monroe
- 171
- 8
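One reasoning step that usually resolves this question: treat the target $y = r + \gamma \max_{a'} Q(s',a';\theta_{i}^{-})$ as a constant (it depends only on the frozen parameters $\theta_{i}^{-}$) and apply the chain rule to the squared error:
$$
\nabla_{\theta_{i}} L_{i}(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(D)} \left[ -2 \left( y - Q(s,a;\theta_{i}) \right) \nabla_{\theta_{i}} Q(s,a;\theta_{i}) \right],
$$
which differs from the expression in the paper only by the constant factor $-2$: the 2 can be folded into the learning rate, and the sign depends on whether one writes the gradient itself or the descent direction.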
5
votes
2 answers
Why is the derivative of this objective function 0 if the policy is deterministic?
In the Berkeley RL class CS294-112 (Fa18, 9/5/18), they mention that the following gradient would be 0 if the policy is deterministic.
$$
\nabla_{\theta} J(\theta)=E_{\tau \sim \pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log…

jonperl
- 153
- 7
5
votes
2 answers
Why is tf.abs non-differentiable in Tensorflow?
I understand why tf.abs is non-differentiable in principle (its derivative is discontinuous at 0), but the same applies to tf.nn.relu, yet for that function the gradient is simply set to 0 at 0. Why is the same logic not applied to tf.abs? Whenever I tried to use…

zedsdead
- 53
- 3
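A quick way to check what TensorFlow actually reports here (a minimal sketch; the behaviour at exactly 0 is the point in dispute, so it is worth inspecting rather than assuming):

```python
import tensorflow as tf

for fn in (tf.abs, tf.nn.relu):
    x = tf.Variable(0.0)
    with tf.GradientTape() as tape:
        y = fn(x)
    # What gradient does TensorFlow register for this op at x = 0?
    print(fn.__name__, tape.gradient(y, x).numpy())
```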
5
votes
2 answers
Is the gradient at a layer independent of the activations of the previous layers?
Is the gradient at a layer (of a feed-forward neural network) independent of the activations of the previous layers?
I read this in a paper titled Mean Field Residual Networks: On the Edge of Chaos (2017). I am not sure to what extent this is true, because…

Snehal Reddy
- 69
- 4
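For reference, the standard backpropagation expression for a fully connected layer $l$ with pre-activations $z^{l} = W^{l} a^{l-1} + b^{l}$ is
$$
\frac{\partial L}{\partial W^{l}} = \delta^{l} \left( a^{l-1} \right)^{\top}, \qquad \delta^{l} = \frac{\partial L}{\partial z^{l}},
$$
so the exact weight gradient at layer $l$ explicitly contains the previous layer's activations $a^{l-1}$; any independence claimed in the paper has to come from additional modelling assumptions (it works in a mean-field setting), not from this formula.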
4
votes
1 answer
Why is it a problem if the outputs of an activation function are not zero-centered?
In this lecture, the professor says that one problem with the sigmoid function is that its outputs aren't zero-centered. The explanation provided by the professor regarding why this is bad is that the gradient of our loss w.r.t. the weights…

Daviiid
- 563
- 3
- 15
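The usual argument, sketched for a single neuron $z = \sum_i w_i x_i + b$ whose inputs $x_i$ are sigmoid outputs of the previous layer and therefore all positive:
$$
\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z} \, x_i ,
$$
so every weight gradient shares the sign of $\partial L / \partial z$. A single update can then only move all of that neuron's weights up together or down together, and reaching a target with mixed signs requires an inefficient zig-zag of updates.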
3
votes
0 answers
Why does training converge when the norm of the gradient increases?
This is from the deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
I thought that, when training converges well, we should be at a local minimum (where the gradient vanishes). But the book says training often does not arrive at a critical point. Could you…

tesio
- 185
- 4
3
votes
1 answer
Why is automatic differentiation still used, if today's computers can calculate symbolic derivatives quite fast?
Today's computers can calculate symbolic derivatives quite fast, so why is automatic differentiation still used? For example, Mathematica can handle algebraic operations with arrays. Doesn't automatic differentiation cause significant overhead?…

asd
- 33
- 2
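Part of the answer is that automatic differentiation is not "symbolic differentiation plus evaluation": it propagates numeric derivative values alongside the ordinary computation, so it never builds or simplifies a symbolic expression and avoids expression swell. A toy forward-mode sketch using dual numbers (illustrative only; this is not how production frameworks are implemented):

```python
# Minimal forward-mode autodiff: each value carries (value, derivative).
class Dual:
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied numerically, not symbolically.
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__


def f(x):
    return 3 * x * x + 2 * x + 1   # f'(x) = 6x + 2


y = f(Dual(2.0, 1.0))              # seed dx/dx = 1
print(y.val, y.der)                # 17.0 14.0
```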
2
votes
0 answers
How to prepare audio data for deep learning?
Audio data is typically an array with the waveform represented by values from -1 to 1. There are two issues with that:
if all values are inverted, e.g. -1 becomes 1 and 1 becomes -1, the audio doesn't change. But if for example I need to find…

nikishev.
- 21
- 3
2
votes
0 answers
GAN : Why does a perfect discriminator mean no gradient for the generator?
In the training of a Generative Adversarial Networks (GAN) system, a perfect discriminator (D) is one which outputs 1 ("true image") for all images of the training dataset and 0 ("false image") for all images created by the generator (G).
I've read…

Soltius
- 221
- 1
- 8
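One compact way to see it (a sketch, writing the discriminator's output as a sigmoid $D = \sigma(x)$ of its logit $x$): the saturating generator objective differentiates to
$$
\frac{\partial}{\partial x} \log\left(1 - \sigma(x)\right) = -\sigma(x),
$$
so when the discriminator is (near-)perfect and assigns $\sigma(x) \approx 0$ to generated images, essentially no gradient flows back through $x$ into the generator.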
2
votes
2 answers
What is meant by "gradient flow" in the context of neural networks?
Several research papers and textbooks (e.g. this) contain the phrase "gradient flow" in the context of neural networks.
I am confused about whether or not there is a rigorous and formal way to understand it. What is the flow referring to here?

hanugm
- 3,571
- 3
- 18
- 50
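For orientation: in the optimization literature, "gradient flow" usually means the continuous-time limit of gradient descent, the ODE
$$
\frac{\mathrm{d}\theta(t)}{\mathrm{d}t} = -\nabla_{\theta} L\left(\theta(t)\right),
$$
of which gradient descent with step size $\eta$, $\theta_{k+1} = \theta_k - \eta \nabla_{\theta} L(\theta_k)$, is the Euler discretization. Deep learning papers also use the phrase informally for how gradients propagate ("flow") backwards through the layers, so which meaning applies depends on the paper.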
2
votes
2 answers
What specifically is the gradient of the log of the probability in policy gradient methods?
I am getting tripped up slightly by how specifically the gradient is calculated in policy gradient methods (just the intuitive understanding of it).
This Math Stack Exchange post is close, but I'm still a little confused.
In standard supervised…

user9317212
- 161
- 2
- 10
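As a concrete instance (the Gaussian policy below is an illustrative assumption, not part of the question): for $\pi_{\theta}(a \mid s) = \mathcal{N}\left(a;\, \mu_{\theta}(s), \sigma^{2}\right)$ with fixed $\sigma$,
$$
\nabla_{\theta} \log \pi_{\theta}(a \mid s) = \frac{a - \mu_{\theta}(s)}{\sigma^{2}} \, \nabla_{\theta} \mu_{\theta}(s),
$$
i.e. the score points in the parameter direction that makes the sampled action more likely, and the policy-gradient estimator simply weights this direction by the return.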
2
votes
1 answer
What is meant by a strong or sufficient gradient for training in this context?
The research paper titled Generative Adversarial Nets mentions that the generator should maximize the function $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$, since the former provides a sufficient gradient while the latter does not.
$$\min_G…

hanugm
- 3,571
- 3
- 18
- 50
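A sketch of the comparison behind that sentence, writing the discriminator's output on a generated sample as $D(G(z)) = \sigma(x)$ with logit $x$:
$$
\frac{\partial}{\partial x} \log\left(1 - \sigma(x)\right) = -\sigma(x), \qquad \frac{\partial}{\partial x} \log \sigma(x) = 1 - \sigma(x).
$$
Early in training the discriminator rejects generated samples easily, so $\sigma(x) \approx 0$: the minimized objective then gives a gradient near $0$, while the maximized one gives a gradient near $1$, which is the "sufficient gradient" the paper refers to.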
2
votes
1 answer
What is $\nabla_{\theta_{k-1}} \theta_{k}$ in the context of MAML?
I am attempting to fully understand the explicit derivation and computation of the Hessian and how it is used in MAML. I came across this blog: https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html.
Specifically, could someone help to…

Blake Camp
- 23
- 2
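For orientation (a standard identity, assuming the usual one-step inner update with learning rate $\alpha$): if the inner loop sets $\theta_{k} = \theta_{k-1} - \alpha \nabla_{\theta_{k-1}} \mathcal{L}(\theta_{k-1})$, then
$$
\nabla_{\theta_{k-1}} \theta_{k} = I - \alpha \nabla^{2}_{\theta_{k-1}} \mathcal{L}(\theta_{k-1}),
$$
the Jacobian of the updated parameters with respect to the previous ones. This is where the Hessian enters the MAML meta-gradient, and it is exactly the term that first-order MAML drops by approximating the Jacobian with $I$.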
2
votes
2 answers
How can we compute the gradient of max pooling with overlapping regions?
While studying backpropagation in CNNs, I can't understand how we can compute the gradient of max pooling with overlapping regions.
This is also a question from this quiz and can also be found in this book.

estamos
- 157
- 1
- 12
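A small NumPy sketch of the backward pass for 1-D max pooling (illustrative; the function name and shapes are assumptions, not taken from the quiz or the book): each upstream gradient is routed to the argmax of its window, and when windows overlap, gradients that land on the same input position are simply summed.

```python
import numpy as np

def maxpool1d_backward(x, grad_out, size, stride):
    """Route each upstream gradient to the argmax of its window; overlapping
    windows accumulate (sum) their gradients at shared input positions."""
    grad_x = np.zeros_like(x)
    n_out = (len(x) - size) // stride + 1
    for i in range(n_out):
        start = i * stride
        window = x[start : start + size]
        grad_x[start + np.argmax(window)] += grad_out[i]
    return grad_x

x = np.array([1.0, 5.0, 2.0, 3.0])
print(maxpool1d_backward(x, grad_out=np.ones(2), size=3, stride=1))
# [0. 2. 0. 0.]  -> 5.0 wins both overlapping windows, so its gradients add up
```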