Highest Voted 'softmax-policy' Questions - Artificial Intelligence Stack Exchange

7 votes, 1 answer

What happens when you select actions using softmax instead of epsilon greedy in DQN?

I understand the two major branches of RL are Q-Learning and Policy Gradient methods. From my understanding (correct me if I'm wrong), policy gradient methods have inherent exploration built in, as they select actions using a probability…

asked Jun 23 '20 at 16:47

Linsu Han

73
4

4 votes, 1 answer

Eligibility vector for softmax policy with policy gradients

There is this nice result for policy gradients that the gradient of some performance measure, $\nabla v_{\pi_{\theta}}(s_0)$ (here, in the episodic case for the starting state $s_0$ and policy $\pi$, parametrised by some weights $\theta$) is equal…

machine-learning gradient-descent policy-gradients softmax-policy

asked Dec 05 '19 at 19:23

Gregor

203
2
9

3 votes, 1 answer

What is the difference between the $\epsilon$-greedy and softmax policies?

Could someone explain to me what the key difference is between the $\epsilon$-greedy policy and the softmax policy, in particular in the context of the SARSA and Q-Learning algorithms? I understood the main difference between these two algorithms, but…

reinforcement-learning q-learning sarsa epsilon-greedy-policy softmax-policy

asked Jan 21 '20 at 20:39

FraMan

189
2
10

2 votes, 1 answer

Is a learned policy, for a deterministic problem, trained in a supervised process, a stochastic policy?

Suppose I trained a neural network with 4 outputs (one for each action: move down, up, left, and right) to move an agent through a grid (a deterministic problem). The output of the neural network is a probability distribution over the 4 actions, due to the…

neural-networks policies deterministic-policy stochastic-policy softmax-policy

asked Feb 03 '21 at 12:47

Xtalker

21
2
