For questions about the concept of reward, for example in the context of reinforcement learning and Markov decision processes. For questions about reward functions, reward design, reward shaping, reward hacking, etc., there are more specific tags, so use those instead of this generic one, unless your question is also about the concept of reward itself.
Questions tagged [rewards]
118 questions
14 votes, 6 answers
What would motivate a machine?
Currently, within the AI development field, the main focus seems to be on pattern recognition and machine learning. Learning is about adjusting internal variables based on a feedback loop.
Maslow's hierarchy of needs is a theory in psychology…

Aleksei Maide
12 votes, 3 answers
Why is the reward in reinforcement learning always a scalar?
I'm reading Reinforcement Learning by Sutton & Barto, and in section 3.2 they state that the reward in a Markov decision process is always a scalar real number. At the same time, I've heard about the problem of assigning credit to an action for a…

Sid Mani
10 votes, 1 answer
What is the difference between expected return and value function?
I've seen numerous mathematical explanations of reward, value functions $V(s)$, and return functions. The reward provides an immediate return for being in a specific state. The better the reward, the better the state.
As I understand it, it can be…
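For reference, the standard definitions (as in Sutton & Barto) separate the three quantities cleanly: the reward $R_{t+1}$ is a single number received at one time step, the return is the discounted sum of future rewards, and the value function is the expected return under the policy:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right].$$
So the return is a random variable attached to a particular trajectory, while the value function averages that random variable over all trajectories the policy can generate from $s$.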

user3168961
10 votes, 2 answers
How do I handle negative rewards in policy gradients with the cross-entropy loss function?
I am using policy gradients in my reinforcement learning algorithm, and occasionally my environment provides a severe penalty (i.e. negative reward) when a wrong move is made. I'm using a neural network with stochastic gradient descent to learn the…
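As a point of reference only (not the asker's code), here is a minimal PyTorch sketch with made-up dimensions (4-dimensional observations, 2 actions) of a REINFORCE-style update in which the cross-entropy term is weighted by the reward, so a negative reward pushes the chosen action's probability down instead of up:

```python
import torch
import torch.nn as nn

# Hypothetical tiny policy network: 4-dim observation -> 2 action logits.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

state = torch.randn(1, 4)      # placeholder observation
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()         # action sampled from the current policy
reward = -5.0                  # severe penalty returned by the environment

# REINFORCE-style loss: -log pi(a|s) * R. With R < 0 the sign flips,
# so gradient descent decreases the probability of the penalized action.
loss = (-dist.log_prob(action) * reward).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```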

jstaker7
8 votes, 2 answers
Why does the "reward to go" trick in policy gradient methods work?
In the policy gradient method, there's a trick to reduce the variance of the policy gradient. We use causality and drop part of the sum over rewards, so that only the rewards that come after an action are taken into account (see here…
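For reference, the estimator in question is usually written (this is the standard form, not a quote from the linked source) as
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right).$$
The terms that get dropped are the rewards earned before time $t$; they do not depend on the action chosen at time $t$, so removing them leaves the gradient unbiased while reducing its variance.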

Konstantin Solomatov
7 votes, 2 answers
Is there any difference between reward and return in reinforcement learning?
I am reading Sutton and Barto's book on reinforcement learning. I thought that reward and return were the same thing.
However, in Section 5.6 of the book (first paragraph, third line), it is written:
Whereas in Chapter 2 we averaged rewards, in…

SJa
6 votes, 2 answers
What is the difference between a loss function and reward/penalty in Deep Reinforcement Learning?
In Deep Reinforcement Learning (DRL), I am having difficulty understanding the difference between a loss function and a reward/penalty, and how the two are integrated in DRL.
Loss function: Given an output of the model and the ground truth,…
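One way to frame the distinction (a sketch of the usual framing, not quoted from an answer): the reward is part of the problem, supplied by the environment and defining the objective, whereas the loss is part of the solution, a differentiable surrogate that the optimizer actually minimizes. For example, in Q-learning with function approximation,
$$J(\pi) = \mathbb{E}\left[\sum_{t} \gamma^t R_{t+1}\right] \quad \text{(objective built from rewards)}, \qquad L(\theta) = \left(r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a)\right)^2 \quad \text{(loss minimized by gradient descent)}.$$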

Theo Deep
6 votes, 1 answer
Reward interpolation between MDPs. Will an optimal policy on both ends stay optimal inside the interval?
Say I've got two Markov Decision Processes (MDPs):
$$\mathcal{M_0} = (\mathcal{S}, \mathcal{A}, P, R_0),\quad\text{and}\quad\mathcal{M}_1 = (\mathcal{S}, \mathcal{A}, P, R_1)$$
Both have the same set of states and actions, and the transition…
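Writing the interpolation explicitly (assuming the natural linear form, which the excerpt does not show), the question is whether a policy optimal in both $\mathcal{M}_0$ and $\mathcal{M}_1$ remains optimal in
$$\mathcal{M}_\lambda = (\mathcal{S}, \mathcal{A}, P, R_\lambda), \qquad R_\lambda = (1 - \lambda) R_0 + \lambda R_1, \quad \lambda \in [0, 1].$$
Since the dynamics are shared and expectation is linear, the value of any fixed policy interpolates the same way, $V^\pi_\lambda = (1 - \lambda) V^\pi_0 + \lambda V^\pi_1$, which is the natural starting point for reasoning about the question.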

Kostya
6 votes, 2 answers
Why does shifting all the rewards have a different impact on the performance of the agent?
I am new to reinforcement learning. For my application, I have found that if my reward function contains both negative and positive values, my model does not give the optimal solution, but the solution is not bad, as it still gives positive…
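For a concrete sense of why a constant shift can matter (a standard observation, not taken from the question itself): adding $c$ to every reward adds a length-dependent term to the return,
$$\sum_{t=0}^{T-1} \gamma^t (r_t + c) = \sum_{t=0}^{T-1} \gamma^t r_t + c \, \frac{1 - \gamma^T}{1 - \gamma},$$
so in episodic tasks where the agent can influence the episode length $T$, the shift changes which policies are optimal (a positive shift rewards dragging episodes out, a negative one rewards ending them early), whereas in an infinite-horizon discounted setting it only adds the constant $c/(1-\gamma)$ to every state's value.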

Fishfish
6 votes, 1 answer
Why cannot an AI agent adjust the reward function directly?
In standard reinforcement learning, the reward function is specified by an AI designer and is external to the AI agent. The agent attempts to find a behaviour that collects a higher cumulative discounted reward. In Evolutionary Reinforcement Learning…

rodan
6 votes, 2 answers
Reinforcement Learning with long term rewards and fixed states and actions
I have read a lot about RL algorithms that update the action-value function at each step with the reward gained at that step. The requirement here is that a reward is obtained after each step.
I have a case where I have three steps that have to…

Jan
5 votes, 1 answer
Non-differentiable reward function to update a neural network
In Reinforcement Learning, when the reward function is not differentiable, a policy gradient algorithm is used to update the weights of a network. In the paper Neural Architecture Search with Reinforcement Learning, they use the accuracy of one neural…
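The reason a non-differentiable reward is not a problem is visible in the score-function (REINFORCE) estimator, sketched here in its usual form:
$$\nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\left[R(a)\right] = \mathbb{E}_{a \sim \pi_\theta}\left[R(a) \, \nabla_\theta \log \pi_\theta(a)\right].$$
The gradient falls only on $\log \pi_\theta$, so $R$ (e.g. the validation accuracy of a sampled architecture) is treated as a black-box number and is never differentiated.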

samsambakster
5 votes, 1 answer
If the current state is $S_t$ and the actions are chosen according to $\pi$, what is the expectation of $R_{t+1}$ in terms of $\pi$ and $p$?
I'm trying to solve Exercise 3.11 from Sutton and Barto's book (2nd edition):
Exercise 3.11 If the current state is $S_t$ , and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms…
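For orientation, the quantity the exercise asks for can be written in the book's notation, using the four-argument dynamics function $p(s', r \mid s, a)$, as
$$\mathbb{E}_\pi\left[R_{t+1} \mid S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \, r.$$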

tmaric
4 votes, 1 answer
$E_{\pi}[R_{t+1}|S_t=s,A_t=a] = E[R_{t+1}|S_t=s,A_t=a]$?
I would like to solve the first question of Exercise 3.19 from Sutton and Barto:
Exercise 3.19 The value of an action, $q_{\pi}(s, a)$, depends on the expected next reward and
the expected sum of the remaining rewards. Again we can think of this in…
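In the book's notation, the decomposition the exercise points at reads
$$q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \, v_\pi(s')\right],$$
and the equality in the title holds because, once both the state and the action are fixed, the distribution of the immediate reward no longer depends on the policy $\pi$.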

user
4 votes, 2 answers
Why is regret so defined in MABs?
Consider a multi-armed bandit (MAB). There are $k$ arms, with reward distributions $R_i$ where $1 \leq i \leq k$. Let $\mu_i$ denote the mean of the $i^{th}$ distribution.
If we run the multi-armed bandit experiment for $T$ rounds, the "pseudo…
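The definition being referred to, the standard pseudo-regret with $\mu^* = \max_{1 \leq i \leq k} \mu_i$, is
$$\bar{\mathcal{R}}_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{A_t}\right],$$
i.e. the expected gap, over $T$ rounds, between always playing the best arm in expectation and playing the arms $A_t$ the algorithm actually chose.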

stoic-santiago