In general, a greedy "action" is an action that leads to an immediate "benefit". For example, Dijkstra's algorithm can be considered a greedy algorithm because, at every step, it selects the node with the smallest distance "estimate" from the initial (or starting) node. In reinforcement learning, a greedy action often refers to the action that leads to the highest immediate reward (disregarding possible future rewards). However, a greedy action can also mean the action that leads to the highest possible return (that is, the greedy action can also take into account not just the immediate reward but also future ones).
In your case, I think that "greedy action" can mean different things, depending on whether you act greedily with respect to the reward function or with respect to one of the value functions.
I would like to note that you are using a different notation for the reward function for each of the two value functions, but this does not need to be the case. So, your reward function might be denoted by $R_s^a$ even if you use $v_\pi(s)$. I will use the notation $R_s^a$ for simplicity.
So, if you have access to the reward function for a given state and action, $R^a_s = r(s, a)$, then the greedy action (with respect to the reward function $r$) would just be the action from state $s$ with the highest immediate reward. Formally, we can define it as $a_\text{greedy} = \arg \max_a r(s, a)$ (this holds whether you have the state value function or the state-action value function: it does not matter which one you have). In other words, if you have access to the reward function (in that form), you can act greedily from any state without needing the value functions at all: you have a "model" of the rewards that you will obtain.
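As a minimal sketch (assuming a small tabular problem with a made-up reward table `R`, which is not part of the question), acting greedily with respect to the reward function is just an argmax over the immediate rewards available in the current state:

```python
import numpy as np

# Hypothetical reward model R[s, a] = r(s, a); the numbers are made up.
R = np.array([
    [1.0, 0.5, -0.2],  # immediate rewards for the 3 actions in state 0
    [0.0, 2.0,  0.3],  # immediate rewards for the 3 actions in state 1
])

def greedy_action_wrt_reward(R, s):
    """Return argmax_a r(s, a): the action with the highest immediate reward."""
    return int(np.argmax(R[s]))

print(greedy_action_wrt_reward(R, s=0))  # -> 0
print(greedy_action_wrt_reward(R, s=1))  # -> 1
```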
If you have $q_\pi(s, a)$ (that is, the state-action value function for a fixed policy $\pi$), then, at time step $t$, the greedy action (with respect to $q_\pi(s, a)$) from state $s$ is $a_\text{greedy} = \arg \max_{a} q_\pi(s, a)$. If you then take action $a_\text{greedy}$ in the environment, you would obtain the highest expected discounted future reward (that is, the return) according to $q_\pi(s, a)$, which might actually not be the highest possible return from $s$, because $q_\pi(s, a)$ might not be the optimal state-action value function. If $q_\pi(s, a) = q_{\pi^*}(s, a)$ (that is, if you have the optimal state-action value function), then, if you execute $a_\text{greedy}$ in the environment, you will theoretically obtain the highest possible return from $s$.
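To make the distinction concrete, here is a tiny sketch (with made-up numbers for a single state) showing that the greedy action with respect to $q_\pi$ need not coincide with the greedy action with respect to $q_{\pi^*}$:

```python
import numpy as np

# Made-up value estimates for one state with 3 actions.
q_pi      = np.array([0.8, 1.2, 0.5])  # q_pi(s, a) under some fixed (non-optimal) policy pi
q_pi_star = np.array([2.0, 1.2, 0.5])  # q_{pi*}(s, a), the (hypothetical) optimal values

greedy_wrt_pi   = int(np.argmax(q_pi))       # -> 1
greedy_wrt_star = int(np.argmax(q_pi_star))  # -> 0

# Acting greedily with respect to q_pi picks action 1, which is not the action
# with the highest possible return according to q_{pi*} (action 0).
print(greedy_wrt_pi, greedy_wrt_star)
```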
If you had the optimal value function (the value function associated with the optimal policy for acting in your environment), then the following equation holds: $v_*(s) = \max_{a} q_{\pi^*}(s, a)$. So, in that case, $a_\text{greedy} = \arg \max_{a} q_{\pi^*}(s, a)$ would also be the greedy action if you had $v_*(s)$. If you only have $v_\pi(s)$ (without e.g. the Q function), I don't think you can act greedily: there is no way of knowing which action is the greedy action from $s$ by just having the value of state $s$. This is actually why we often estimate the Q functions for "control", i.e. for acting in the environment.
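The following sketch (again with made-up numbers) just illustrates the relation $v_*(s) = \max_{a} q_{\pi^*}(s, a)$ and why knowing only $v_*(s)$ is not enough to recover the greedy action:

```python
import numpy as np

# Hypothetical optimal Q-values for one state with 3 actions (numbers are made up).
q_star = np.array([2.0, 1.2, 0.5])

v_star = float(q_star.max())     # v_*(s) = max_a q_{pi*}(s, a) -> 2.0
greedy = int(np.argmax(q_star))  # greedy action -> 0

# Knowing only v_star = 2.0 does not tell you which action attains it:
# recovering the greedy action requires the Q-values themselves, which is
# one reason Q functions are estimated for "control".
print(v_star, greedy)
```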