
I'm training a robot to walk to a specific $(x, y)$ point using TD3, and, for simplicity, I have something like reward = distance_x + distance_y + standing_up_straight, which is then added to the replay buffer. However, I think it would be more efficient if the agent could break the reward down by category, so it could figure out "that action gave me a good distance_x, but I still need to work on distance_y and standing_up_straight".
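
Roughly what I have in mind is something like the sketch below (the class and names are just placeholders I made up, not real TD3 code): keep the reward as a vector with one entry per component, rather than summing it into a scalar before it goes into the replay buffer.

```python
import numpy as np

# Sketch only: a replay buffer that stores the reward as a vector
# (one entry per component) instead of a pre-summed scalar.
REWARD_KEYS = ["distance_x", "distance_y", "standing_up_straight"]

class VectorRewardBuffer:
    def __init__(self, capacity, obs_dim, act_dim, n_rewards=len(REWARD_KEYS)):
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, act_dim), dtype=np.float32)
        self.rew = np.zeros((capacity, n_rewards), dtype=np.float32)  # vector reward
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.done = np.zeros(capacity, dtype=np.float32)
        self.capacity, self.idx, self.size = capacity, 0, 0

    def add(self, obs, act, reward_dict, next_obs, done):
        self.obs[self.idx] = obs
        self.act[self.idx] = act
        self.rew[self.idx] = [reward_dict[k] for k in REWARD_KEYS]  # keep components separate
        self.next_obs[self.idx] = next_obs
        self.done[self.idx] = float(done)
        self.idx = (self.idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        i = np.random.randint(0, self.size, size=batch_size)
        return self.obs[i], self.act[i], self.rew[i], self.next_obs[i], self.done[i]
```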

Are there any existing algorithms that handle rewards this way? Or has this approach been tried and shown not to be effective?

pinkie pAI

2 Answers


If I understood correctly, you're looking for Multi-Objective Reinforcement Learning (MORL). Keep in mind, however, that many researchers follow the reward hypothesis (Sutton and Barto), which says that

All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)

The argument for a scalar reward could be that even if you define your objective as a vector (as in MORL), you will end up with a Pareto front of optimal policies, some of which favour one component of the objective over another. That leaves you (the scientist) responsible for making the ultimate decision about the trade-off between objectives, which eventually degenerates the vector objective back into a scalar.

In your example there might be two different "optimal" policies: one which achieves a very high distance_x but a relatively poor distance_y, and one that favours distance_y instead. It will be up to you to find the sweet spot and collapse the reward function back to a scalar.
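
To make that concrete, here is a minimal sketch (the weights are arbitrary placeholders, not anything prescribed by TD3 or MORL): once you commit to a linear trade-off, the vector reward collapses back into the scalar the algorithm already expects.

```python
import numpy as np

# Minimal sketch of linear scalarisation: a fixed trade-off collapses the
# vector reward back into a single scalar. The weights are arbitrary.
weights = np.array([0.4, 0.4, 0.2])  # distance_x, distance_y, standing_up_straight

def scalarise(reward_vector, w=weights):
    """Collapse a vector reward into a scalar via a linear trade-off."""
    return float(np.dot(w, reward_vector))

r = np.array([1.0, 0.2, 0.5])  # example vector reward from one time step
print(scalarise(r))            # 0.58
```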


I agree with Tomasz that the approach you are describing falls within the field of MORL. For a solid introduction to MORL I would recommend the survey: Roijers, D. M., Vamplew, P., Whiteson, S., & Dazeley, R. (2013). A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48, 67-113.

https://www.jair.org/index.php/jair/article/view/10836 (disclaimer: I'm an author of this paper, but I genuinely believe it will be useful to you).

Our survey provides arguments for the need for multi-objective methods by describing three scenarios where agents using single-objective RL may be unable to provide a satisfactory solution which matches the needs of the user. Briefly, these are:

(a) the unknown weights scenario, where the required trade-off between the objectives isn't known in advance, so to be effective the agent must learn multiple policies corresponding to different trade-offs and then, at run time, select the one which matches the current preferences (this can arise, for example, when the objectives correspond to different costs which vary in relative price over time);

(b) the decision support scenario, where scalarization of a reward vector is not viable (for example, in the case of subjective preferences which defy explicit quantification), so the agent needs to learn a set of policies and then present these to a user who will select their preferred option; and

(c) the known weights scenario, where the desired trade-off between objectives is known but its nature is such that the returns are non-additive (i.e. the user's utility function is non-linear), so standard single-objective methods based on the Bellman equation can't be applied directly. A run-time selection sketch covering (a) and (c) is shown below.
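
As a rough illustration of (a) and (c) (the policies and their vector returns below are entirely made up), run-time selection simply means picking, from the set of learned policies, the one whose estimated vector return maximises the user's current utility function, whether that utility is linear or not:

```python
import numpy as np

# Made-up set of learned policies with estimated vector returns over two objectives.
policy_returns = {
    "policy_A": np.array([10.0, 2.0]),   # strong on objective 1, weak on objective 2
    "policy_B": np.array([6.0, 6.0]),    # balanced
    "policy_C": np.array([2.0, 10.0]),   # strong on objective 2
}

def select_policy(utility):
    """Pick the policy whose estimated vector return maximises the utility."""
    return max(policy_returns, key=lambda name: utility(policy_returns[name]))

# (a) unknown weights: a linear utility whose weights only become known at run time.
linear_utility = lambda v: float(np.dot([0.8, 0.2], v))
print(select_policy(linear_utility))     # -> policy_A

# (c) known weights but non-linear utility: e.g. the user only cares about the
# worst objective, so returns are non-additive and plain Bellman methods don't apply.
nonlinear_utility = lambda v: float(np.min(v))
print(select_policy(nonlinear_utility))  # -> policy_B
```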

We propose a taxonomy of MORL problems in terms of the number of policies they require (single or multi-policy), the form of utility/scalarization function supported (linear or non-linear), and whether deterministic or stochastic policies are allowed, and relate this to the nature of the set of solutions which the MO algorithm needs to output. This taxonomy is then used to categorise existing MO planning and MORL methods.

One final important contribution is identifying the distinction between maximising the Expected Scalarised Return (ESR) and the Scalarised Expected Return (SER). The former is appropriate in cases where we care about the result of each individual episode (for example, when treating a patient, that patient will only care about their own individual experience), while SER is appropriate if we care about the average return over multiple episodes. This has turned out to be a much more important issue than I anticipated at the time of the survey, and Diederik Roijers and his colleagues have examined it more closely since then (e.g. http://roijers.info/pub/esr_paper.pdf).
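
Roughly, in symbols (just a sketch, with $u$ denoting the utility/scalarisation function and $\vec{R} = \sum_t \gamma^t \vec{r}_t$ the vector return): ESR maximises $\mathbb{E}\left[u(\vec{R})\right]$, whereas SER maximises $u\left(\mathbb{E}[\vec{R}]\right)$. The two coincide when $u$ is linear, but can differ substantially when it is non-linear.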

Peter Vamplew
  • Given that this answer is basically a copy-and-paste of your other answer, I would actually suggest that, in this case, you just give a little introduction and ask the OP to read your other answer. In this case, it's fine because the answer is on the site (and this is really just a copy and paste of the other). – nbro Jan 13 '21 at 13:22