
I am training an agent to do object avoidance. The agent has control over its steering angle and its speed. The steering angle and speed are normalized to a $[-1,1]$ range, where the sign encodes direction (i.e. a speed of $-1$ means the agent is going backwards at maximum speed).

My reward function penalises the agent for colliding with an obstacle and rewards it for moving away from its starting position. At a time $t$, the reward, $R_t$, is defined as $$ R_t= \begin{cases} r_{\text{collision}},&\text{if collides,}\\ \lambda_d\left(\|\mathbf{p}^{x,y}_t-\mathbf{p}_0^{x,y}\|_2-\|\mathbf{p}_{t-1}^{x,y}-\mathbf{p}_0^{x,y}\|_2 \right),&\text{otherwise,} \end{cases} $$ where $\lambda_d$ is a scaling factor and $\mathbf{p}_t$ gives the pose of the agent at time $t$. The idea is that we should reward the agent for moving away from the initial position (and in a sense 'exploring' the map—I'm not sure if this is a good way of incentivizing exploration, but I digress).
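For concreteness, here is a minimal Python sketch of this reward; the collision penalty and $\lambda_d$ values below are placeholders rather than my tuned values:

```python
import numpy as np

R_COLLISION = -10.0   # placeholder collision penalty
LAMBDA_D = 1.0        # placeholder scaling factor lambda_d

def reward(p_t, p_prev, p_0, collided):
    """p_t, p_prev, p_0 are 2D (x, y) positions as numpy arrays."""
    if collided:
        return R_COLLISION
    # Reward the change in distance from the starting position.
    return LAMBDA_D * (np.linalg.norm(p_t - p_0) - np.linalg.norm(p_prev - p_0))
```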

My environment is an unknown two-dimensional map containing circular obstacles of varying radii. The agent is equipped with a sensor that measures the distance to nearby obstacles (similar to a 2D LiDAR sensor). The figure below shows the environment along with the agent.

[Figure: top-down view of the environment, showing the circular obstacles and the agent]

Since I'm trying to model a car, I want the agent to be able to go forward and reverse; however, when training, the agent's movement is very jerky. It quickly switches between going forward (positive speed) and reversing (negative speed). This is what I'm talking about.

One idea I had was to penalise the agent when it reverses. While that did significantly reduce the jittery behaviour, it also caused the agent to collide with obstacles on purpose. In fact, over time, the average episode length decreased. I think this is the agent's response to the reverse penalties: negative rewards incentivize the agent to reach a terminal state as fast as possible, and in our case the only terminal state is obstacle collision.

So then I tried rewarding the agent for going forward instead of penalising it for reversing, but that did not seem to do much. Evidently, trying to correct the jerky behaviour directly through rewards is not the proper approach, but I'm also not sure how I can do it any other way. Maybe I just need to rethink what my reward signal wants the agent to achieve?

How can I rework the reward function to have the agent move around the map, covering as much distance as possible, while also maintaining smooth movement?


1 Answer


I think you should reason in terms of the total "area" explored by the agent rather than "how far" it moves from the initial point, and you should also add reward terms that push the agent to steer more often. The problem with your setting is more or less this: the agent goes as straight as it can because you're rewarding it for that; when it starts sensing an obstacle it stops; and since there is no reward for steering, the best strategy to get away from the obstacle without ending the episode is simply to go backwards.

Considering that you have information about the grid at any time, you could rewrite the reward function in terms of grid squares explored, checking at each step whether the agent ends up in a new grid square:

$$ R_t= \begin{cases} r_{\text{collision}},&\text{if collides,}\\ \lambda_d\left(\|\mathbf{p}^{x,y}_t-\mathbf{p}_0^{x,y}\|_2-\|\mathbf{p}_{t-1}^{x,y}-\mathbf{p}_0^{x,y}\|_2 \right) + r_{\text{new-square-explored}},&\text{otherwise.} \end{cases} $$
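As a rough sketch of how the new-square bonus could be tracked (the cell size and bonus value here are arbitrary placeholders, assuming you discretise the map into a grid):

```python
import math

CELL_SIZE = 1.0      # side length of a grid square (arbitrary choice)
R_NEW_CELL = 0.5     # bonus for entering a previously unvisited square

visited_cells = set()

def exploration_bonus(x, y):
    """Return the bonus if the agent's (x, y) position lands in a new grid cell."""
    cell = (math.floor(x / CELL_SIZE), math.floor(y / CELL_SIZE))
    if cell not in visited_cells:
        visited_cells.add(cell)
        return R_NEW_CELL
    return 0.0
```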

Moreover, it would be useful to add reward terms related to how the agent avoids obstacles: for example, a penalty when the sensor reading drops and remains below a certain threshold (so the agent learns not to stay too close to an obstacle), but also a reward when an obstacle is detected and the agent manages to maintain a certain distance from it. (If not well tuned, this second term could lead the agent to simply run in circles around a single obstacle, but tuned properly I think it can help make the agent's movements smoother.)
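A possible sketch of these two shaping terms, using the minimum range reading from the sensor (the thresholds and weights are illustrative and would need tuning):

```python
D_MIN = 0.5          # below this distance the agent is "too close"
D_SAFE = 1.5         # detection range within which keeping distance is rewarded
R_TOO_CLOSE = -0.2
R_KEEP_DISTANCE = 0.05

def obstacle_shaping(sensor_distances):
    """sensor_distances: iterable of range readings to nearby obstacles."""
    d = min(sensor_distances)
    if d < D_MIN:
        return R_TOO_CLOSE       # penalise staying too close to an obstacle
    if d < D_SAFE:
        return R_KEEP_DISTANCE   # small reward for keeping a safe distance
    return 0.0                   # no obstacle detected nearby
```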

  • I hadn't thought of it like that... you're right! Mind clarifying the $r_{\text{new-square-explored}}$ term? Should this be the new cells the agent visited at time $t$? – Shon Verch Aug 31 '20 at 15:29
  • @ShonVerch yes, exactly. Reward $n$ if it visits a new square at time $t$, or 0 if it doesn't (even a penalty could be a possibility; it depends on how much the agent should move around). Of course this works only if you have complete information about the environment, but that seems to be the case for your setting. – Edoardo Guerriero Aug 31 '20 at 15:41
  • Alright, cool. Quick question about reinforcement learning more generally: is it fine if my reward function uses information that is not accessible to the agent (i.e. not part of the observation space) at inference? Because while I have information about the grid, I only have that in the simulated environment. – Shon Verch Aug 31 '20 at 15:49