I'm trying to train an agent on a self-written 2D environment, and it just doesn't converge to a solution.
It is basically a 2D game where you move a small circle around the screen and try to avoid collisions with randomly moving "enemy" circles and with the edges of the screen. The enemy positions are initialized randomly, at a minimum distance of 2 diameters from the player. The player circle has $n$ sensors (lasers), each measuring the distance and speed of the closest object it hits.
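Each sensor is essentially a ray cast from the player's center. A minimal sketch of the distance measurement (illustrative names, screen edges omitted; not my exact code):

```python
import numpy as np

def ray_distance(player_pos, ray_dir, enemy_pos, enemy_radius):
    # Distance along a unit ray from the player's center to the surface of one
    # enemy circle; np.inf if the ray misses it or the hit is behind the player.
    to_center = enemy_pos - player_pos
    proj = np.dot(ray_dir, to_center)            # closest-approach distance along the ray
    disc = proj ** 2 - (np.dot(to_center, to_center) - enemy_radius ** 2)
    if disc < 0:
        return np.inf                            # ray misses the circle
    t = proj - np.sqrt(disc)                     # nearer of the two intersections
    return t if t >= 0 else np.inf

def sensor_readings(player_pos, angles, enemies):
    # For each sensor angle, distance to the closest enemy hit by that ray.
    # `enemies` is a list of (position, radius) pairs.
    readings = []
    for a in angles:
        direction = np.array([np.cos(a), np.sin(a)])
        readings.append(min(ray_distance(player_pos, direction, pos, r)
                            for pos, r in enemies))
    return np.array(readings)
```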
The observation space is continuous and consists of the concatenated distances and speeds, so an observation lives in $\mathbb{R}^{3n}$.
I scale the distances by the length of the screen diagonal.
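The observation is then assembled roughly like this (a simplified sketch; the speed part here is the detected object's 2-D velocity, which is what gives the 3 values per sensor):

```python
import numpy as np

def build_observation(distances, velocities, screen_w, screen_h):
    # distances: shape (n,), closest-hit distance per sensor
    # velocities: shape (n, 2), velocity of the detected object per sensor
    diag = np.hypot(screen_w, screen_h)    # screen diagonal length
    scaled = distances / diag              # distances scaled to roughly [0, 1]
    return np.concatenate([scaled, velocities.ravel()]).astype(np.float32)
```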
The action space is discrete (MultiDiscrete in my implementation): $(dx, dy) \in \{-1, 0, 1\}^2$.
The reward is +1 for every game step survived without a collision.
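Put together, the relevant parts of the env look roughly like this (a skeleton with gym spaces; gymnasium is analogous, and the names and the movement/collision details are placeholders rather than my exact code):

```python
import numpy as np
import gym

class DodgeEnv(gym.Env):
    # Skeleton showing only the spaces and the reward logic.

    def __init__(self, n_sensors=8):
        super().__init__()
        # 3 values per sensor: scaled distance plus the detected object's velocity
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(3 * n_sensors,), dtype=np.float32)
        # One sub-action per axis; indices {0, 1, 2} map to moves {-1, 0, +1}
        self.action_space = gym.spaces.MultiDiscrete([3, 3])

    def reset(self):
        # ... place the player and the enemies here ...
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        dx, dy = int(action[0]) - 1, int(action[1]) - 1   # {0,1,2} -> {-1,0,+1}
        # ... move the player by (dx, dy), move the enemies, recompute sensors ...
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)  # real sensor readings go here
        collided = False                                  # real collision check goes here
        reward = 0.0 if collided else 1.0                 # +1 per step survived
        return obs, reward, collided, {}
```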
I use the PPO implementation from Stable Baselines, but the return variance just keeps growing over training. On top of that, the agent has not tried to run away from the enemies even once. I even tried making the reward negative, to test whether it could at least learn a "suicide" behavior and seek out collisions, but that gave no results either.
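The training call itself is essentially the stock one, sketched here with Stable Baselines3 defaults (with the original Stable Baselines it would be `PPO2`); the timestep count is just a placeholder:

```python
from stable_baselines3 import PPO

env = DodgeEnv(n_sensors=8)                 # the skeleton env sketched above
model = PPO("MlpPolicy", env, verbose=1)    # default PPO hyperparameters
model.learn(total_timesteps=1_000_000)      # placeholder training budget
model.save("dodge_ppo")
```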
I thought that maybe some degenerate policies, like going to a corner of the screen and staying there, can collect large returns and that this jeopardizes the training. So I increased the number of enemies, thinking that this would force the agent to actually learn to avoid them, but that didn't work either.
I'm really out of ideas at this point and would appreciate some help debugging this.