
I am using TD3 on a custom gym environment, but the problem is that the action values saturate at the bounds of the action space. Actions at the bounds give a negative reward; to get a positive reward, the agent has to pick action values somewhere in the middle of the range. However, the agent never learns that and keeps the action values at the maximum.

I am using a one-step termination environment (the environment needs an action only once per episode).
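For context, here is a minimal sketch of what I mean by a one-step environment; the reward function below is only a hypothetical placeholder, not my actual one:

```python
import gym
import numpy as np

class OneStepEnv(gym.Env):
    """Minimal one-step environment: one action per episode, then done."""

    def __init__(self):
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

    def reset(self):
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        # Placeholder reward: positive for mid-range actions, negative near the bounds.
        reward = float(1.0 - 2.0 * np.mean(np.abs(action)))
        obs = np.zeros(1, dtype=np.float32)
        return obs, reward, True, {}  # episode terminates after a single step
```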

How can I improve my model? I want the action values to stay roughly within 80% of their maximum values.

In DDPG, we have inverted gradients, but could something similar be applied to TD3 so that the actor searches more within the legal action space?
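For reference, this is roughly how I imagine the inverted-gradients trick could be bolted onto a TD3 actor update (a sketch in PyTorch under my own assumptions; `actor`, `critic1`, `actor_optim`, and the action bounds are placeholders, not from any particular library):

```python
import torch

def invert_gradients(grad_a, actions, a_min=-1.0, a_max=1.0):
    # Inverted gradients (Hausknecht & Stone): scale each component of dQ/da
    # by the remaining headroom inside [a_min, a_max], so updates that would
    # push an action past a bound shrink towards zero.
    rng = a_max - a_min
    headroom_up = (a_max - actions) / rng
    headroom_down = (actions - a_min) / rng
    return torch.where(grad_a > 0, grad_a * headroom_up, grad_a * headroom_down)

def actor_update(actor, critic1, actor_optim, states, a_min=-1.0, a_max=1.0):
    actions = actor(states)
    # Compute dQ/da on a detached copy so the critic graph stays separate.
    a_for_grad = actions.detach().requires_grad_(True)
    q = critic1(states, a_for_grad)
    grad_a = torch.autograd.grad(q.sum(), a_for_grad)[0]
    grad_a = invert_gradients(grad_a, a_for_grad.detach(), a_min, a_max)
    # Gradient ascent on Q: backpropagate the negated, inverted action
    # gradient through the actor parameters.
    actor_optim.zero_grad()
    actions.backward(-grad_a)
    actor_optim.step()
```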

The score decreases as the number of episodes increases.

[Plot: episode score versus episode number, showing the score decreasing over training]

  • What do you mean by "the action values stick to the end"? Can you also explain why "Sticking to the end values makes reward negative" and "to be positive it must find action values somewhere in the mid" are true? – nbro Apr 14 '22 at 12:33
  • could you please elaborate more about the solution? I am facing the same problem, and I am unsure how to solve it. – Raz Apr 13 '22 at 22:51

1 Answer


I found the solution: it was changing the reward function and using reward scaling. A small change in the architecture and the learning rate fixed the problem.
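For anyone wondering what "reward scaling" means here: one common way to do it (just a sketch of the general idea, not necessarily exactly what I used) is to divide each reward by a running estimate of the reward standard deviation, so that the critic targets stay in a small, stable range:

```python
import numpy as np

class RewardScaler:
    """Running (Welford) estimate of the reward std; rewards are divided by it."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.eps = eps

    def scale(self, reward):
        # Update the running mean/variance with the new reward.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return reward / std

# Usage: scaled = scaler.scale(raw_reward) before storing the transition
# in the replay buffer.
```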
