
I'm working on a continuous state / continuous action controller. It should control the roll angle of an aircraft by issuing appropriate aileron commands (in $[-1, 1]$).

To this end, I use a neural network and the DDPG algorithm, which shows promising results after about 20 minutes of training.

I stripped down the state presented to the model to only the roll angle and the angular velocity, so that the neural network is not overwhelmed by state inputs.

So it's a 2-input / 1-output model for the control task.

In test runs, it mostly looks good, but sometimes the controller starts thrashing, i.e. it outputs fluttering commands, like a very fast bang-bang control, which causes rapid movement of the aileron.


Even though this behavior roughly maintains the desired target value, it is absolutely undesirable; the output should stay smooth instead. So far, I have not been able to detect any particular disturbance that triggers this behavior. It comes out of the blue.

Does anybody have an idea or a hint (maybe a paper reference) on how to incorporate some element (maybe reward shaping during training) to avoid such behavior? How can I avoid rapid actuator movements in favor of smooth ones?

I tried including the last action in the presented state and adding a punishment component to my reward, but this did not really help, so I'm obviously doing something wrong.
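For reference, one way to feed the last action back into the presented state is a thin environment wrapper like the minimal sketch below. This is only an illustration of the idea, not the setup actually used in the question; a Gym-style `reset`/`step` interface and the class name are assumptions.

```python
import numpy as np

class PrevActionObservationWrapper:
    """Sketch: append the previous aileron command to the (roll, roll rate)
    observation, so the 2-input model becomes a 3-input model that can
    'see' its own last action."""

    def __init__(self, env):
        self.env = env
        self.prev_action = np.zeros(1)  # single aileron command in [-1, 1]

    def reset(self):
        self.prev_action = np.zeros(1)
        return np.concatenate([self.env.reset(), self.prev_action])

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.prev_action = np.asarray(action, dtype=float).reshape(1)
        return np.concatenate([obs, self.prev_action]), reward, done, info
```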

  • Try applying a first-order filter to the output. That should smooth out the control signal (a sketch follows after these comments). – Brale Jan 31 '20 at 13:07
  • This should definitely work. It's kind of the engineering approach, and I think I'll try it. However, there should be a way to prevent the agent from learning such behavior in the first place, so good ideas are still welcome. – opt12 Jan 31 '20 at 14:18
  • Btw, are you learning through the entire trajectory? I don't know if you're familiar with adaptive control, but there, if you do online parameter estimation, you need persistent excitation; otherwise the estimation problem becomes singular and the model estimate drifts off. A similar thing might happen here: the network keeps getting the same learning examples from steady state and 'forgets' how to control the system. You might try adding a disturbance signal every now and then to excite the dynamics, so that the experience replay contains a variety of different examples, or you could stop learning during steady state. – Brale Jan 31 '20 at 14:42
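Regarding the first-order filter suggested in the first comment, a minimal sketch is shown below. It is an exponential low-pass filter applied to the raw policy output before it is sent to the actuator; the smoothing factor `alpha = 0.2` and all names are illustrative assumptions, not values from the question.

```python
import numpy as np

class FirstOrderActionFilter:
    """Exponential (first-order) low-pass filter on the raw policy output.
    alpha close to 1 passes commands through almost unchanged; smaller alpha
    smooths more but adds control lag."""

    def __init__(self, alpha=0.2, initial=0.0):
        self.alpha = alpha                    # illustrative value, tune for the plant
        self.filtered = np.array([initial])

    def __call__(self, raw_action):
        self.filtered = ((1.0 - self.alpha) * self.filtered
                         + self.alpha * np.asarray(raw_action, dtype=float))
        return np.clip(self.filtered, -1.0, 1.0)  # keep command in [-1, 1]

# usage (names are placeholders): aileron_cmd = action_filter(actor(observation))
```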

1 Answer


After some research on the subject, I found a possible solution to my problem of high-frequency oscillations in continuous control with DDPG:

I added a reward component based on the actuator movement, i.e. the delta of actions from one step to the next.

Excessive action changes are now punished, which mitigates the tendency to oscillate. The solution is not perfect, but it works for the moment.
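A minimal sketch of such a reward component is shown below, assuming the tracking reward has already been computed elsewhere; the weight `delta_coef` and the absolute-difference penalty form are illustrative choices, not the exact formulation from the thesis.

```python
import numpy as np

def shaped_reward(tracking_reward, action, prev_action, delta_coef=0.05):
    """Subtract a penalty proportional to the change in actuator command
    between consecutive steps, so rapid bang-bang-like switching becomes
    costly for the agent."""
    action_delta = np.abs(np.asarray(action) - np.asarray(prev_action)).sum()
    return tracking_reward - delta_coef * action_delta
```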

This finding is detailed in the "Reward Engineering" section of my master's thesis. Please have a look at https://github.com/opt12/Markov-Pilot/tree/master/thesis

I'd be glad to get feedback on it, and I'd be glad to hear about better solutions than adding a delta punishment.

Regards, Felix
