
I am using DQN to find the optimal sequence of control inputs to a dynamic system. The setup is as follows:

  • At the beginning of each episode, the system is initialized to the SAME initial condition s0.
  • Each episode spans from 0 to tf seconds (i.e., every episode takes the same amount of time). A decision is made every h seconds, so there are tf/h iterations per episode.
  • In each iteration, the agent takes an action (it sets 4 control inputs; for each input there are 3 separate options to choose from). At the beginning of the next iteration, the agent observes the updated state and selects a new action (again choosing one of the 3 possible values in each of the 4 control channels). Each iteration takes h seconds.
  • The reward is computed at the end of the episode (i.e. after t=tf).
  • The state observation includes the current time t. Thus, the agent knows how much time it has left within the episode. (A minimal environment sketch follows this list.)
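For concreteness, here is a minimal sketch of how I think of the environment. The dynamics, the terminal reward, and the numerical values of tf and h are only placeholders; the 4 control channels with 3 options each are flattened into a single discrete action index of size 3^4 = 81, as is usual for DQN:

```python
import numpy as np

TF, H = 10.0, 0.5                 # episode length and decision interval in seconds (assumed values)
N_STEPS = int(TF / H)             # tf/h iterations per episode

# 4 control channels with 3 options each -> 3**4 = 81 combined discrete actions
ACTION_TABLE = [(a, b, c, d)
                for a in range(3) for b in range(3)
                for c in range(3) for d in range(3)]

def simulate_step(state, controls, dt):
    # placeholder dynamics: replace with the real system model
    return state + dt * (np.mean(controls) - 1.0)

def terminal_reward(state):
    # placeholder terminal reward, evaluated once at t = tf
    return -float(np.linalg.norm(state))

class ControlEnv:
    def __init__(self, s0):
        self.s0 = np.asarray(s0, dtype=float)

    def reset(self):
        self.state = self.s0.copy()   # same initial condition s0 every episode
        self.t = 0.0
        return self._obs()

    def _obs(self):
        # observation = physical state + current time, so the agent knows the remaining time
        return np.concatenate([self.state, [self.t]])

    def step(self, action_index):
        controls = ACTION_TABLE[action_index]                 # decode the 4 control inputs
        self.state = simulate_step(self.state, controls, H)
        self.t += H
        done = self.t >= TF - 1e-9
        reward = terminal_reward(self.state) if done else 0.0  # sparse reward at the end of the episode
        return self._obs(), reward, done
```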

The performance of the dynamic system depends on many parameters. (If the agent were controlling a car, example parameters would be the weight or the aerodynamic drag coefficient of the car.) Changing these parameters affects the dynamic behavior of the system, i.e. the environment, for the whole episode.

How can I include the optimization of these parameters in the problem? In essence, I want to find the optimal combination of time-varying control inputs and parameters.

I thought about including the parameters as actions. I would then have 5 action channels (4 control inputs with 3 options each, plus the parameter with x discrete values). However, considering that changing the parameter is only meaningful before starting an episode, taking this parameter action would only have an "effect" at t=0, i.e. when the first action is taken. (It makes no sense to change the parameters within an episode.) As a consequence, this "parameter action" would have zero effect for all t > 0. I do not know whether this hinders the learning of the DQN agent, since it increases the size of the action space.
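To make this idea concrete, here is a sketch of what the enlarged joint action space would look like, building on the environment sketch above. The number of parameter settings and their values (e.g. a vehicle mass) are pure assumptions:

```python
import numpy as np

# Assumed: x = N_PARAM_VALUES discrete settings of one parameter, e.g. the vehicle mass
N_PARAM_VALUES = 5
PARAM_VALUES = np.linspace(1000.0, 2000.0, N_PARAM_VALUES)   # assumed example values

# Joint action = (control combination, parameter choice)
# -> 3**4 * N_PARAM_VALUES = 405 actions instead of 81 (ACTION_TABLE from the sketch above)
JOINT_ACTIONS = [(controls, p)
                 for controls in ACTION_TABLE
                 for p in range(N_PARAM_VALUES)]

def decode_action(env, action_index, first_step):
    controls, p = JOINT_ACTIONS[action_index]
    if first_step:
        env.parameter = PARAM_VALUES[p]   # only meaningful at t = 0
    # for t > 0 the parameter part of the joint action is simply ignored
    return controls
```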

Another idea of mine was to pursue a bi-level approach: first train an agent that learns to drive the car optimally (with respect to my reward function), then apply this pretrained agent to a range of cars, each with different parameter settings, and see where it performs best. However, this approach is not really "optimal", as I never optimize both the control inputs and the parameters in one go...
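For completeness, a sketch of how that outer evaluation loop over parameter settings could look, reusing the environment and parameter grid from the sketches above. Here pretrained_policy (a function mapping observation to action index), S0, and the fact that the dynamics read env.parameter are all assumptions:

```python
# Bi-level idea: sweep a parameter grid with one fixed, pretrained policy
def evaluate(policy, parameter, n_episodes=5):
    env = ControlEnv(S0)              # S0 assumed to be defined elsewhere
    env.parameter = parameter         # dynamics would have to use this value
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        returns.append(total)
    return float(np.mean(returns))

best_parameter = max(PARAM_VALUES, key=lambda p: evaluate(pretrained_policy, p))
```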
