
Good day, it's a pleasure to have joined this Stack.

In my master's thesis I have to extend a deep reinforcement learning network, to be precise a Deep Q-Network (DQN), which is used to control machines in an electrical grid for power quality management.

What would be the best way to evaluate whether a network is doing a good job during training? Right now I have access to the reward function as well as the q_value function.

The rewards consist of 4 arrays, one for each learning criterion of the network. The first array is a hard criterion (adherence is mandatory) while the latter 3 are soft criteria:

Episode: 1/3000 Step: 1/11 Reward: [[1.0, 1.0, -1.0], [0.0, 0.68, 1.0], [0.55, 0.55, 0.55], [1.0, 0.62, 0.79]]
Episode: 1/3000 Step: 2/11 Reward: [[-1.0, 1.0, 1.0], [0.49, 0.46, 0.67], [0.58, 0.58, 0.58], [0.77, 0.84, 0.77]]
Episode: 1/3000 Step: 3/11 Reward: [[-1.0, 1.0, 1.0], [0.76, 0.46, 0.0], [0.67, 0.67, 0.67], [0.77, 0.84, 1.0]]

The q_values are arrays which I do not fully understand yet. Could one of you explain them to me? I read the official definition of q-values as the positive False Discovery Rate (from statistics). Can these values be used to evaluate neural network training? These are the Q-values for step 1:

Q-Values: [[ 0.6934726  -0.24258053 -0.10599071 -0.44178435  0.5393113  -0.60132784
  -0.07680141  0.97968364  0.7707691   0.57855517  0.16273917  0.44632837
   0.00799532 -0.53355324 -0.45182624  0.9229134  -1.0455914  -0.0765233
   0.37784138  0.14711905  0.10986999  0.08918551 -0.8189287   0.14438646
   0.8869624  -0.43251887  0.7742889  -0.7671829   0.07737591  0.2569678
   0.5102049   0.5132051  -0.31643414 -0.0042788  -0.66071266 -0.18251896
   0.7762838   0.15322062 -0.06284399  0.18447408 -0.9609979  -0.4508798
  -0.07925312  0.7503184   0.6858963  -1.0436649  -0.03167241  0.87660617
  -0.43605536 -0.28459656 -0.5564517   1.2478396  -1.1418368  -0.9335588
  -0.72871417  0.04163677  0.30343965 -0.30024529  0.08418611  0.19429305
   0.44063848 -0.5541725   0.5740701   0.76789933 -0.9621064   0.0272104
  -0.44953588  0.13415053 -0.07738207 -0.16188647  0.6667519   0.31965214
   0.3241703  -0.27273563 -0.07130697  0.49683014  0.32996863  0.485767
   0.39242893  0.40508035  0.3413986  -0.5895434  -0.05772913 -0.6172271
  -0.12423459  0.2693861   0.32966745 -0.16036317 -0.36371914 -0.04342368
   0.22878243 -0.09400887 -0.1134861   0.07647536  0.04724833  0.2907955
  -0.70616114  0.71054566  0.35959414 -1.0539075   0.19137645  1.1948669
  -0.21796732 -0.583844   -0.37989947  0.09840107  0.31991178  0.56294084]]

Are there other ways of evaluating DQNs? I would also appreciate literature on this subject. Thank you very much for your time.

  • Could you provide more detail? What is the action space (how many actions does your problem have)? – David May 15 '20 at 12:15

1 Answer


Q-values represent the expected return after taking action $a$ in state $s$, so they do tell you how good it is to take a particular action in a specific state. Better actions have larger Q-values. Q-values can be used to compare actions, but they are not very meaningful for judging the overall performance of the agent, since you have nothing to compare them against: you don't know the true Q-values, so you can't tell whether your agent is approximating them well or not.
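For intuition, in a DQN with a discrete action space the network outputs one Q-value per action for the current state, and the greedy policy simply picks the action with the largest value. A minimal sketch in Python (the array contents here are made up for illustration, not your actual output):

    import numpy as np

    # One Q-value per discrete action for the current state (illustrative numbers).
    q_values = np.array([0.69, -0.24, -0.11, 0.98, 0.54])

    # The greedy policy takes the action with the largest Q-value.
    greedy_action = int(np.argmax(q_values))
    print(greedy_action, q_values[greedy_action])  # -> 3 0.98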

A better performance metric would be the average reward per episode/epoch, or the average reward over the last $N$ timesteps for continuing tasks. If your agent is improving its performance, its average reward should be increasing. You said that you have rewards per state and that some of them represent more important criteria than others. You could plot the average reward per episode as a weighted linear combination of the criteria rewards \begin{equation} \bar R = w_1 \bar R_1 + w_2 \bar R_2 + w_3 \bar R_3 + w_4 \bar R_4 \end{equation} where $\bar R_i$ is the average episode reward for criterion $i$. That way you can give more importance to specific criteria in your evaluation.
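As a rough illustration of that weighted average, here is a minimal Python sketch; the variable names and the weights are assumptions for the example, and the reward arrays are copied from the logs in the question:

    import numpy as np

    # Per-step reward arrays of shape (4, 3), one row per criterion,
    # taken from the question's logs (first two steps of episode 1).
    episode_rewards = [
        np.array([[1.0, 1.0, -1.0], [0.0, 0.68, 1.0], [0.55, 0.55, 0.55], [1.0, 0.62, 0.79]]),
        np.array([[-1.0, 1.0, 1.0], [0.49, 0.46, 0.67], [0.58, 0.58, 0.58], [0.77, 0.84, 0.77]]),
    ]
    w = np.array([0.7, 0.1, 0.1, 0.1])  # example weights, hard criterion weighted highest

    # Average each criterion over all steps and entries of the episode ...
    per_criterion = np.mean(episode_rewards, axis=(0, 2))  # shape (4,)
    # ... then combine into a single scalar R_bar = sum_i w_i * R_bar_i
    episode_score = float(per_criterion @ w)
    print(per_criterion, episode_score)

Plotting this scalar per episode over training gives a single learning curve while still reflecting the relative importance of the criteria.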

Brale