
I am not using replay here. Could this be valid pseudocode for deep Q-learning?

s - state    
a - action    
r - reward
n_s - next state
q_net - neural network representing q

step()
{

    get s,a,r,n_s
    q_target[s,a]=r+gamma*max(q_net[n_s,:])
    loss=mse(q_target[s,a],q_net[s,a])
    loss.backprop()

}

while(!terminal)
{    
    totalReturn+=step();
}
nbro

1 Answer

It looks generally valid to me. There are a couple of things missing or only implied that I'd like to give feedback on, though:

I am not using replay here

Then it won't work, except for the simplest and most trivial of problems (where you probably would not need a neural network anyway). Without replay, consecutive updates are made on strongly correlated transitions, which tends to destabilise training of the neural network.
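
As a rough illustration of the missing piece - this is a minimal sketch, not code from your question, and names like ReplayBuffer, capacity and sample are my own choices:

import random
from collections import deque

# Minimal experience replay buffer (illustrative sketch).
# Transitions are stored as (s, a, r, n_s, terminal) tuples and sampled
# uniformly at random, which breaks the correlation between consecutive steps.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, n_s, terminal):
        self.buffer.append((s, a, r, n_s, terminal))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

Inside step() you would then add the latest transition to the buffer and, once it holds enough transitions, train q_net on a random minibatch drawn from it rather than on the single most recent transition.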

get s,a,r,n_s

I would make it more explicit where you get these values from, and split up the assignments.

# before the first call to step()
s = env.start_state()
a = behavior_policy(q_net[s,:])

  # inside step(), first action
  r, n_s = env.step(s, a)

  # ... rest of step(): compute target, loss, backprop ...

  # inside step(), last actions: prepare for the next step
  s = n_s
  a = behavior_policy(q_net[s,:])

In the above code, env is the environment, which takes state, action pairs as input and returns the reward plus the next state. It would be equally valid to have the current state as a property of the environment, and query that when needed. The behavior_policy is a function that selects an action based on the current action values - typically this might use an $\epsilon$-greedy selection.
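
For example, a minimal $\epsilon$-greedy behavior_policy could look something like this (epsilon and the tie-breaking behaviour are illustrative choices, not something fixed by your pseudocode):

import random

# Sketch of an epsilon-greedy behaviour policy.
# q_values is assumed to be an indexable collection of action values for one
# state, e.g. the row q_net[s,:] for the current state s.
def behavior_policy(q_values, epsilon=0.1):
    n_actions = len(q_values)
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore: uniformly random action
    # exploit: pick the action with the highest estimated value
    return max(range(n_actions), key=lambda a: q_values[a])

Typically epsilon would also be decayed over the course of training, but a fixed small value is enough to get started.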

while(!terminal) {

You appear to run just one episode in your pseudocode. You will want an outer loop to run many episodes. Also, it is not clear how you are deciding the value of terminal - in practice you will need a handle to the current environment state via some variable.

Without the outer loop to start new episodes, and some variables defined to communicate between sections of code, it is difficult to follow the code and decide where to add details (such as where the env.start_state() call should go, or whether something like s = env.reset() would be more appropriate).
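
Putting it together, the overall structure could look roughly like this - it is only a sketch, and env.reset(), a terminal flag returned from env.step(), and num_episodes are assumptions about your environment interface rather than part of your pseudocode:

# Outer loop over episodes (interface names are assumed, see above)
for episode in range(num_episodes):
    s = env.reset()                       # start state of a fresh episode
    terminal = False

    while not terminal:
        a = behavior_policy(q_net[s,:])   # choose action from current Q estimates
        r, n_s, terminal = env.step(s, a) # reward, next state, done flag

        # ... compute the target r + gamma * max(q_net[n_s,:]) and update q_net
        # here (use just r as the target when n_s is terminal) ...

        s = n_s

With this structure it is also clear where replay fits in: store (s, a, r, n_s, terminal) at each step and train on sampled minibatches.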

totalReturn+=step();

In your pseudocode you do not return anything from step(), and it is not clear what you hope to do with this totalReturn variable. Technically it won't equal the definition of return in RL for any state, not even the starting state, if $\gamma < 1.0$.

However, the sum of all rewards seen in an episode is a useful metric. In Deep RL it is OK to treat $\gamma$ as a solution hyperparameter, and your target metric can be the expected undiscounted return from the start state.
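
One simple way to track this - the variable names and the window of 100 episodes here are arbitrary choices for illustration:

from collections import deque

# Keep a sliding window of undiscounted per-episode returns as a training metric.
recent_returns = deque(maxlen=100)

def log_episode(episode_rewards):
    episode_return = sum(episode_rewards)   # raw sum of rewards, no gamma applied
    recent_returns.append(episode_return)
    # running mean over recent episodes, used as the target metric
    return sum(recent_returns) / len(recent_returns)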

Neil Slater