
In AlphaGo, the authors initialised the policy gradient network with weights trained by imitation learning. I believe this gives the policy gradient network a very good starting policy. The imitation network was trained on labelled (state, expert action) data to output a softmax policy, i.e. a probability distribution over actions for each state.

In my case, I would like to use the weights learnt by the imitation network as the initial weights for a DQN. Initial rewards from running the policy are high, but they decrease (for a while) as the DQN trains and only later increase again.

This could suggest that initialising the weights from the imitation network has little effect, since DQN training appears to undo the benefit of that initialisation.

calveeen
  • Could you clarify the training and transfer process? Are you training a policy network using imitation learning then copying the weights over to a value-predicting network? If you are doing so, could you clarify what the value range can be in your problem, and whether the value-predicting network uses any non-linearity in the output layer (the policy network presumably uses softmax)? – Neil Slater Nov 14 '20 at 12:38
  • Hi Neil, yes, I am training an imitation network using supervised learning first and then transferring the weights over to a value network. (The authors in AlphaGo used a policy network instead.) I am contemplating whether the transfer of weights would work for a value network rather than for a policy gradient network. The problem uses sparse rewards, which are observed only after many time steps, which is why I am using imitation learning to learn a good initial policy first. – calveeen Nov 14 '20 at 12:54
  • @calveeen Please, clarify what your main **specific** question is. Right now, that's not clear. Do not ask "Might anyone have an opinion on this?", i.e. do not ask for opinions, but ask a question that can be answered _objectively_. Is your question: "_Can we initialize the weights of a value network (or DQN) with the weights of another value network trained with imitation learning?_ Why doesn't this seem to work for me?" It's not clear, from your description, what exactly are you training by imitation learning (i.e. supervised learning). Is it a value network or a policy? – nbro Nov 14 '20 at 14:12
  • I suppose a value network, otherwise, not sure how you would transfer the policy's weights to a value network. Moreover, it may also be a good idea to explain more in detail what you mean by "imitation learning", i.e. which imitation learning technique are you using. It's also not clear what reward shaping has to do with your question. – nbro Nov 14 '20 at 14:18
  • Sorry, I have edited to make the question a little clearer – calveeen Nov 14 '20 at 14:50
  • Could you clarify what the value range can be in your problem, and whether the value-predicting network uses any non-linearity in the output layer? Typically a DQN uses linear output layer - is that the same in your case? – Neil Slater Nov 14 '20 at 15:19
  • To me, it's not clear yet what you obtain from the imitation learning phase, when you say "I would like to use weights learnt from the imitation network". In other words, is "imitation network" a value network (and not a policy) that you learn by imitation learning, right? Of course, this should be the case, but I just wanted you to clarify this. – nbro Nov 14 '20 at 15:31
  • @nbro: It is not possible to learn a value network via imitation learning, at least not by strict definitions. The data *could* be used in off-policy reinforcement learning - my deleted answer attempts to address what could go wrong there. However, I think I have now clarified that the OP is not doing that. Instead they have trained a policy network through imitation learning, and want to take advantage of that learning to set up a value-based method – Neil Slater Nov 14 '20 at 15:56
  • @NeilSlater Hm, well, it's true that if you have only a dataset of state-action pairs (i.e. without rewards and transitions), you cannot learn a value function (which is what I think you mean, though, in my mind, I hadn't restricted IL to those datasets, but also to any supervised way of learning a policy or value network with a dataset, which can be composed of just state-action pairs, or maybe it's a dataset like an experience replay, i.e. with transitions $(s, a, r, s')$). Now, the question is then: how can you initialize the value network with the policy network's weights? – nbro Nov 14 '20 at 16:02

1 Answer


My understanding is that you are first training a policy network using imitation learning, then adjusting that trained network in some way to be a value network for DQN. The most obvious change would be to remove the softmax activation whilst keeping the network layer sizes identical. The network would then output Q estimates for all actions from any given state.
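In code, that conversion could look something like the following. This is a minimal PyTorch sketch, not the setup from the question: the class name and layer sizes are made up for illustration, and the only structural assumption is that the policy network and the Q network share the same architecture, with the softmax living inside the imitation-learning loss rather than in the forward pass.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Shared architecture: used first as a policy network, then as a Q network."""
    def __init__(self, n_inputs, n_actions):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_inputs, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.head = nn.Linear(128, n_actions)

    def forward(self, x):
        # Raw outputs: logits for the policy network, Q estimates for the DQN.
        return self.head(self.body(x))

policy_net = Net(n_inputs=8, n_actions=4)
# ... imitation learning trains policy_net with a cross-entropy loss on
# (state, expert action) pairs; the softmax is applied inside the loss, so the
# forward pass already returns raw logits ...

q_net = Net(n_inputs=8, n_actions=4)
q_net.load_state_dict(policy_net.state_dict())  # copy every weight across
# q_net now outputs the old logits, which are *not* calibrated Q values.
```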

The initial estimates would not be trained Q values, though; they would be the "preferences", or logits, that supported a near-optimal action choice. What is likely to carry over into the new network is that, for that near-optimal action, the network predicts the highest value. As you derive the target policy by taking the maximising action, this initially looks good. However, the Q values that this network predicts can have little to no relation to the real expected returns experienced by the agent under the target policy.
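As a toy illustration (the numbers below are made up, not from any real agent): the copied logits can pick the same greedy action as the true action values while being on a completely different scale, so the very first TD errors are large.

```python
import torch
import torch.nn.functional as F

logits_as_q = torch.tensor([5.0, 1.0, -2.0])    # what the copied network outputs
true_q      = torch.tensor([0.30, 0.28, 0.10])  # plausible discounted returns

print(torch.argmax(logits_as_q) == torch.argmax(true_q))  # tensor(True): same greedy action
print(F.mse_loss(logits_as_q, true_q))                    # ~9.0: large errors drive big weight updates
```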

Initial rewards from running the policy are high, but they decrease (for a while) as the DQN trains and only later increase again.

I think what is happening is that initially the greedy policy derived from your Q network is very similar to the policy learned during imitation learning, but the value estimates are very wrong. This leads to large errors, large corrections, and radical changes to network weights throughout, in order to change the network from an approximate policy function into an approximate action value function. The loss of performance occurs because there is no smooth transition between these two very different functions that also maintains the correct maximising actions.

I don't think this can be completely fixed. However, you might get some insight into potential work-arounds by considering that you are not just doing imitation learning here. Instead, you are performing both imitation learning (to copy a near-optimal policy) and transfer learning (to re-use network weights on a related task).

Approaches that help with transfer learning may also help here. For instance, you could freeze the layers closer to input features, or reduce the learning rate for those layers. You do either of these things on the assumption that the low-level derived features (in the hidden layers) that the first network has learned are still useful for the new task.
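For example, with the (hypothetical) q_net from the earlier sketch, both work-arounds are a few lines in PyTorch:

```python
import torch

# Option 1: freeze the layers closest to the input, train only the output head.
for p in q_net.body.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(q_net.head.parameters(), lr=1e-3)

# Option 2: keep everything trainable, but update the early layers more slowly.
optimizer = torch.optim.Adam([
    {"params": q_net.body.parameters(), "lr": 1e-5},
    {"params": q_net.head.parameters(), "lr": 1e-3},
])
```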

Neil Slater
  • Thanks for the advice @Neil :-) Also, with regard to the policy gradient network that AlphaGo uses with initialisation from imitation learning, would you happen to know what kind of policy gradient network they used? I would think that if they used a sort of actor-critic network, then the initial benefits of having a good policy might be "erased" by an untrained critic network – calveeen Nov 15 '20 at 05:58
  • @calveeen: "Policy Gradient" is not a type of network, but a type of training - there are constraints on the type of network though, such as its output must represent probabilities of taking actions. In AlphaGo the "imitation training" policy network took the board state as input, and predicted where a human expert player would play. It was combined with other networks *later* using MCTS to create the full AlphaGo agent. At least two of the other networks were trained using a variant of Actor-Critic reinforcement learning (a policy gradient approach). – Neil Slater Nov 15 '20 at 11:11
  • I see, thank you for clarifying the terminology. I think the authors referred to the policy network as the "RL policy network" and they mentioned training it with self-play. However, they did not mention the specific "type" of training, in the sense that they could be using the vanilla REINFORCE algorithm or actor-critic methods, though I highly doubt they used a vanilla policy gradient – calveeen Nov 15 '20 at 15:17