In policy gradients, is it possible to learn the policy if the chain of actions is selected and performed manually/externally (e.g. by myself or by someone else over whom I have no influence)?

For example, suppose we have four actions. At the beginning I choose action 2 and we end up in some state, then I choose action 4 and we end up in another state, and so on (the actions may or may not follow some logic, but the question is general; some of the action choices will end up with positive rewards).

Can we learn any meaningful policy network from such a chain of actions?

cerebrou
  • Hello. It may be a good idea to explain why you would like to do this, and how exactly you plan to choose the actions. – nbro Oct 13 '21 at 13:09
  • It seems that what you're proposing is related to [imitation learning](https://ai.stackexchange.com/q/9595/2444), but maybe I am wrong, as imitation learning is really not reinforcement learning but supervised learning applied to an RL problem. You're suggesting that humans provide/specify the behavioural policy. I don't remember all the details of policy gradients well enough to answer this question, but, in the case of Q-learning (which is off-policy), you could in principle use any exploratory policy that explores the environment enough, so even a policy decided by a human (see the Q-learning sketch after these comments). – nbro Oct 13 '21 at 15:45
  • For policy gradient, the policy (i.e., which actions to take) is fully determined by the parameter $\theta$. Then, gradient ascent can be used to optimize this parameter. However, when you have external interference, the assumption that the policy is determined by $\theta$ no longer holds. In this case, I guess you need other methods (see the policy-gradient sketch after these comments). –  Oct 16 '21 at 01:07
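
As the first comment above notes, Q-learning is off-policy, so in principle it can learn from actions supplied by an external controller. Below is a minimal, hypothetical sketch of what that might look like; the classic Gym-style `reset()`/`step()` interface and the `external_action` callback are assumptions for illustration, not part of the question.

```python
# Minimal tabular Q-learning sketch (hypothetical): the action at each step is
# supplied by an external source (e.g. a human) instead of the learned policy.
import numpy as np

def q_learning_external(env, external_action, n_episodes=500,
                        alpha=0.1, gamma=0.99):
    """env: environment with discrete states/actions (classic Gym-style API assumed);
    external_action(state): action chosen by the human/external controller."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = external_action(s)            # action chosen externally, not by Q
            s_next, r, done, _ = env.step(a)
            # Off-policy target: max over actions, regardless of which action was taken
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```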
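For a policy-gradient method like REINFORCE, externally chosen actions make the data off-policy, so each update term has to be reweighted by an importance ratio $\pi_\theta(a \mid s) / b(a \mid s)$, where $b$ is the external behaviour policy. A rough sketch follows, assuming PyTorch and a hypothetical `behaviour_prob(s, a)` function; if the external policy's probabilities are unknown, this ratio cannot be formed.

```python
# Off-policy REINFORCE sketch (hypothetical): actions come from an external
# behaviour policy b, and each term is reweighted by pi_theta / b.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Small categorical policy over discrete actions (illustrative sizes)."""
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)

def off_policy_pg_update(policy, optimizer, episode, behaviour_prob, gamma=0.99):
    """episode: list of (state, action, reward) with externally chosen actions;
    behaviour_prob(s, a): probability the external actor assigned to a in s."""
    # Discounted returns G_t for every step of the episode.
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    loss, rho = 0.0, 1.0
    for (s, a, _), G in zip(episode, returns):
        probs = policy(torch.as_tensor(s, dtype=torch.float32))
        # Per-decision importance weight: product of pi_theta / b up to time t.
        rho = rho * (probs[a] / behaviour_prob(s, a)).detach()
        loss = loss - rho * torch.log(probs[a]) * G  # importance-weighted REINFORCE term

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The cumulative importance weight keeps the estimator unbiased in principle, but its variance grows quickly over long trajectories, which is one reason data generated entirely by an external actor is more often handled with imitation learning or off-policy value-based methods than with vanilla policy gradients.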

0 Answers