
In a general DQN framework, if I have an idea of some actions being better than some other actions, is it possible to make the agent select the better actions more often?

nbro
user3656142
  • You should construct your reward function so that it reflects the main goal; the algorithm will then learn the best action by itself – Aray Karjauv Nov 15 '20 at 13:41
  • Take a look at my answers to similar questions: [this one](https://ai.stackexchange.com/questions/24157/how-should-i-define-the-reward-function-to-solve-the-wumpus-game-with-deep-q-lea/24164#24164) and [one more](https://ai.stackexchange.com/questions/18425/how-can-i-implement-the-reward-function-for-an-8-dof-robot-arm-with-trpo/18436#18436) – Aray Karjauv Nov 15 '20 at 13:47

1 Answer


For single-step Q learning, the behaviour policy can be any stochastic policy without any further adjustment to the update rules.

You don't have to use $\epsilon$-greedy based on the current Q function approximation, although that is a common choice because it works well in general. However, you should always allow some chance of taking every action if you want the algorithm to converge - if you fixed things so that bad actions were never taken, the agent would never learn that they had low value.

Probably the simplest way to use your initial idea of the best actions is to write a function that returns your assessment of which action to take, and use that with some probability in preference to a completely random choice. At some point you will also want to stop referencing the helper function (unless it is guaranteed perfect) and use some form of standard $\epsilon$-greedy based on the current Q values.
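
As a rough sketch (not tied to any particular DQN library - `suggested_action`, `q_network` and the probabilities are all placeholders you would substitute with your own), the behaviour policy could look like this:

```python
import numpy as np

def select_action(state, q_network, suggested_action, n_actions,
                  epsilon=0.1, helper_prob=0.5):
    """Behaviour policy: explore at random, defer to the helper, or act greedily on Q."""
    if np.random.rand() < epsilon:
        # Keep some chance of every action, so "bad" actions still get their values learned
        return np.random.randint(n_actions)
    if np.random.rand() < helper_prob:
        # Use your hand-written assessment of the best action
        return suggested_action(state)
    # Otherwise act greedily on the current Q estimates
    return int(np.argmax(q_network(state)))
```

Annealing `helper_prob` towards zero over training is one simple way to retire the helper and hand the choice back to plain $\epsilon$-greedy on the learned Q values.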

I have done something similar with a DQN learning to play Connect 4, where the agent would use a look-ahead search function that could see e.g. 7 steps ahead. If that search was inconclusive, the agent would use the argmax of the current Q values. Both of these fixed action choices could be replaced, with probability $\epsilon$, by a random action choice to ensure exploration. It worked very well. You could replace the look-ahead search in my example with any function that returns "best" actions for any reason.
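
In outline, that selection logic was something like the sketch below (not my actual code; `lookahead_best_action` is a placeholder for the search and is assumed to return `None` when it is inconclusive):

```python
import numpy as np

def play_action(state, q_network, lookahead_best_action, n_actions, epsilon=0.1):
    """Search first, then greedy Q, with an epsilon chance of a random move."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)    # exploration
    forced = lookahead_best_action(state)      # e.g. a fixed-depth negamax search
    if forced is not None:
        return forced                          # the search found a decisive move
    return int(np.argmax(q_network(state)))    # otherwise trust the current Q values
```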

There are some other ways you can skew action selection towards better-looking action choices. You could look into Boltzmann exploration or upper confidence bounds (UCB) as other ways to create behaviour policies for DQN.
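
For instance, Boltzmann exploration turns the Q values themselves into a sampling distribution, so better-looking actions are picked more often without ever excluding the rest. A minimal sketch:

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                               # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(probs), p=probs))
```

Lower temperatures make this closer to greedy; higher ones make it closer to uniform random.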

Neil Slater
  • The Connect 4 example looks to be a very nice approach to implement what I need. When you implement the look-ahead search, do you mean looking ahead in time at what the consequences of the actions might be? Doesn't that mean you're still somewhat doing an argmax over the actions? – user3656142 Nov 15 '20 at 15:36
  • @user3656142: My search chose by win/loss only, ignoring Q values. So it was an independent function. Yes it was an argmax over that independent function. The function was technically a negamax ( https://en.wikipedia.org/wiki/Negamax ) based on the game rules. – Neil Slater Nov 15 '20 at 19:08
  • @NeilSlater: So your suggestion is that at each step a predefined action is chosen with a certain probability, and with the remaining probability a Deep-Q-Learning action should be chosen? – PeterBe Nov 15 '21 at 13:14
  • @PeterBe: My suggestion for OP here is to use a modified $\epsilon$-greedy, where they could use their initial guesses as best action in place of the greedy action. In practice, they will also want to retire the best guess component at some stage - whilst I didn't need to for Connect 4, because a negamax search that resolves to the end of the game is always perfect. – Neil Slater Nov 15 '21 at 14:15
  • @NeilSlater: Thanks for your comment. But did the OP not ask whether you can make some actions more likely in Deep-Q-Learning? Is your suggestion not to use Deep-Q-Learning but instead use $\epsilon$-greedy? Or is your suggestion actually trying to combine $\epsilon$-greedy with Deep-Q-Learning? Further, I don't understand at all the second part of your comment "whilst I didn't need to for Connect 4, because a negamax search that resolves to the end of the game is always perfect." --> What do you mean by negamax search and why is it perfect at the end of the game? – PeterBe Nov 15 '21 at 15:51
  • 1
    @PeterBe: As per the answer Q-Learning action choice is flexible, so the OP is still "using" Deep-Q-Learning, but without the most commonly chosen behaviour policy. Negamax search: https://en.wikipedia.org/wiki/Negamax - it is generally only used in 2-player games, but within those is a well-known AI technique. If you have further questions about that, they would definitely be on topic for this site. – Neil Slater Nov 15 '21 at 17:22