
Let's say we are in an environment where a random agent can easily explore all the states (for example, tic-tac-toe).

In such environments, when using an off-policy algorithm, is it good practice to train using exclusively random actions, instead of epsilon-greedy, Boltzmann exploration, or similar strategies?

To my mind, it seems logical, but I have never heard of this being done before.
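To make the idea concrete, here is a minimal sketch of what I have in mind. It uses tabular Q-learning on a made-up toy chain environment instead of DQN on tic-tac-toe, just to keep it short; the environment, the `step` function, and the hyperparameters are all invented for illustration. The only point is that the behaviour policy is uniformly random, while the learning target is still the off-policy greedy one:

```python
import random
from collections import defaultdict

# Made-up toy chain MDP, used as a stand-in for an easily explorable game such
# as tic-tac-toe: states 0..N_STATES-1, action 0 moves left, action 1 moves
# right, and reaching the last state gives a reward of 1 and ends the episode.
N_STATES, N_ACTIONS = 6, 2

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # Q[(state, action)], initialised to 0

for episode in range(5000):
    state, done = 0, False
    while not done:
        # Behaviour policy: purely random -- no epsilon-greedy, no Boltzmann.
        action = random.randrange(N_ACTIONS)
        next_state, reward, done = step(state, action)
        # Off-policy (Q-learning) update: the target is greedy w.r.t. Q,
        # even though the data was collected by the random policy.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in range(N_ACTIONS))
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# Greedy policy extracted from Q: it picks "right" (action 1) in every
# non-terminal state, i.e. the optimal policy, despite never having exploited.
greedy_policy = {s: max(range(N_ACTIONS), key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(greedy_policy)
```

At least in this tiny example, the greedy policy extracted at the end is the optimal one, which is what makes me wonder whether epsilon-greedy or Boltzmann exploration would add anything in environments of this kind.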

Loheek
  • What makes you think that, given that DQN uses experience replay (which is what I assume you mean by "DQN is offline"), you never need to exploit? Can you also confirm that by "offline" you mean that it uses experience replay, or did you mean something else? – nbro Nov 07 '20 at 21:41
  • By "offline" I mean training a policy with samples not collected by this policy (isn't it the literature meaning ?). When you ask "then you never need to exploit" ? this is my question too. If we are using offline algo, and if we are able to collect data from all possible states with only random actions, is there a need for standard exploration strategies like e-greedy or Boltzmann, instead of a simple random exploration ? – Loheek Nov 07 '20 at 22:56
  • That's "off-policy" and not "offline". Check [this](https://ai.stackexchange.com/q/10474/2444). – nbro Nov 07 '20 at 22:59
  • Thank you for this correction! I was confused; I have updated my question. – Loheek Nov 07 '20 at 23:14
  • Ok, but I still don't fully understand the relationship between "being off-policy" and "always taking random actions" that you have in mind. Maybe you're thinking that, given that they are off-policy, in principle, you can use any behaviour policy, so you can just take random actions with off-policy algorithms. I think that would be possible. In fact, as far as I know, tabular Q-learning converges provided that you explore all states enough, which a random policy should do. However, note that, with function approximation (i.e. DQN), "enough exploration" may not be enough for convergence. – nbro Nov 07 '20 at 23:22
  • See this post: https://ai.stackexchange.com/q/11679/2444. – nbro Nov 07 '20 at 23:27
  • I've edited your post to put in the title what I think is your main question/concern. Please, make sure that's the case. Feel free to edit again your post to correct it, if that's not the case. – nbro Nov 07 '20 at 23:32
  • There is a misunderstanding: my question was exclusively about environments where random exploration can reach every state, and you didn't mention that in your edit (that is the purpose of my post). – Loheek Nov 07 '20 at 23:51
  • As far as I know, an off-policy RL algorithm needs to exploit to discover states it couldn't reach with random moves. The purpose of my question is to know whether that is necessary in environments where all states can be reached with random moves. If you agree there is a misunderstanding, can you please restore the post? – Loheek Nov 07 '20 at 23:56
  • Do you know the difference between "exploitation" and "exploration" in RL? From what you said (i.e. you said "need to exploit to discover states"), it seems you don't know that these terms have specific meanings in RL. See [this answer](https://ai.stackexchange.com/a/23679/2444). – nbro Nov 08 '20 at 11:34
  • Thank you, but I know the difference perfectly well (same as offline/off-policy, it was just a typo yesterday because I wrote the post late). I did not say "need to exploit to discover states", I said "need to exploit to discover states it couldn't reach with random moves". In most environments, you cannot reach complex states with only random exploration, so a good exploration/exploitation tradeoff allows the agent to further explore difficult-to-reach states. And so, I am asking why exploitation should be necessary in an environment where random exploration can reach all states. – Loheek Nov 08 '20 at 14:03
  • If you have infinite time and lives and all states are reachable from the start state, you could reach all states by random actions, in all environments where these conditions are met. I still don't understand what your question is, to be honest. The main requirement for convergence in tabular Q-learning is that you explore enough (i.e., in principle, you can just take random actions to "explore enough"), and not that you exploit enough (i.e. take the best current action), so that "you update the values of all states" (this is the idea). – nbro Nov 08 '20 at 14:06
  • Note that by taking random actions, you eventually end up in _all_ different states, so you eventually reach all states (implied by taking random actions) and perform all actions (by definition of your behaviour policy). That's why you can "explore enough" with a random policy. Now, the question is: is it a good idea? Probably not, given that exploitation may help to converge faster (i.e. you should be updating values for actions that are best to reach to some, at least, local optima). – nbro Nov 08 '20 at 14:11
  • Why are you talking about tabular Q-learning? The question was about algorithms such as DQN and environments where you cannot store all the states in a lookup table, and thus prefer to use an approximate Q-function. "You can just take random actions to 'explore enough'" -> that is an answer to my question. So you are saying exploitation is not necessary in those cases. As for "if you have infinite time and lives", I really don't see why it should be more time-expensive than an optimized policy in environments such as tic-tac-toe that are easily explorable with random actions. – Loheek Nov 08 '20 at 14:24
  • I was talking about tabular Q-learning because in that case Q-learning is guaranteed to converge. DQN is Q-learning with function approximation and experience replay. As far as I know, DQN is not guaranteed to converge to the optimal value function, that's why I was restricting my reasoning to the tabular case. – nbro Nov 08 '20 at 14:28
  • "Note that by taking random actions, you eventually end up in all different states" -> The condition of my question is that we are in environments where an agent that takes random moves can reach all states as well as any other possible policies – Loheek Nov 08 '20 at 14:28
  • I posted this on Reddit (I didn't know there was an RL subreddit), so I will delete this post. I expect that, with your feedback, the post is better formulated. I would be curious to know whether it is clearer or not. Thank you for the time you spent helping me: https://www.reddit.com/r/reinforcementlearning/comments/jqvuwm/fullrandom_exploration_in_specific_environments/ – Loheek Nov 09 '20 at 17:42
  • To be honest, in that Reddit post, you're still apparently mixing the terms exploration and exploitation, so that post/question is not fully clear, as well as this one. I suggest that you reformulate your questions based on my suggestions. – nbro Nov 09 '20 at 18:03
  • Well, so I think there is still a misunderstanding. By "I removed the exploration strategy" I mean "use a random policy as the exploration strategy" (this is my only mention of exploration, so I suppose you're talking about that). I still don't see the mixing you are talking about. By the way, I got the exact responses I was looking for, and the others seem to understand it pretty well, so never mind. Thank you and all the best. – Loheek Nov 09 '20 at 18:12
  • Yes, exactly, I was referring to that part, "exploration strategy" (you should have said "I removed the exploitation"). Of course, I know what you mean, but that wording doesn't make the question very clear. That's what I mean. – nbro Nov 09 '20 at 18:14

0 Answers