
Premises of the question:

  • Behavior policy: ε-greedy (stochastic)
  • Target policy: greedy (deterministic)
  • Importance sampling: included

In off-policy Monte Carlo control, the behavior policy chooses the actions to follow, and the target policy learns from those actions. However, because of importance sampling, if the behavior policy chooses an action that the target policy does not consider the "best" action, then the importance sampling ratio is 0 and the algorithm discards whatever could have been learned.
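To make this concrete (using the notation from Sutton & Barto, and assuming the ordinary per-episode ratio), the weight applied to the return from time $t$ is

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

and since the target policy $\pi$ is greedy (deterministic), $\pi(A_k \mid S_k) = 0$ whenever the behavior policy $b$ took a non-greedy action at step $k$, which makes the entire product zero.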

My question, then, is: how can the target policy ever change its preferred action if an action value is only updated when the behavior policy happens to choose the same action as the target policy? And how is there any exploration if the target policy is greedy, given that the importance sampling ratio zeros out every action from the behavior policy that the target policy would not have chosen?
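Here is how I read the algorithm, as a rough Python sketch based on the off-policy MC control pseudocode in Sutton & Barto (the environment interface — `env.reset()`, `env.step()`, `env.actions` — and all names are my own assumptions):

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes, gamma=1.0, epsilon=0.1):
    """Off-policy MC control with weighted importance sampling,
    as I understand the pseudocode in Sutton & Barto (Sec. 5.7).
    Assumes env.reset() -> state, env.step(action) -> (state, reward, done),
    and env.actions listing the legal actions."""
    Q = defaultdict(float)   # action-value estimates Q(s, a)
    C = defaultdict(float)   # cumulative importance sampling weights C(s, a)
    target = {}              # greedy (deterministic) target policy pi(s)

    def greedy(state):
        # Greedy action w.r.t. the current Q; ties broken arbitrarily.
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Generate an episode by following the epsilon-greedy behavior policy b.
        episode = []
        state = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = greedy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Work backwards through the episode with weighted importance sampling.
        G = 0.0   # return
        W = 1.0   # importance sampling weight rho
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            target[state] = greedy(state)
            if action != target[state]:
                break   # pi(action|state) = 0, so W would become 0: stop updating
            # b(action|state) for epsilon-greedy when `action` is the greedy one:
            b_prob = 1 - epsilon + epsilon / len(env.actions)
            W *= 1.0 / b_prob

    return Q, target
```

In this reading, the inner loop breaks as soon as the behavior policy's action disagrees with the current greedy action, so only the greedy "tail" of each episode contributes updates — which is exactly the part I find confusing.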

Sutton's RL book says that "learning will be slow... if non-greedy actions are common".

What I don't understand is how the target policy can ever come to prefer a different action if the only actions that count are the ones on which the target and behavior policies agree.

I've been struggling with this for the past few days; please help.

Comment: This seems to be a duplicate of the linked post, but I will not vote to close it because I am also not super satisfied with the other answer(s), and I don't think they address everything in this post. So I believe another perspective on this topic could be beneficial to the community. – nbro Jan 28 '23 at 19:31
