
Consider the following algorithm from the textbook Reinforcement Learning: An Introduction (second edition) by Richard S. Sutton and Andrew G. Barto.

[Algorithm pseudocode from the book: REINFORCE, a Monte Carlo policy-gradient method]

While playing the game to generate an episode trajectory, how is the action selected by the agent? That is, how does the agent select action $A_i$ from state $S_i$ for $0 \le i \le T-1$? I have this doubt because the policy is stochastic and does not output a single action.

But the algorithm says to generate the episode following the stochastic policy function.

Is it always the action that has the highest probability in $\pi(a \mid s, \theta)$?

Note: The trajectory here is $(S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T)$.

hanugm
You just sample an action from your current policy. The actions taken will depend on the current version of the policy, but they will be sampled with the probability (or probability density, in the case of a continuous action space) given by the policy. This implies that actions with higher probability are more likely to be taken. – David Jan 10 '22 at 16:28

1 Answer


You sample according to the probability distribution $\pi(a \mid s, \theta)$, so you do not always take the action with the highest probability (otherwise there would be no exploration, only exploitation), but the most probable action should be sampled most often. However, keep in mind that the policy parameters $\theta$ change during learning, and with them the probability distribution. This implementation could be useful, as it shows exactly what I've just said.
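For concreteness, here is a minimal sketch of that sampling step (this is not the linked implementation; the state size, network architecture, and number of actions are made up for illustration), using a softmax policy over a discrete action space:

```python
import torch

# Hypothetical softmax policy network pi(a | s, theta): the state size (4)
# and number of actions (2) are arbitrary choices for this illustration.
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
    torch.nn.Softmax(dim=-1),
)

state = torch.rand(4)                        # S_t (placeholder state)
probs = policy(state)                        # pi(. | S_t, theta)
dist = torch.distributions.Categorical(probs)
action = dist.sample()                       # A_t ~ pi(. | S_t, theta), not argmax
log_prob = dist.log_prob(action)             # stored for the REINFORCE update
```

Repeating this at every time step (and feeding the sampled action back to the environment) produces the trajectory $S_0, A_0, R_1, \dots$ that the algorithm then uses for its updates.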

I've also seen another implementation that applied some kind of $\epsilon$-greedy rule on top of $\pi$ (i.e. with probability $1 - \epsilon$ you choose the greedy action and with probability $\epsilon$ you sample some other action), but I am not sure how common or useful that is in the basic REINFORCE algorithm. This question asks about additional exploration strategies in policy gradients, in case you're interested.
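If you wanted to experiment with that idea, such a rule might look like the sketch below (purely an assumption of how that hybrid could be written, not something the textbook pseudocode prescribes):

```python
import torch

def epsilon_greedy_from_pi(probs: torch.Tensor, epsilon: float = 0.1) -> int:
    """Illustrative only: take the most probable action under pi with
    probability 1 - epsilon, otherwise pick uniformly among the others."""
    greedy = int(torch.argmax(probs))
    if torch.rand(1).item() >= epsilon:
        return greedy
    others = [a for a in range(probs.numel()) if a != greedy]
    return others[int(torch.randint(len(others), (1,)))]
```

Note, though, that sampling actions from anything other than $\pi$ biases the plain REINFORCE gradient estimate, since the update assumes $A_t \sim \pi(\cdot \mid S_t, \theta)$, which is presumably why it is not part of the textbook algorithm.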

nbro