
In the context of Reinforcement Learning, I have seen that, for some algorithms, the policy $\pi$ is nothing but a neural network (for example, a feedforward neural network).

This policy is usually denoted $\pi_{\theta}$, suggesting the policy is parameterized by $\theta$.

Question 1: Does this mean that $\theta$ in this case would represent all the Neural Network's parameters?

Question 2: Can the notation $\pi_{\theta}(a_{t}|s_{t})$ be interpreted as "the probability that the neural network $\pi$ with parameters $\theta$ assigns to selecting action $a_{t}$ when it is given the state $s_{t}$ as input"?

Here is an example from the Hugging Face RL course that uses this kind of notation.

moth123

Yes. If the policy is represented by a neural network, then both of your statements describe the typical interpretation. – mikkola May 24 '23 at 17:46

1 Answer


Question 1: Does this mean that $\theta$ in this case would represent all the Neural Network's parameters?

Yes. It doesn't have to be a neural network, though. Any parametric model of the policy can be described using the same notation.
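To make this concrete, here is a minimal sketch (not from the original post) of a parametric policy that is not a neural network: a linear-softmax policy where $\theta$ is just a weight matrix, yet the same $\pi_{\theta}(a|s)$ notation applies. All names and numbers below are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def pi(theta, state):
    """Action probabilities pi_theta(. | s) for a linear-softmax policy.

    Here theta is a plain weight matrix, not a neural network, but the
    pi_theta(a | s) notation describes it just as well.
    """
    return softmax(theta @ state)

# Hypothetical parameters: 2 actions, 2 state features.
theta = np.array([[0.5, -0.2],
                  [0.1,  0.3]])
state = np.array([1.0, 0.5])
probs = pi(theta, state)       # a probability for each action
```

For a neural-network policy, `theta` would simply collect all of the network's weights and biases instead of a single matrix.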

Question 2: The notation $\pi_{\theta}(a_{t}|s_{t})$ can be interpreted as "the output probability of Neural Network $\pi$ with parameters $\theta$ of selecting action $a_{t}$ when being input the state $s_{t}$"?

Yes, sort of. The neural network does not itself make the selection, and it does not necessarily output probabilities. For example, in DQN a neural network typically outputs an estimated Q value for each action, and a separate piece of code converts those estimates into a policy and samples an action. So the notation also covers the behaviour of that additional code working together with the ANN.
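A hedged sketch of that split, with a stand-in function in place of a real Q-network (the values and epsilon-greedy scheme below are illustrative assumptions, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_values(state):
    # Stand-in for a Q-network forward pass; a real DQN would run the
    # state through a neural network. Returns one Q estimate per action.
    return np.array([1.2, 0.4, -0.3])

def select_action(state, epsilon=0.1):
    # The "separate piece of code": converts Q estimates into a policy
    # (epsilon-greedy here) and samples an action from it.
    q = q_values(state)
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # explore: uniform random action
    return int(np.argmax(q))               # exploit: greedy action
```

Note that the network outputs Q estimates, not probabilities; $\pi_{\theta}(a|s)$ here describes the combined behaviour of the network plus the epsilon-greedy selection code.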

In the Policy Gradient example that you linked, the ANN typically outputs either an array of discrete probabilities, from which the policy code samples one action, or the parameters of some distribution, e.g. the mean and variance of a normal distribution, which the rest of the policy code evaluates and samples from to choose an action in a continuous action space.
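The continuous-action case can be sketched like this, again with a stand-in linear layer in place of a full network (the weights `W` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_head(state, W):
    # Stand-in for the final layer of a policy network: one output for
    # the mean of the action distribution, one for its log standard
    # deviation (log std keeps the std positive after exponentiation).
    mean, log_std = W @ state
    return mean, np.exp(log_std)

def sample_action(state, W):
    # The rest of the policy code: sample a ~ Normal(mean, std^2),
    # i.e. draw an action from pi_theta(. | s) for a Gaussian policy.
    mean, std = policy_head(state, W)
    return rng.normal(mean, std)

W = np.array([[0.3, -0.1],    # row producing the mean
              [-2.0, 0.0]])   # row producing the log std
state = np.array([1.0, 0.5])
action = sample_action(state, W)
```

Again, the network only outputs the distribution's parameters; the sampling step that actually "selects" the action lives outside the network, and $\pi_{\theta}(a_t|s_t)$ refers to the resulting distribution as a whole.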

Neil Slater