
In the paper "Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems" (page 1083, 6th line from the bottom), the authors define the expectation under the empirical model as $$\hat{\mathbb{E}}_{s,s',a}[V(s')] = \sum_{s' \in S} \hat{P}^{a}_{s, s'}V(s').$$ I don't understand the significance of this quantity, since it puts $V(s')$ inside an expectation while assuming knowledge of $V(s')$ in the definition on the right.

A clarification in this regard would be appreciated.

EDIT: The paper defines $\hat{P}^{a}_{s, s'}$ as $$\hat{P}^{a}_{s, s'} = \frac{|(s, a, s', t)|}{|(s, a, t)|},$$ where $|(s, a, t)|$ is the number of times state $s$ was visited and action $a$ was taken, and $|(s, a, s', t)|$ is the number of those $|(s, a, t)|$ visits to $(s, a)$ after which the next state was $s'$ during model learning.

No explicit definition of $V$ is provided; however, $V^{\pi}$ is defined as the usual expected discounted return, following the same definition as Sutton and Barto and other sources.
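For concreteness, here is how I read these count-based definitions, as a small sketch (the array names and the sample transitions are my own, not from the paper): estimate $\hat{P}^{a}_{s,s'}$ from transition counts, then form $\sum_{s'} \hat{P}^{a}_{s,s'} V(s')$.

```python
import numpy as np

n_states, n_actions = 3, 2
counts = np.zeros((n_states, n_actions, n_states))  # |(s, a, s', t)|

# Hypothetical transitions (s, a, s') observed during model learning.
transitions = [(0, 1, 2), (0, 1, 2), (0, 1, 1), (2, 0, 0)]
for s, a, s_next in transitions:
    counts[s, a, s_next] += 1

sa_counts = counts.sum(axis=2, keepdims=True)        # |(s, a, t)|
P_hat = np.divide(counts, sa_counts,                 # empirical P(s' | s, a)
                  out=np.zeros_like(counts), where=sa_counts > 0)

V = np.array([1.0, 2.0, 3.0])                        # some known value per state
E_hat = P_hat @ V   # E_hat[s, a] = sum_{s'} P_hat[s, a, s'] * V[s']
print(E_hat[0, 1])  # empirical expected value of V at the next state from (s=0, a=1)
```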

ijuneja
  • I think I can interpret this if the LHS was $\mathbb{\hat{E}}_{s,a}[V(s')]$... does the paper definitely show $\mathbb{\hat{E}}_{s,s', a}$? Does the paper define $V$ in this context? – Neil Slater Jul 21 '20 at 13:20
  • @NeilSlater the paper does use $s, s', a$ in the notation. I have edited to add details. – ijuneja Jul 21 '20 at 13:39

1 Answer


If I understand your question correctly, the significance of this quantity comes from the fact that $s'$ is random. On the RHS of the equation, $V(\cdot)$ is assumed to be known for each state; what is being measured is the expected value of $V$ at the next state, given the current state and action, under the empirical transition model $\hat{P}$.
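To make that concrete, here is a tiny numerical sketch (the numbers are made up): $V$ is a fixed, known function of the states, while the next state $s'$ is random, so the sum is just the mean of $V(s')$ under the empirical transition probabilities for one fixed $(s, a)$.

```python
import numpy as np

V = np.array([0.0, 5.0, 10.0])        # known value for each state
P_hat_sa = np.array([0.2, 0.5, 0.3])  # empirical P(s' | s, a) for one fixed (s, a)

E_hat = np.dot(P_hat_sa, V)           # = 0.2*0 + 0.5*5 + 0.3*10 = 5.5
print(E_hat)
```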

harwiltz