I might just be overthinking a very simple question, but the following has been bugging me a lot.
Given an MDP with non-trivial state and action sets, we can run the SARSA algorithm to estimate the optimal state-action value function $Q^*(s,a)$ (and hence the optimal policy) via the following iteration:
$$Q(s_t,a_t)\leftarrow Q(s_t,a_t) + \alpha(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t)).$$
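For concreteness, here is a rough sketch of the iteration I have in mind (the Gymnasium-style `reset`/`step` interface and the $\varepsilon$-greedy behaviour policy are just illustrative choices on my part, not part of the question):

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps, rng):
    # Behaviour policy used only for illustration.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, alpha=0.1, gamma=0.9, eps=0.1,
          n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, _ = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps, rng)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = epsilon_greedy(Q, s_next, n_actions, eps, rng)
            # The update from the question:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```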
Assuming each state-action pair is visited infinitely often, fix one such pair $(s,a)$ and denote the (random) times at which it is visited by $t_1 < t_2 < t_3 < \dots < t_n < \dots$. For ease of notation, let $X_n = Q_{t_n}(s,a)$ (with $X_0$ the initial value $Q_0(s,a)$) and consider the sequence of random variables $$X_0, X_1, \dots, X_n, \dots$$
Can $\{X_n\}_{n\geq 0}$ be thought of as a discrete-time Markov chain on $\mathbb{R}$? My intuition says no, because the recurrence will look like $$X_{n+1} = (1-\alpha)X_n + \alpha\bigl(r_{t_n} + \gamma Q_{t_n}(s', a')\bigr),$$ where $(s',a') = (s_{t_n+1}, a_{t_n+1})$, and that last term $Q_{t_n}(s',a')$ depends on the whole path even if we condition on $X_n = x$.
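To make the object I am asking about concrete, here is a rough simulation sketch (the toy two-state MDP and the $\varepsilon$-greedy policy are arbitrary choices of mine) that records the sequence $X_n$ at the visit times of the fixed pair, together with the bootstrap term $Q_{t_n}(s',a')$ whose path dependence I am worried about:

```python
import numpy as np

def toy_step(s, a, rng):
    # Arbitrary toy dynamics on states {0, 1}: action 0 tends to stay,
    # action 1 tends to switch; reward 1 for landing in state 1.
    stay_prob = 0.8 if a == 0 else 0.3
    s_next = s if rng.random() < stay_prob else 1 - s
    r = 1.0 if s_next == 1 else 0.0
    return s_next, r

def run(alpha=0.1, gamma=0.9, eps=0.2, n_steps=10_000, pair=(0, 0), seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((2, 2))
    X = [Q[pair]]          # X_0 is the initial value Q_0(s, a)
    bootstrap = []         # records Q_{t_n}(s', a') at each visit of (s, a)
    s = 0
    a = int(np.argmax(Q[s])) if rng.random() > eps else int(rng.integers(2))
    for t in range(n_steps):
        s_next, r = toy_step(s, a, rng)
        a_next = int(np.argmax(Q[s_next])) if rng.random() > eps else int(rng.integers(2))
        if (s, a) == pair:
            bootstrap.append(Q[s_next, a_next])
        # SARSA update at time t
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        if (s, a) == pair:
            X.append(Q[pair])   # X_{n+1}, the value right after the n-th visit's update
        s, a = s_next, a_next
    return np.array(X), np.array(bootstrap)

X, B = run()
```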
However, I am not able to write a rigorous argument either way. I would greatly appreciate it if someone could resolve this question in either direction.