
I want to model an SMDP in which time is discretized, the transition time between two states follows an exponential distribution, and no reward is received during the transition.

What are the differences between $Q(\lambda)$ and Q-learning for this problem (an SMDP)? I actually want to extend the pseudo-code presented here to an SMDP problem with a discretized time horizon.


1 Answer


If you really just want an SMDP-version of the algorithm, which only needs to be capable of operating on the "high-level" time scale of macro-actions, you can relatively safely take the original pseudocode of whatever MDP-based algorithm you like, replace every occurrence of "action" with "macro-action", and you're pretty much done.
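To make that concrete, here is a minimal sketch (in Python, tabular, with illustrative parameter names and array shapes that are not from the question) of a single SMDP Q-learning update after a macro-action that ran for $\tau$ primitive steps. The only real change from the MDP version is that the bootstrapped term is discounted by $\gamma^{\tau}$, and $r$ is taken to be the (already discounted) reward accumulated while the macro-action was running:

```python
import numpy as np

def smdp_q_update(Q, s, a, r, s_next, tau, alpha=0.1, gamma=0.99):
    """One tabular SMDP Q-learning update.

    Q   : (n_states, n_macro_actions) array of action values
    r   : discounted cumulative reward collected while macro-action a ran
    tau : number of primitive time steps the macro-action lasted
    """
    # Discount the bootstrapped value by gamma**tau (the macro-action's
    # duration) instead of by a single factor of gamma.
    target = r + gamma ** tau * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```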

The only caveat I can think of in the case of $Q(\lambda)$ is that the "optimal" value for $\lambda$ is probably somewhat related to the amount of time that elapses during a macro-action... so intuitively I'd expect it to be best if the value of $\lambda$ decreases as the duration of the last macro-action increases. A constant $\lambda$ probably still works fine as well, though.
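If you want eligibility traces on top of that, one possible sketch of a Watkins-style $Q(\lambda)$ update for an SMDP transition of duration $\tau$ is below. The `lam_eff = lam ** tau` line is purely illustrative of the "decrease $\lambda$ with duration" intuition above; dropping it and using a constant `lam` is the standard choice:

```python
import numpy as np

def smdp_q_lambda_update(Q, e, s, a, r, s_next, tau,
                         alpha=0.1, gamma=0.99, lam=0.9):
    # Q and e are (n_states, n_macro_actions) arrays; r is the discounted
    # return collected while macro-action a ran for tau primitive steps.
    a_star = np.argmax(Q[s_next])                  # greedy macro-action in s'
    delta = r + gamma ** tau * Q[s_next, a_star] - Q[s, a]
    e[s, a] += 1.0                                 # accumulating trace

    # Illustrative only: shrink the effective lambda as the duration of the
    # last macro-action grows; use lam_eff = lam for a constant lambda.
    lam_eff = lam ** tau

    Q += alpha * delta * e
    e *= gamma ** tau * lam_eff                    # decay all traces
    # (With Watkins's Q(lambda), the traces would additionally be reset to
    # zero whenever the macro-action actually taken was exploratory.)
    return Q, e
```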


If you actually want your algorithm to also be aware of the lower-time-scale MDP underlying the SMDP, and not only treat macro-actions as "large actions" and be done with it... I'd recommend looking into the Options framework. There you get interesting ideas like intra-option updates, which may allow you to also perform learning while larger macro-actions (or options) are still in progress.
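For reference, a rough sketch of an intra-option Q-learning update in the spirit of Sutton, Precup & Singh (1999) is below; it is applied after every primitive step, so learning happens while an option is still executing. The `options` interface (a deterministic intra-option policy `pi(s)` and a termination probability `beta(s)`) is a hypothetical stand-in, not code from the paper:

```python
import numpy as np

def intra_option_update(Q, s, a, r, s_next, options, alpha=0.1, gamma=0.99):
    # Q is a (n_states, n_options) array; a is the primitive action just taken.
    for i, o in enumerate(options):
        if o.pi(s) != a:
            continue  # only update options consistent with the action taken
        # Value of continuing option o in s' versus terminating there and
        # switching to the best option.
        u = (1 - o.beta(s_next)) * Q[s_next, i] + o.beta(s_next) * np.max(Q[s_next])
        Q[s, i] += alpha * (r + gamma * u - Q[s, i])
    return Q
```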

Last time I looked, there hadn't been a lot of work combining eligibility traces and options, but there has been some: Eligibility Traces for Options. This paper doesn't specifically apply the algorithm you mentioned ($Q(\lambda)$), but it does discuss a bunch of other, much more recent and likely better, off-policy algorithms with eligibility traces.

Dennis Soemers
  • Dear @Dennis, can I know if the only differences will be the following: 1) take action $a$, observe $r, s^{\prime}, \tau$; 2) update the error: $\delta \leftarrow r + \gamma^{\tau} Q(s^{\prime}, a^{\ast}) - Q(s,a)$; 3) update the eligibility trace: $e(s,a) \leftarrow \lambda \gamma^{\tau} e(s,a)$? – Amin Mar 12 '19 at 19:13
  • @AminSh Yes, that looks correct to me, assuming that $\tau$ is the duration of the last "action" $a$ you've taken (where $a$ may be a "macro-action" that takes more than a single primitive time step). I assume that the $r$ you observe there is also a sum of rewards collected during execution of $a$, with appropriate discounting by $\gamma$ "inside" that sum as well, rather than $r$ being just a single primitive reward. – Dennis Soemers Mar 12 '19 at 19:38
  • Dear @Dennis, I really appreciate it. – Amin Mar 12 '19 at 19:39