5

I am trying to solve for $\lambda$ using temporal-difference learning. More specifically, I am trying to figure out which $\lambda$ I need so that $\text{TD}(\lambda)=\text{TD}(1)$ after one iteration, but I get an incorrect value of $\lambda$.

Here's my implementation.

from scipy.optimize import fsolve
import numpy as np


class TD_lambda:
    def __init__(self, probToState, valueEstimates, rewards):
        self.probToState = probToState
        self.valueEstimates = valueEstimates
        self.rewards = rewards
        self.td1 = self.get_vs0(1)

    def get_vs0(self, lambda_):
        probToState = self.probToState
        valueEstimates = self.valueEstimates
        rewards = self.rewards
        vs = dict(zip(['vs0', 'vs1', 'vs2', 'vs3', 'vs4', 'vs5', 'vs6'], list(valueEstimates)))

        vs5 = vs['vs5'] + 1*(rewards[6] + 1*vs['vs6'] - vs['vs5'])
        vs4 = vs['vs4'] + 1*(rewards[5] + lambda_*rewards[6] + lambda_*vs['vs6']
                             + (1-lambda_)*vs['vs5'] - vs['vs4'])
        vs3 = vs['vs3'] + 1*(rewards[4] + lambda_*rewards[5] + lambda_**2*rewards[6]
                             + lambda_**2*vs['vs6'] + lambda_*(1-lambda_)*vs['vs5']
                             + (1-lambda_)*vs['vs4'] - vs['vs3'])
        vs1 = vs['vs1'] + 1*(rewards[2] + lambda_*rewards[4] + lambda_**2*rewards[5]
                             + lambda_**3*rewards[6] + lambda_**3*vs['vs6']
                             + lambda_**2*(1-lambda_)*vs['vs5'] + lambda_*(1-lambda_)*vs['vs4']
                             + (1-lambda_)*vs['vs3'] - vs['vs1'])
        vs2 = vs['vs2'] + 1*(rewards[3] + lambda_*rewards[4] + lambda_**2*rewards[5]
                             + lambda_**3*rewards[6] + lambda_**3*vs['vs6']
                             + lambda_**2*(1-lambda_)*vs['vs5'] + lambda_*(1-lambda_)*vs['vs4']
                             + (1-lambda_)*vs['vs3'] - vs['vs2'])
        vs0 = (vs['vs0']
               + probToState*(rewards[0] + lambda_*rewards[2] + lambda_**2*rewards[4]
                              + lambda_**3*rewards[5] + lambda_**4*rewards[6]
                              + lambda_**4*vs['vs6'] + lambda_**3*(1-lambda_)*vs['vs5']
                              + lambda_**2*(1-lambda_)*vs['vs4'] + lambda_*(1-lambda_)*vs['vs3']
                              + (1-lambda_)*vs['vs1'] - vs['vs0'])
               + (1-probToState)*(rewards[1] + lambda_*rewards[3] + lambda_**2*rewards[4]
                                  + lambda_**3*rewards[5] + lambda_**4*rewards[6]
                                  + lambda_**4*vs['vs6'] + lambda_**3*(1-lambda_)*vs['vs5']
                                  + lambda_**2*(1-lambda_)*vs['vs4'] + lambda_*(1-lambda_)*vs['vs3']
                                  + (1-lambda_)*vs['vs2'] - vs['vs0']))
        return vs0

    def get_lambda(self, x0=np.linspace(0.1, 1, 10)):
        return fsolve(lambda lambda_: self.get_vs0(lambda_) - self.td1, x0)

The expected output is $0.20550275877409016$, but I am getting `array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])`.

I cannot understand what I am doing incorrectly.

TD = TD_lambda(probToState,valueEstimates,rewards)
TD.get_lambda()
# Output : array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

I am just using TD($\lambda$) for state 0 after one iteration. I am not required to see where it converges, so I don't update the value estimates.
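
For reference, here is the quick sanity check I run on the objective before calling `fsolve` (just a sketch, using the `TD` instance constructed above). Since `td1` is defined as `get_vs0(1)`, the objective is zero at $\lambda = 1$ by construction, so I scan strictly below 1:

lambdas = np.linspace(0, 0.99, 100)
residuals = [TD.get_vs0(lam) - TD.td1 for lam in lambdas]

# td1 = get_vs0(1), so the objective is trivially zero at lambda = 1;
# look for a sign change strictly below 1 instead
for lam, r_lo, r_hi in zip(lambdas[:-1], residuals[:-1], residuals[1:]):
    if r_lo * r_hi < 0:
        print("objective changes sign near lambda =", lam)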

Amanda
  • What is the order of your states in `valueEstimates` and of the rewards in `rewards`? Does the index match the state? For example, does index 0 represent the value of state 0, or are they flipped? I would say you flipped them, since state 6 has a value of -6.5, which makes no sense: state 6 is the last state and there are no rewards after it, so its value should be 0. – Brale May 20 '19 at 12:15
  • @Brale_ `rewards` represents the vector of rewards `{r0, r1, r2, r3, r4, r5, r6}`. Likewise for `valueEstimates`. – Amanda May 20 '19 at 12:21
  • Is your update correct? Why are you adding the discounted values of V(s5), V(s4), ... to each of the elements? Shouldn't it be just a factor of the rewards and the value of the last element in the episode? – Miguel Saraiva May 20 '19 at 12:31
  • @MiguelSaraiva I think so. It would be helpful to have your answer; I could then test some inputs. – Amanda May 20 '19 at 12:33

2 Answers

4

The $TD(\lambda)$ return has the following form: \begin{equation} G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n} \end{equation}

For your MDP, $TD(1)$ looks like this: \begin{align} G &= 0.64 (r_0 + r_2 + r_4 + r_5 + r_6) + 0.36(r_1 + r_3 + r_4 + r_5 + r_6)\\ G &\approx 6.164 \end{align}

$TD(\lambda)$ looks like this: \begin{equation} G_0^\lambda = (1-\lambda)[\lambda^0 G_{0:1} + \lambda^1 G_{0:2} + \lambda^2 G_{0:3} + \lambda^3 G_{0:4} + \lambda^4 G_{0:5}] \end{equation}

Now each $G$ term separately: \begin{align} G_{0:1} &= 0.64(r_0 + v_1) + 0.36(r_1 + v_2) \approx 7.864\\ G_{0:2} &= 0.64(r_0 + r_2 + v_3) + 0.36(r_1 + r_3 + v_3) \approx -5.336\\ G_{0:3} &= 0.64(r_0 + r_2 + r_4 + v_4) + 0.36(r_1 + r_3 + r_4 + v_4) \approx 25.864\\ G_{0:4} &= 0.64(r_0 + r_2 + r_4 + r_5 + v_5) + 0.36(r_1 + r_3 + r_4 + r_5 + v_5) \approx -11.936\\ G_{0:5} &= 0.64(r_0 + r_2 + r_4 + r_5 + r_6 + v_6) + 0.36(r_1 + r_3 + r_4 + r_5 + r_6 + v_6) \approx -0.336 \end{align}

Finally, we need to find $\lambda$ so that this return is equal to the $TD(1)$ return: \begin{equation} 6.164 = (1 - \lambda)[7.864 - 5.336\lambda + 25.864\lambda^2 - 11.936\lambda^3 - 0.336\lambda^4] \end{equation}

When you solve this equation, one of the solutions is $0.205029$, which is close to what you needed to get, allowing for numerical error. Your problem is that you only applied the transition probability in the first state, but that branching carries over to the future states as well.
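
As a quick numerical check of the equation above (a sketch only; the reward and value arrays are the example inputs quoted in the comments below, with $p = 0.64$ for the first transition):

import numpy as np
from scipy.optimize import brentq

# Assumed example inputs (the arrays quoted in the comments below)
r = [-2.4, 9.6, -7.8, 0.1, 3.4, -2.1, 7.9]
v = [0.0, 4.9, 7.8, -2.3, 25.5, -10.2, -6.5]
p = 0.64  # probToState

# TD(1) target: probability-weighted sum of rewards along the two paths
G1 = p * (r[0] + r[2] + r[4] + r[5] + r[6]) + (1 - p) * (r[1] + r[3] + r[4] + r[5] + r[6])

# k-step returns G_{0:1} .. G_{0:5} from state 0, exactly as written out above
G = [
    p * (r[0] + v[1]) + (1 - p) * (r[1] + v[2]),
    p * (r[0] + r[2] + v[3]) + (1 - p) * (r[1] + r[3] + v[3]),
    p * (r[0] + r[2] + r[4] + v[4]) + (1 - p) * (r[1] + r[3] + r[4] + v[4]),
    p * (r[0] + r[2] + r[4] + r[5] + v[5]) + (1 - p) * (r[1] + r[3] + r[4] + r[5] + v[5]),
    p * (r[0] + r[2] + r[4] + r[5] + r[6] + v[6]) + (1 - p) * (r[1] + r[3] + r[4] + r[5] + r[6] + v[6]),
]

# Truncated lambda-return minus the TD(1) target
def residual(lam):
    return (1 - lam) * sum(lam ** n * G[n] for n in range(5)) - G1

print(brentq(residual, 0.0, 0.99))  # ~0.205029

With different inputs the root changes accordingly.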

EDIT

As pointed out by bewestphal, this is not a full solution; it misses one crucial step to be fully correct. A hint for that step can be found in his answer, which is the correct solution.

Brale
  • Values start from $v_1$ because the problem is to find the return of state 0, which does not include the value $v_0$. Rewards start from $r_0$ simply because the indices were set that way; if all indices were shifted by 1, we would have started with $r_1$. – Brale May 20 '19 at 13:54
  • As for the code, it would be as simple as rewriting these equations in some programming language. You would have two arrays, one for rewards `r = [-2.4, 9.6, -7.8, 0.1, 3.4, -2.1, 7.9]` and one for values `v = [0.0, 4.9, 7.8, -2.3, 25.5, -10.2, -6.5]`, and then you would write the equations; for example, $TD(1)$ would be `G = 0.64*(r[0] +r[2] + r[4] + r[5] + r[6]) + 0.36*(r[1] + r[3] + r[4] + r[5] + r[6])`, and the same logic would apply to the other equations. – Brale May 20 '19 at 13:54
  • How did you get a vector with 10 solutions? The equation is a 5th-order polynomial, so you should get 5 solutions, some of them probably complex numbers (see the short NumPy check after these comments). I used the Wolfram Alpha website to solve the equation in the answer; try that. – Brale May 21 '19 at 06:14
  • Also, you modified the numbers in your question; now my answer seems irrelevant, and people may wonder where I pulled these numbers from and how I concluded that this solution is what you needed to get. Please don't do that; rather, post another question. – Brale May 21 '19 at 06:24
  • I wouldn't really know, sorry – Brale May 21 '19 at 06:51
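
To illustrate the point about the 5th-order polynomial from the comment above, here is a small NumPy sketch (just an illustration) that expands $(1-\lambda)[7.864 - 5.336\lambda + 25.864\lambda^2 - 11.936\lambda^3 - 0.336\lambda^4] - 6.164$ and lists all five roots:

import numpy as np

# G_{0:1}..G_{0:5} and the TD(1) target from the answer above
G = [7.864, -5.336, 25.864, -11.936, -0.336]
G1 = 6.164

# Expand (1 - x) * (G[0] + G[1]*x + ... + G[4]*x**4) - G1 into a degree-5 polynomial
poly = np.polynomial.Polynomial(G) * np.polynomial.Polynomial([1, -1]) - G1

# Five roots in total, some of them complex; ~0.205029 is among the real ones
print(poly.roots())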
1

The previous answer from Brale is mostly correct, but it is missing a key detail needed to get the precise answer.

Given that this is a question from a GT course homework, I only want to leave pointers so that those seeking help can work out the required concept.

The $TD(\lambda)$ equation is a summation over infinitely many $k$-steps ($G_{0:1} \to G_{0:\infty}$), and all of these terms should be included in our equation $TD(\lambda) = TD(1)$.

Every $k$-step estimator that includes steps past the termination point will equal the sum of the rewards.

Including these values in the summation reveals a pattern that makes the infinite summation solvable.
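
As a pointer only (a general identity, not the final numbers): once every $k$-step return past termination collapses to the same sum-of-rewards value $G$, the tail of the infinite sum is a geometric series, \begin{equation} (1 - \lambda) \sum_{n=N}^{\infty} \lambda^{n-1} G = \lambda^{N-1} G, \end{equation} where $N$ is the first step index past the termination point. Folding this closed form back into $G_0^\lambda = TD(1)$ gives an equation that can be solved for $\lambda$.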

bewestphal
  • Thanks for the correction. I'm debating whether I should edit my answer to be fully correct or leave it as an exercise for the reader, since it's course homework. – Brale May 31 '19 at 06:55
  • @Brale_ I might suggest editing your answer only to reference (and link to) bewestphal's answer, and then formally accepting bewestphal's answer. I believe this would be most fair, and instructive to future readers. – DukeZhou May 31 '19 at 21:02