
I have a neural network that I want to use for self-play in Connect Four. The network receives the board state as input and outputs an estimate of that state's value.

For each move, I would then pick the move whose resulting state has the highest estimated value; occasionally I would pick one of the other moves instead, for exploration.
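For concreteness, this is roughly the selection rule I have in mind (just a minimal sketch; `value_net`, `legal_moves`, and `apply_move` are placeholders for my own network and Connect Four code, not an existing API):

```python
import random

EPSILON = 0.1  # probability of exploring instead of taking the greedy move

def select_move(board, value_net, legal_moves, apply_move):
    # Score each legal move by the value estimate of the state it leads to.
    scored = [(value_net(apply_move(board, m)), m) for m in legal_moves(board)]
    if random.random() < EPSILON:
        return random.choice(scored)[1]          # exploration: a random move
    return max(scored, key=lambda vm: vm[0])[1]  # greedy: highest estimated value
```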

I intend to use TD($\lambda$) to calculate the error for each state and backpropagate it through the network.
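As far as I understand it (and I may have this wrong), the quantity everything is built from is the one-step TD error

$$\delta_t = r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t),$$

where $\hat{V}$ is my network's estimate and the reward is zero until the game ends in a win/loss/draw.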

But I'm confused about when this should actually occur. Do I store the estimate of the state that was just used and calculate the error from the estimate of the next state chosen?

Or do I store a history of all states and backpropagate only when the game ends in a win/loss/draw?

I guess overall I'm not sure I understand when the update occurs, partly because I don't quite understand how to implement the $\lambda$. For example, if I were to backpropagate after every move, how would I even know what weight $\lambda$ gives to this time step before I know how long the game will last?
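To show what I mean, here is roughly what I imagine a per-move update would look like if eligibility traces are decayed by $\gamma\lambda$ after every move (so the game length never needs to be known in advance). This is only my guess at the structure, not something I'm sure is correct; `value_net`, `grad_of_value`, `apply_gradients`, and `env` are placeholders for my actual network and game code:

```python
ALPHA, GAMMA, LAMBDA = 0.01, 1.0, 0.8

def play_one_game(env, value_net, grad_of_value, apply_gradients):
    state = env.reset()
    traces = None  # eligibility traces, same shape as the network weights
    done = False
    while not done:
        move = select_move(state, value_net, env.legal_moves, env.apply_move)  # as sketched above
        next_state, reward, done = env.step(move)
        # TD error: bootstrap from the next state's estimate,
        # or use the final reward if the game just ended.
        target = reward if done else reward + GAMMA * value_net(next_state)
        delta = target - value_net(state)
        # Decay old traces by gamma * lambda and add the current gradient.
        g = grad_of_value(state)
        traces = g if traces is None else GAMMA * LAMBDA * traces + g
        # Apply the weight update immediately, after every move.
        apply_gradients(ALPHA * delta * traces)
        state = next_state
```

Is this per-move scheme what TD($\lambda$) implies, or is the whole game supposed to be replayed at the end?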

When self-playing, is the error the difference from that side's previous move? That is, do I compare move 1 against move 3, move 2 against move 4, and so on?

Are you using the forward view or the backward view of TD($\lambda$)? With the forward view you calculate at the end of an episode; with the backward view you calculate on each step, but you need to add eligibility traces. If you're not sure, can you point to the reference that you are using? – Neil Slater Nov 27 '17 at 11:28
