I have a neural network that I want to use to play Connect Four via self-play. The network receives the board state as input and outputs an estimate of that state's value.
For each move, I would then pick the move leading to the state with the highest estimated value, occasionally choosing one of the other moves for exploration.
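Concretely, the move selection I have in mind looks something like the sketch below (`net.predict`, `state.legal_moves()`, and `state.after(move)` are placeholder names for my own board/network code):

```python
import random

def choose_move(net, state, epsilon=0.1):
    """Pick the move leading to the state the network values highest;
    with probability epsilon, pick a random move for exploration."""
    moves = state.legal_moves()          # placeholder board API
    if random.random() < epsilon:
        return random.choice(moves)      # exploratory move
    return max(moves, key=lambda move: net.predict(state.after(move)))
```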
I intend to use TD($\lambda$) to calculate the errors for each state to backpropagate through the network.
But I'm confused about when this update should actually occur. Do I store the estimate of the current state and calculate the error from the estimate of the next state chosen?
Or do I store a history of all states and backpropagate only once the game ends in a win, loss, or draw?
Overall, I guess I'm not sure I understand when the update occurs, partly because I don't quite understand how to implement $\lambda$. If I were to backpropagate after every move, how would I even know the weight that $\lambda$ assigns to this time step before I know how long the game will last?
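To make my confusion concrete, here is roughly what I imagine the per-move (backward-view, eligibility-trace) version would look like, though I'm not sure it's right. Here `net.weights`, `net.gradient` (gradient of the output w.r.t. the weights), and `net.apply` (add a weight delta) are placeholders for my network code, and `env.step` returns a nonzero reward only when the game ends:

```python
import numpy as np

def td_lambda_game(net, env, alpha=0.01, lam=0.9, gamma=1.0):
    """One self-play game, updating the network after every move."""
    state = env.reset()
    trace = np.zeros_like(net.weights)   # eligibility trace, one entry per weight
    done = False
    while not done:
        next_state, reward, done = env.step(choose_move(net, state))
        v = net.predict(state)
        v_next = 0.0 if done else net.predict(next_state)
        delta = reward + gamma * v_next - v                # TD error for this step
        trace = gamma * lam * trace + net.gradient(state)  # decay, then accumulate
        net.apply(alpha * delta * trace)                   # immediate weight update
        state = next_state
```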
When self-playing, is the error the difference from that side's last move? I.e., do I compare move 1 against move 3, move 2 against move 4, etc.?