I have implemented an AI agent that plays checkers, based on the design described in the first chapter of Machine Learning (Tom Mitchell, McGraw Hill, 1997).
We train the agent by letting it play against itself.
I wrote the prediction function to estimate how good a board is for white, so when white plays it chooses the next board with the maximum predicted value, and when black plays it chooses the next board with the minimum predicted value.
I also let the agent explore other states by making it choose a random board among the valid next boards, with probability $0.1$.
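Roughly, the move selection looks like this (a minimal Python sketch; `predict` and `valid_next_boards` are placeholders for my own functions):

```python
import random

EPSILON = 0.1  # probability of picking a random successor, for exploration

def choose_next_board(board, is_white_turn, predict, valid_next_boards):
    """Pick the next board: white maximizes the predicted value, black minimizes it."""
    successors = valid_next_boards(board, is_white_turn)
    if random.random() < EPSILON:
        return random.choice(successors)      # exploratory random move
    if is_white_turn:
        return max(successors, key=predict)   # white wants the board best for white
    return min(successors, key=predict)       # black wants the board worst for white
```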
The final boards are given training values:
- $100$ if the final board is a win for white,
- $-100$ if the final board is a loss for white,
- $0$ if the final board is a draw.
The intermediate boards get a training value equal to the model's prediction for the next board where it is white's turn.
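In code, the target assignment works roughly like this (a sketch; `white_turn_boards` is assumed to be the sequence of boards from one self-play game where it was white's turn, in order):

```python
def training_values(white_turn_boards, final_outcome, predict):
    """Assign a training value to every board where it was white's turn.

    final_outcome is 100 (win for white), -100 (loss for white) or 0 (draw).
    Each intermediate board is labeled with the current model's prediction
    of the *next* board where it is again white's turn; the last board of
    the game gets the final outcome directly.
    """
    targets = []
    for i in range(len(white_turn_boards)):
        if i + 1 < len(white_turn_boards):
            targets.append(predict(white_turn_boards[i + 1]))
        else:
            targets.append(final_outcome)
    return targets
```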
The model is a linear combination of board features (see the book for the full description).
I start by initializing the parameters of the model to random values.
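Concretely, the evaluation function is the linear form from the book, with weights $w_0, \dots, w_6$ initialized randomly:

$$\hat{V}(b) = w_0 + w_1 x_1(b) + w_2 x_2(b) + \cdots + w_6 x_6(b),$$

where $x_1(b), \dots, x_6(b)$ are the board features (piece counts, king counts, and threatened pieces).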
When I train the agent, it always loses against itself or draws in a silly way, even though the training error converges to zero.
I thought that maybe the learning rate should be smaller (e.g. 1e-5), and when I do that the agent learns much better.
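For reference, the weight update is the LMS rule from the book, and this is roughly how I apply it (a sketch; the variable names are my own):

```python
LEARNING_RATE = 1e-5  # the smaller rate that made the agent learn better

def lms_update(w, features, target, eta=LEARNING_RATE):
    """One LMS step: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i."""
    prediction = w[0] + sum(wi * xi for wi, xi in zip(w[1:], features))
    error = target - prediction
    w[0] += eta * error                      # bias term, x_0 = 1
    for i, x in enumerate(features, start=1):
        w[i] += eta * error * x
    return w
```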
I think this happens because of the credit assignment problem: a good move may appear in a losing game and therefore be treated as a losing move, so white would never choose it again when it plays. But when the learning rate is very small, the appearance of a good move in a losing game changes its value only by a tiny amount, and since that good move should appear more often in winning games, its value converges to the right value.
Is my reasoning correct? If not, what is actually happening?