What I'm missing here is a way to direct the evaluation function towards actually winning. For example, a perfect evaluation function for a won position in chess would always return +1 without giving any hint of how to progress towards checkmate. In a chess variant without the fifty-move limit, it could play useless moves forever.
I guess this is a rather theoretical problem, as we won't ever have such a good function, but I wonder if there's a way to avoid it?
It certainly isn't just a theoretical problem; it can occur in practice too. For "large" games this problem won't occur in the early game, but it can start occurring in the later stages of a game, when terminal positions can actually be reached directly through exhaustive searches. It can also occur right from the beginning in extremely small / simple games, like Tic Tac Toe.
Addressing this is not (necessarily) just a matter of defining the evaluation function; it can also depend on which search or learning algorithm is used. So, I'll consider a few different cases.
Case 1: Minimax / Alpha-Beta / other similar "exhaustive" searches
When using Minimax / Alpha-Beta / other search algorithms based on those, the easiest solution to the problem you describe is to use iterative deepening. As soon as you prove a win for yourself at a certain depth level $d$ using iterative deepening, you can simply stop the search; don't check whether there are any other wins to be proven at depth $d + 1$, just play along the line you've just proven to be a winning line. This way, you will always go for the win in the lowest number of moves.
A related advantage of this approach is that, as soon as you prove a loss at depth $d$ for yourself, you can also cancel the search process and play the move that was best according to your evaluations at depth $d - 1$. This will often be your best chance to still grab a win in cases where your opponent fails to see their win at depth $d$.
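Below is a minimal, self-contained sketch of this iterative-deepening idea on Tic Tac Toe (mentioned above as a tiny game where exhaustive search reaches terminal positions). The `negamax` / `iterative_deepening` functions and the board representation are purely illustrative, not taken from any particular engine:

```python
# Sketch: iterative-deepening negamax on Tic Tac Toe.
# All names and the +1 / -1 / 0 scores are illustrative assumptions.
import math

WIN, LOSS, DRAW = 1, -1, 0
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def negamax(board, player, depth):
    """Return (score, move) from the perspective of `player`, searching `depth` plies."""
    w = winner(board)
    if w is not None:
        return (WIN if w == player else LOSS), None
    moves = [i for i, cell in enumerate(board) if cell == '.']
    if not moves:
        return DRAW, None
    if depth == 0:
        return DRAW, None  # crude evaluation: unresolved positions count as draws
    opponent = 'o' if player == 'x' else 'x'
    best_score, best_move = -math.inf, None
    for m in moves:
        board[m] = player
        score, _ = negamax(board, opponent, depth - 1)
        board[m] = '.'
        if -score > best_score:
            best_score, best_move = -score, m
    return best_score, best_move

def iterative_deepening(board, player, max_depth=9):
    best_move = None
    for depth in range(1, max_depth + 1):
        score, move = negamax(board, player, depth)
        if score == WIN:
            return move  # shallowest proven win: play it immediately
        if score == LOSS:
            # proven loss: keep the move that looked best at depth - 1
            return best_move if best_move is not None else move
        best_move = move  # remember the best move found at this depth
    return best_move

# 'x' to move with an immediate win on square 2; the depth-1 iteration already
# proves the win, so the search stops instead of considering longer winning lines.
board = list("xx..o.o..")
print(iterative_deepening(board, 'x'))  # -> 2
```

The loop returns as soon as the shallowest forced win is proven, and on a proven loss it falls back to the move chosen at depth $d - 1$, matching the two behaviours described above.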
Case 2: Monte-Carlo Tree Search / other searches with randomness
Monte-Carlo Tree Search is a well-known search algorithm that incorporates an element of randomness in its search. With these kinds of algorithms, the problem you describe tends not to be a real issue. Due to the randomness in the search, wins that can be achieved in a small number of moves tend to be evaluated better in practice than long-distance wins. With a long-distance win, there is a greater chance that the randomness in the search process causes an incorrect move to be played somewhere along the line, which reduces the evaluation of such a line of play.
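As a back-of-the-envelope illustration of that effect (with $p$ an assumed probability that a single random playout move stays on the winning line, not a quantity MCTS actually computes), the chance of a playout converting a win that still requires $k$ precise moves decays geometrically:

```python
# Illustrative only: if each random playout move stays "on track" with
# probability p, a win that is k correct moves away is converted with
# probability p**k, so short wins back up higher value estimates.
p = 0.5
for k in (1, 3, 10):
    print(k, p ** k)  # 1 -> 0.5, 3 -> 0.125, 10 -> ~0.001
```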
Case 3: (Reinforcement) Learning approaches
These approaches tend to involve some element of randomness due to the need for exploration in learning, which leads to similar reasoning as described for MCTS above. Also, in Reinforcement Learning, we typically use a discount factor $\gamma < 1.0$ (e.g. $\gamma = 0.99$), which causes distant rewards to be viewed as less important than close rewards, even if we don't do such discounting for the final evaluation of the performance of an algorithm. See, for example, a lot of the work on Atari games (DeepMind's DQN, etc.): algorithms are evaluated according to their undiscounted scores, but learning still uses a bit of discounting because, in practice, this is found to be beneficial for learning.