I'm doing a small tic-tac-toe project to learn neural networks and machine learning (beginner level). I've written an MLP-based program that plays against other search-based programs and trains on the data generated from those games.
The training and evaluation are strictly policy-based: inputs are board positions, and outputs are one-hot encoded arrays representing the recommended move for each position. I haven't added any search algorithms, so that I can see what to expect from a purely MLP approach.
The MLP has 35 input features and one hidden layer, and after a few hundred thousand games it has more or less learned to draw about 50% of its games. It has picked up the basics, like blocking the opponent's winning move and preferring good board placements.
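For reference, this is roughly the shape of the setup I mean (a simplified PyTorch sketch; the hidden size, optimizer, and dummy data here are illustrative, not my actual code):

```python
import torch
import torch.nn as nn

# Policy-only MLP: 35 board features in, a score for each of the 9 squares out.
# Hidden layer size is illustrative.
class PolicyMLP(nn.Module):
    def __init__(self, n_features=35, n_hidden=64, n_moves=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_moves),
        )

    def forward(self, x):
        return self.net(x)  # raw logits; softmax/argmax applied when picking a move

model = PolicyMLP()
loss_fn = nn.CrossEntropyLoss()  # target = index of the recommended move
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a dummy batch of (board features, recommended move) pairs
features = torch.randn(32, 35)         # stand-in for encoded board positions
targets = torch.randint(0, 9, (32,))   # stand-in for the one-hot move labels (as indices)
loss = loss_fn(model(features), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```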
Now, my question: it hasn't learned advanced strategies that require making a move that isn't the most immediately beneficial but improves its chances later (e.g. setting up a fork). But should I expect that from a strictly policy-based, no-search MLP? Since all it is trained on is a board position and the next recommended move (even if I have thousands of such pairs), is it reasonable to expect it to learn anything like lookahead, beyond "the best move for the current board"?
Put another way, is it possible at all for an MLP to learn lookahead without any search? If not, are there alternatives that can do it without search?