I'm doing a small tic-tac-toe project to learn neural networks and machine learning (beginner level). I've written an MLP-based program that plays against other search-based programs and trains on the data generated from those games.
The training and evaluation are strictly policy-based: inputs are board positions, and outputs are one-hot encoded arrays representing the recommended move for each position. I haven't added any search algorithms, so that I can see what to expect from a purely MLP approach.
The MLP has 35 input features and one hidden layer, and after a few hundred thousand games it has more or less learned to draw about 50% of its games. It has picked up the basics, like blocking the opponent's winning move and preferring good board placements.
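For reference, this is roughly the shape of the setup I mean (a simplified PyTorch sketch; the hidden size, optimizer, and dummy data here are illustrative, not my actual code):

```python
import torch
import torch.nn as nn

# Policy-only MLP: 35 board features in, a score for each of the 9 squares out.
# Hidden layer size is illustrative.
class PolicyMLP(nn.Module):
    def __init__(self, n_features=35, n_hidden=64, n_moves=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_moves),
        )

    def forward(self, x):
        return self.net(x)  # raw logits; softmax/argmax applied when picking a move

model = PolicyMLP()
loss_fn = nn.CrossEntropyLoss()  # target = index of the recommended move
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a dummy batch of (board features, recommended move) pairs
features = torch.randn(32, 35)         # stand-in for encoded board positions
targets = torch.randint(0, 9, (32,))   # stand-in for the one-hot move labels (as indices)
loss = loss_fn(model(features), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```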
Now, my question: it hasn't learned advanced strategies that require making a move that isn't the most immediately beneficial but improves its chances later (e.g. setting up a fork). But should I expect that from a strictly policy-based, no-search MLP? Since all it is trained on is a board position and the next recommended move (even if I have thousands of such pairs), is it reasonable to expect it to learn anything like lookahead, beyond "the best move for the current board"?
Put another way, is it possible at all for an MLP to learn lookahead without any search? If not, are there alternatives that can do it without search?