
I implemented the Q-learning algorithm to play tic-tac-toe. The AI plays against the same algorithm, but they don't share the same Q-matrix. After 200,000 games, I still beat the AI very easily, and it plays rather poorly. Moves are selected with an epsilon-greedy policy.

What could cause the AI not to learn?


Edit

Here is how I do it (pseudo code):

for(int i = 0; i < 200000; ++i){
    //Game is restarted here
    ticTacToe.play();
}

And in my ticTacToe I have a simple loop:

while(!isFinished()){
    swapPlaying(); //Change the players' turn
    Position toPlay = playing.whereToMove();

    applyPosition(toPlay);
    playing.update(toPlay);
}

//Here I just update my players according to whether they won, drew or lost.
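
Because the losing side's game always ends on the opponent's move, its last state-action pair only receives its -1 in this step, so it is worth making sure this update really reaches the loser's Q-table. Here is a minimal sketch of such a terminal update with a hypothetical map-based Q-table; QTable, terminalUpdate and the board strings are assumptions for illustration, not the code above.

#include <map>
#include <string>

// Hypothetical Q-table shape: state string -> (action index -> Q-value).
using QTable = std::map<std::string, std::map<int, double>>;

// Terminal update, called once per player when the game is over, with
// reward = +1 (win), 0 (draw) or -1 (loss). There is no next state,
// so the target is the reward alone (no gamma * maxQ term).
void terminalUpdate(QTable& qTable, const std::string& lastState,
                    int lastAction, double reward, double alpha = 0.1) {
    double& q = qTable[lastState][lastAction];
    q += alpha * (reward - q);
}

int main() {
    QTable qTable;
    // O faced "XX..O...." and played cell 6; X then faced "XX..O.O.."
    // and won by playing cell 2, so:
    terminalUpdate(qTable, "XX..O.O..", 2, +1.0);  // winner's last move
    terminalUpdate(qTable, "XX..O....", 6, -1.0);  // loser's last move
}

In the loop above, something like this would be called for both players once isFinished() returns true.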

In my players, I select the move with epsilon-greedy, implemented as below:

Moves moves = getMoves(); // Return every move available
Qvalues qValues = getQValues(moves); // return only qvalues of interest
//also create the state and add it to the Q-matrix if not already in.

if(!optimal) {
     updateEpsilon(); //I update epsilon as epsilon = 1/k, with k being the number of games played.
     double r = (double) rand() / RAND_MAX; // Random between 0 and 1
     if(r < epsilon) { //Exploration
         return randomMove(moves); // Selection of a random move among every move available.
     }
     else {
         return moveWithMaxQValue(qValues);
     }
} else { // If I'm not in the training part anymore
     return moveWithMaxQValue(qValues);
}
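
One thing worth checking at this point: with epsilon = 1/k and k counting games, exploration collapses almost immediately (epsilon is already 0.01 after 100 of the 200,000 games), so nearly all of the training exploits a mostly empty Q-table. The comments below suggest starting around 0.95 and decreasing gently; as an illustration, here is a minimal sketch of an exponential decay with a floor (the constants and the function name decayedEpsilon are assumptions, not values from the original code).

#include <algorithm>
#include <cmath>
#include <iostream>

// Hypothetical schedule: starts near epsilonStart and flattens out at
// epsilonMin instead of collapsing to ~0 after a handful of games.
double decayedEpsilon(int gamesPlayed,
                      double epsilonStart = 0.95,
                      double epsilonMin   = 0.05,
                      double decayRate    = 1e-4) {
    double eps = epsilonStart * std::exp(-decayRate * gamesPlayed);
    return std::max(eps, epsilonMin);
}

int main() {
    for (int k : {0, 100, 1000, 10000, 100000, 200000})
        std::cout << "epsilon(" << k << ") = " << decayedEpsilon(k) << "\n";
    // Compare with epsilon = 1.0 / k, which is already 0.01 at k = 100.
}

A schedule in this spirit keeps the agent exploring for a large fraction of the 200,000 games, which is one plausible explanation and fix for the observation in the comments that only about 3,000 states ever get visited.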

And I update with the following:

double reward = getReward(); // Return 1 if game won, -1 if game lost, 0 otherwise
double thisQ, maxQ, newQ;
Grid prevGrid = Grid(*grid); //I have a shared_ptr on the grid for simplicity
prevGrid.removeAt(position); // We remove the action executed before

string state = stateToString(prevGrid);
thisQ = qTable[state][action];
maxQ = maxQValues();

newQ = thisQ + alpha * (reward + gamma*maxQ - thisQ);
qTable[state][action] = newQ;

As mentioned above, both AIs use the same algorithm, but they are two distinct instances, so they don't share the same Q-matrix.

I read somewhere on Stack Overflow that I should take into account the opponent's move, but since I update a state after both my move and the opponent's move, I don't think it's necessary.
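
For what it's worth, that Stack Overflow advice is usually implemented as a delayed update: remember the (state, action) you just played, let the opponent reply, and only then update, using the board you now face as the next state. Below is a minimal sketch of that non-terminal step, with the same hypothetical map-based Q-table as in the earlier sketch (delayedUpdate, maxQ and the board strings are illustrative names, not your code).

#include <algorithm>
#include <map>
#include <string>

// Hypothetical Q-table shape: state string -> (action index -> Q-value).
using QTable = std::map<std::string, std::map<int, double>>;

// Max over the Q-values stored for 'state'; actions that were never tried
// are implicitly 0, so 0 acts as a floor here.
double maxQ(QTable& qTable, const std::string& state) {
    double best = 0.0;
    for (const auto& entry : qTable[state])
        best = std::max(best, entry.second);
    return best;
}

// Non-terminal update for one player: 'state' and 'action' are what the
// player did on its previous turn, 'stateAfterOpponent' is the board it
// faces now, after the opponent's reply. The reward is 0 because the game
// did not end in between.
void delayedUpdate(QTable& qTable,
                   const std::string& state, int action,
                   const std::string& stateAfterOpponent,
                   double alpha = 0.1, double gamma = 0.9) {
    double& q = qTable[state][action];
    q += alpha * (0.0 + gamma * maxQ(qTable, stateAfterOpponent) - q);
}

int main() {
    QTable qTable;
    // X played cell 4 from the empty board, O replied with cell 0;
    // X now updates its (empty board, 4) entry against the board it faces.
    delayedUpdate(qTable, ".........", 4, "O...X....");
}

Each player then only has to remember its last (state, action) pair between turns; the end-of-game case is the terminal update sketched after the game loop above. If prevGrid (the grid with your own last action removed) is the board before your move and maxQValues() is taken over the board after the opponent's reply, your update is effectively doing the same thing.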

  • Are you sure of the implementation? Tic-tac-toe has a very limited set of alternative scenarios, and Q-learning can even "memorize" the best game policy after a small number of iterations. – Alireza Apr 12 '17 at 18:37
  • Well, I'm not very sure, but when it plays against me, it learns how to avoid losing against my strategy. It still needs a couple of training sessions against me to learn properly, though. I thought I could have a self-trained program with both AIs running a Q-learning algorithm. I figured something was wrong in the training part rather than the implementation. Maybe I could train it against a minimax algorithm? – Irindul Apr 16 '17 at 15:07
  • I guess minimax would not be a good opponent for your algorithm; minimax always plays the best move, so your QL algorithm would always lose. And QL learns from losing and winning, not only losing. Self-trained QLs should work. Check your explore-exploit ratio: it should start from around 0.95 and decrease over time. – Alireza Apr 16 '17 at 20:07
  • I actually have a fixed ratio; I will try to make it decrease over time. – Irindul Apr 18 '17 at 09:25
  • I've checked, and after 200,000 games the AI has visited only approximately 3,000 states. That means there is a problem in my implementation, doesn't it? – Irindul Apr 18 '17 at 14:11
  • Maybe it's useful to add your pseudo-code to your question. I guess you could use a sigmoid function instead of a linear function for the explore/exploit ratio, to remain longer in the exploring behavior and experience more states, then rapidly pass the transient phase and switch to exploit mode, if this is the problem. – Alireza Apr 19 '17 at 04:43
  • I edited my post as asked. I didn't try the sigmoid; I'll see if it changes something, but I'm starting to believe my implementation is somewhat wrong. – Irindul Apr 19 '17 at 10:45
  • @Alireza Minimax should still work in this case. Since tic-tac-toe is very limited, a random player should still draw 2%-3% of games against an optimal minimax player when going second, and considerably more when going first. – user12889 Oct 30 '17 at 03:34
  • @user12889 As there are three outcomes (win, lose, tie), your algorithm should experience all three; losing and drawing alone will not help. In most cases, letting RL algorithms play against themselves is a good strategy, since the strength is then equal between the two sides. – Alireza Oct 30 '17 at 11:37
  • @Alireza Good point. Learning by playing against itself, however, has a bit of a reputation for being quite slow and often not working well at all (although in some cases, e.g. Google's AlphaGo, it seems to work exceedingly well). Maybe select randomly between a range of possible opponents... – user12889 Oct 31 '17 at 22:10

1 Answer


I couldn't add a comment because of my low reputation, but you can check this question about the size of the state space: https://math.stackexchange.com/questions/485752/tictactoe-state-space-choose-calculation
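
If you want to check that figure against your own state encoding, a brute-force enumeration of the reachable boards is short to write. Here is a self-contained sketch; the 9-character board string with '.' for empty squares is an assumption for illustration, not the asker's representation.

#include <iostream>
#include <set>
#include <string>

// True if the player using mark 'c' has three in a row on board 'b'.
bool hasWon(const std::string& b, char c) {
    static const int lines[8][3] = {{0,1,2},{3,4,5},{6,7,8},
                                    {0,3,6},{1,4,7},{2,5,8},
                                    {0,4,8},{2,4,6}};
    for (const auto& l : lines)
        if (b[l[0]] == c && b[l[1]] == c && b[l[2]] == c)
            return true;
    return false;
}

// Depth-first enumeration of every board reachable by legal play,
// stopping a line of play as soon as the game is over.
void explore(const std::string& board, char toMove, std::set<std::string>& seen) {
    if (!seen.insert(board).second) return;                // already counted
    if (hasWon(board, 'X') || hasWon(board, 'O')) return;  // someone won
    if (board.find('.') == std::string::npos) return;      // board full, draw
    for (int i = 0; i < 9; ++i) {
        if (board[i] != '.') continue;
        std::string next = board;
        next[i] = toMove;
        explore(next, toMove == 'X' ? 'O' : 'X', seen);
    }
}

int main() {
    std::set<std::string> seen;
    explore(".........", 'X', seen);  // empty board, X to move
    std::cout << "Reachable boards (terminal positions included): "
              << seen.size() << "\n"; // commonly cited figure: 5478
}

Comparing that count with the number of distinct keys in the Q-table after training gives a quick sanity check on how much of the state space is actually being visited.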

  • So there are actually around 5,000 states, but after a quick correction to my program, I only got 300 states visited... – Irindul Apr 19 '17 at 14:01
  • I didn't go through your code, but maybe you should try annealing epsilon less at every iteration. – Molnár István Apr 19 '17 at 16:00
  • I'd recommend checking out this article (http://brianshourd.com/posts/2012-11-06-tilt-number-of-tic-tac-toe-boards.html) on the number of states in tic-tac-toe. From my understanding of it, a reinforcement learning (Q-learning) AI program would need to know that there are 593 board states. – Daniel G Dec 28 '17 at 00:43
  • @DanielG From what I know about Q-learning, you can start with an empty Q-matrix and just grow it as the bot learns. – Alexus Mar 12 '18 at 23:23
  • This is an old question/answer, but could you edit this answer to provide more details? – nbro Apr 08 '22 at 10:00