What is the optimal score for Tic Tac Toe for a reinforcement learning agent against a random opponent?

Question

I guess this problem is encountered by everyone trying to solve Tic Tac Toe with various flavors of reinforcement learning.

The answer is not "always win" because the random opponent may sometimes be able to draw the game. So it is slightly less than the always-win score.

I wrote a little Python program to calculate that. Please help verify its correctness and inform me if it has bugs or errors.

@Taw Yes, currently. I may make it an option in the program. — Yan King Yin, Sep 04 '21 at 08:26

score 2 · Answer 1 · answered Sep 04 '21 at 07:20

2

This blog post suggests that when playing against a random opponent, if the agent goes first, the win rate is 97.8%, and if they go second, the win rate is 79.6% (and the rest are draws).

answered Sep 04 '21 at 07:20

Taw

1,161
3
10

Your numbers may be compatible with my answer. I will modify my program to calculate the win/lose rate also.... – Yan King Yin Sep 04 '21 at 08:30
Using your numbers, if agent goes first, 20 * 97.8% + 10 * 2.2% = 19.78 which is different from my answer = 19.9479... – Yan King Yin Sep 04 '21 at 08:39
1

@YanKingYin There may be a difference between perfect play and maximising play against a specific opponent, especially as you have incentivised a draw differently to the usually assumed zero-sum game in your code. Perhaps re-run your code with draw scoring 0? – Neil Slater Sep 04 '21 at 10:37
@NeilSlater: That would be 20 * 97.8% + 0 = 19.56, still different from my program's result = 19.8958333... – Yan King Yin Sep 05 '21 at 09:32
By the way, that blog article's numbers are from "experimental" statistics, so it may be inaccurate. – Yan King Yin Sep 05 '21 at 09:41
@YanKingYin: You need to re-run the experiment if the rewards change, because the optimal policy is different. Again it is also important whether this anwswer references perfect play or optimal play, because **they can be different**. Your experiment approximates optimal play for a given reward structure and given opponent. The blog appears to be using perfect play for its "expert" player. – Neil Slater Sep 05 '21 at 12:56
@NeilSlater: Yes, I reran the experiment with new scoring scheme. What is the difference between "perfect" and "optimal" play in your opinion? Can you explain with an example? – Yan King Yin Sep 07 '21 at 01:55
@YanKingYin Too much for a comment section. If it is of enough interest to you, then please ask a new question on the site. – Neil Slater Sep 07 '21 at 06:19
@NeilSlater: I believe that choosing the maximum expectation value is the same as "perfect play". I have never heard of anything to the contrary. – Yan King Yin Sep 08 '21 at 13:50
1

@YanKingYin They are guaranteed the same when your opponent also plays perfectly (related fact - this is a Nash equilibrium, neither player can benefit from changing their policy). However, against a random opponent, the maximum reward policy and perfect play policy will be different in some states. that is because perfect play is always defensive against assumed perfect play by the opponent, it is about "not making mistakes". If you want to see proof, ask on the site. – Neil Slater Sep 08 '21 at 14:17
1

As a simpler example, you could construct an opponent P2 which was almost perfect, but reliably made a losing mistake in a specific reachable state S, that would normally be a losing state for P1. It is trivial to see that maximising play from P1 would attempt to reach this state, whilst perfect play would do the opposite and avoid it. The perfect player would expect a draw at best, but the maximising player may expect to win. – Neil Slater Sep 08 '21 at 14:25
Hmm.. very interesting. I'm starting to see your point. So if optimal play is maximizing expectation value, how would you define "perfect" play? What kind of objective function does it maximize? – Yan King Yin Sep 11 '21 at 03:41
@NeilSlater: Yes, I should add a question on this site. Hold on.... – Yan King Yin Sep 11 '21 at 06:29

Yan King Yin · Answer 2 · 2021-09-05T09:56:05.560

Here I asssume:

Both players avoid illegal moves perfectly
Player X always chooses the move with maximum expectation value
Player O chooses all available moves with equal probability

Result depends on the scoring scheme:

(This scheme is used in one version of Gym Tic Tac Toe)

For win=20, draw=10, lose=-20:

Optimal expectation value =
- X plays first: 19.94791666666666...
- O plays first: 19.164021164021...
For win=20, draw=0, lose=-20:

Optimal expectation value =
- X plays first: 19.89583333333333...
- O plays first: 18.497354497354....

It also helps to verify the program with some pre-played board positions, included in the code.

Here is the program:

import math     # for math.inf = infinity

print("Calculate optimal expectation value of TicTacToe")
print("from the perspective of 'X' = first player.")
print("Assume both players perfectly avoid illegal moves.")
print("Player 'X' always chooses the move with maximum expectation value.")
print("Player 'O' always plays all available moves with equal probability.")
print("You may modify the initial board position in the code.")

# Empty board
test_board = 9 * [0]

# Pre-moves, if any are desired:
# X|O|
# O|O|X
# X| |
#test_board[0] = -1
#test_board[3] = 1
#test_board[6] = -1
#test_board[4] = 1
#test_board[5] = -1
#test_board[1] = 1

def show_board(board):
    for i in [0, 3, 6]:
        for j in range(3):
            x = board[i + j]
            if x == -1:
                c = '❌'
            elif x == 1:
                c = '⭕'
            else:
                c = '  '
            print(c, end='')
        print(end='\n')

if test_board != 9 * [0]:
    print("\nInitial board position:")
    show_board(test_board)

# **** Calculate expectation value of input board position
def expectation(board, player):

    if player == -1:
        # **** Find all possible next moves for player 'X'
        moves = possible_moves(board)

        # Calculate expectation of these moves;
        # Player 'X' will only choose the one of maximum value.
        max_v = - math.inf
        for m in moves:
            new_board = board.copy()
            new_board[m] = -1       # Player 'X'

            # If this an ending move?
            r = game_over(new_board, -1)
            if r is not None:
                if r > max_v:
                    max_v = r
            else:
                v = expectation(new_board, 1)
                if v > max_v:
                    max_v = v
        # show_board(board)
        print("X's turn.  Expectation w.r.t. Player X =", max_v, end='\r')
        return max_v

    elif player == 1:
        # **** Find all possible next moves for player 'O'
        moves = possible_moves(board)
        # These moves have equal probability
        # print(board, moves)
        p = 1.0 / len(moves)

        # Calculate expectation of these moves;
        # Player 'O' chooses one of them randomly.
        Rx = 0.0        # reward from the perspective of 'X'
        for m in moves:
            new_board = board.copy()
            new_board[m] = 1        # Player 'O'

            # If this an ending move?
            r = game_over(new_board, 1)
            if r is not None:
                if r == 10:             # draw is +10 for either player
                    Rx += r * p
                else:
                    Rx += - r * p       # sign of reward is reversed
            else:
                v = expectation(new_board, -1)
                Rx += v * p
        # show_board(board)
        print("O's turn.  Expectation w.r.t. Player X =", Rx, end='\r')
        return Rx

def possible_moves(board):
    moves = []
    for i in range(9):
        if board[i] == 0:
            moves.append(i)
    return moves

# Check only for the given player.
# Return reward w.r.t. the specific player.
def game_over(board, player):
    # check horizontal
    for i in [0, 3, 6]:     # for each row
        if board[i + 0] == player and \
           board[i + 1] == player and \
           board[i + 2] == player:
            return 20

    # check vertical
    for j in [0, 1, 2]:     # for each column
        if board[3 * 0 + j] == player and \
           board[3 * 1 + j] == player and \
           board[3 * 2 + j] == player:
            return 20

    # check diagonal
    if board[0 + 0] == player and \
       board[3 * 1 + 1] == player and \
       board[3 * 2 + 2] == player:
        return 20

    # check backward diagonal
    if board[3 * 0 + 2] == player and \
       board[3 * 1 + 1] == player and \
       board[3 * 2 + 0] == player:
        return 20

    # return None if game still open
    for i in [0, 3, 6]:
        for j in [0, 1, 2]:
            if board[i + j] == 0:
                return None

    # For one version of gym TicTacToe, draw = 10 regardless of player;
    # Another way is to assign draw = 0.
    return 10

print("\u001b[2K\nOptimal value =", expectation(test_board, -1) )

Example output (for X's turn to play):

There is either a typo in your post, or bug in your code, because if win = 10, draw = 0, lose = -10, it's impossible that the expectation is 19.90. — Taw, Sep 04 '21 at 07:14

score 0 · Answer 3 · answered Jun 13 '23 at 09:23

Actually, you can directly calculate the optimal odds to win in TicTacToe against a random opponent and it turns out that you would win 191 out of 192 games and draw the last one when beginning.

For that -unintuitively- you have to start in a corner, lets say top left:

O . .
. . .
. . .

Then one can see that the only way to draw for the second player is to put the X in the middle (so chance for a random opponent: 1/8). If he doesn't, O can force a win, e.g. after

O X . 
. . .
. . .

O can put the marker in the middle, forcing X on the bottom right and then putting an O on the middle left, creating 2 possibilities for a tic-tac-toe.

One can easily also check the other options for X to go wrong and will confirm that only an X in the middle will hold a draw, but to keep the answer brief I will not proof it here.

Then, we can put our next "O" in the top middle, leading to

O O .
. X .
. . .

where the second player needs to put an "X" in the top right (chance 1/6 for a random opponent)

Then we place our "O" in the bottom left to simultaneously deny "X"'s thread and make a thread on our own:

O O X
. X .
O . .

So at this point a random opponent has a 1/4 chance to save the game. After that, we have to play middle right to draw, as we have no more chance to get any Tic-Tac-Toe even with cooperative play from X.

The chance to win is therefore 1 - 1/8*1/6*1/4 = 191/192.

Remark: One can easily see that this is the best strategy for a win. The total number of moves for "X" are 384 (8 on turn one, then 6, then 4, then 2), out of which with this strategy only 2 ways draw. When "O" starts in the middle, with perfect play all 4 corners can draw for "X". Analogously one can justify every subsequent decision for "O".

What is the optimal score for Tic Tac Toe for a reinforcement learning agent against a random opponent?

3 Answers3

Linked