5

A deterministic policy in rock-paper-scissors can easily be exploited by the opponent: they just have to play the sequence of moves that beats the agent's. More often than not, I've heard that a random policy is optimal in this case - but the argument seems a little informal.

Could someone please expound on this, possibly adding more mathematical detail and intuition? I guess the case I'm referring to is a game between two RL agents, but I'd be happy to learn about other cases too. Thanks!

EDIT: When would a random policy be optimal in this case?

stoic-santiago
  • A deterministic policy can be the optimal policy. It depends entirely on whether your environment is actively learning to adapt or not. I believe the formal argument requires finding a Nash equilibrium for each player, but constructing such an equilibrium requires some prior assumptions (if I recall correctly) about the information each player possesses about the others, so I'll not write an answer. But that's the general direction; you can read more about it. –  Aug 27 '20 at 11:12
  • Interesting - if it's two RL agents playing against each other, a deterministic policy is definitely not optimal, right? RL agents can adapt actively! – stoic-santiago Aug 27 '20 at 13:48
  • True... but we don't know. An RL policy can be deterministic if you want; there's no steadfast rule. –  Aug 27 '20 at 13:56

1 Answer

6

For this, we will need game theory.

In game theory, an optimal strategy is one that cannot be exploited by the opponent even if they know your strategy.

Let's say you want a strategy where your move selection is not based on what happened before (so you are not trying to model your opponent, trick them into believing you will always play scissors and then throw them off, or anything like that). A strategy will look like $(P, S, R)$, where $P, S, R \in [0, 1]$ and $P+S+R = 1$: you select paper with probability $P$, scissors with probability $S$, and rock with probability $R$. Now, if your probabilities are a bit uneven (for example $(0.5, 0.2, 0.3)$), an opponent can exploit that strategy. If your opponent plays with probabilities $(p, s, r)$, their expected reward (counting +1 for a win, -1 for a loss, 0 for a draw) is $0.5(s - r) + 0.2(r - p) + 0.3(p - s) = 0.1p + 0.2s - 0.3r$. To maximize their winnings, they would play scissors all the time against you, for an expected reward of $0.2$ per game - a distinct advantage over you.
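As a quick sanity check, here is a minimal Python sketch of that calculation (not part of the original argument; the payoff matrix and function name are just for illustration):

```python
# Moves are indexed 0 = paper, 1 = scissors, 2 = rock.
# PAYOFF[i][j] is the opponent's reward when you play i and they play j:
# scissors cuts paper, rock crushes scissors, paper covers rock.
PAYOFF = [
    [0, 1, -1],   # you play paper:    their scissors wins, their rock loses
    [-1, 0, 1],   # you play scissors: their paper loses, their rock wins
    [1, -1, 0],   # you play rock:     their paper wins, their scissors loses
]

def opponent_expected_reward(you, opp):
    """Opponent's expected reward given both mixed strategies."""
    return sum(you[i] * opp[j] * PAYOFF[i][j]
               for i in range(3) for j in range(3))

you = (0.5, 0.2, 0.3)  # (paper, scissors, rock) from the example above
for name, opp in [("paper", (1, 0, 0)),
                  ("scissors", (0, 1, 0)),
                  ("rock", (0, 0, 1))]:
    print(name, round(opponent_expected_reward(you, opp), 3))
# paper 0.1, scissors 0.2, rock -0.3: always playing scissors is the
# opponent's best response, worth +0.2 per game on average.
```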

In general, for a strategy $(P, S, R)$ for you and $(p, s, r)$ for your opponent, your opponent's expected winnings are $P(s - r) + S(r - p) + R(p - s) = p(R-S) + s(P-R) + r(S - P)$. If all the partial derivatives of this expression with respect to $p$, $s$ and $r$ are 0, the opponent's expected winnings are the same (namely 0) no matter what they play, so they have no incentive to prefer any move over any other. This occurs exactly when $P = S = R = \frac13$.
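If you want to verify that last step symbolically, here is a small sketch (assuming sympy is available; this is just a check, not part of the argument):

```python
from sympy import solve, symbols

P, S, R = symbols("P S R")
# Opponent's expected winnings: p*(R - S) + s*(P - R) + r*(S - P).
# Indifference means every coefficient vanishes, subject to P + S + R = 1.
print(solve([R - S, P - R, S - P, P + S + R - 1], [P, S, R]))
# {P: 1/3, R: 1/3, S: 1/3}
```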

That's basically how to approach game theory: find a strategy such that your opponent has no incentive to choose one action over another. The approach seems a bit counter-intuitive at first (you're reasoning about the optimal strategy for your opponent instead of your own), but it works for many similar problems.
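Putting the two together, a short check (same payoff convention and helper as in the first snippet) shows the opponent really is indifferent against the uniform strategy; since the expected reward is linear in $(p, s, r)$, indifference against the three pure strategies implies indifference against every mixed strategy:

```python
# Same conventions as before: 0 = paper, 1 = scissors, 2 = rock.
PAYOFF = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]

def opponent_expected_reward(you, opp):
    return sum(you[i] * opp[j] * PAYOFF[i][j]
               for i in range(3) for j in range(3))

uniform = (1/3, 1/3, 1/3)
for name, opp in [("paper", (1, 0, 0)),
                  ("scissors", (0, 1, 0)),
                  ("rock", (0, 0, 1))]:
    print(name, opponent_expected_reward(uniform, opp))
# paper 0.0, scissors 0.0, rock 0.0: no move gives the opponent an edge.
```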

  • 3
    Although I agree with this answer, it makes the assumption that we know the opponent's strategy. I think in the RL setting this will be false, and we will get certain upper/lower bounds based on the number of samples taken. –  Aug 27 '20 at 17:01
  • @DuttaA This answer doesn't make that assumption anywhere. In fact, the "choose uniformly at random" strategy will work the same against any kind of strategy. – Helena Aug 27 '20 at 20:40
  • @Helena It did when it calculated the expected reward; the expression is known only when the probabilities are known. –  Aug 27 '20 at 20:43
  • @DuttaA The expression uses unknown variables for the opponent's probabilities (p, s, r). It only uses constant values for its own example strategy (0.5, 0.2, 0.3). – Helena Aug 27 '20 at 20:47
  • @Helena The example is supposed to illustrate that such a thing is possible, when in reality, if the probabilities are very close to uniform, detecting that requires a huge number of samples - not to mention the difficulty if we have a non-stationary strategy (in the RL setting). –  Aug 27 '20 at 20:49
  • Or maybe you are correct... maybe my point doesn't stand in the way of proving such a thing. I'm not aware of the details. –  Aug 27 '20 at 20:53
  • 4
    @DuttaA You are right; in reality you don't know each other's strategy. Because you cannot know what kind of opponent you will get, the best thing you can do is to assume the strongest possible foe. – Helena Aug 27 '20 at 21:00
  • @Helena Assuming a strong foe can make you underperform. You should maximize your expected reward using a probability distribution over the opponents you expect to face. E.g., when playing with children, you should not use random play. – HappyFace Nov 29 '21 at 09:25