
I recently came across this function:

$$\sum_{t = 0}^{\infty} \gamma^t R_t.$$

It's elegant and looks useful for the type of deterministic, perfect-information, finite models I'm working with.
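For concreteness, here's a minimal sketch of how I understand that return to be computed over a finite episode (the function and the reward list are purely illustrative, not from any library I'm using):

```python
# Minimal sketch of the discounted return for a finite episode.
# `rewards` is an illustrative list of per-step rewards R_t.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * R_t over the episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three steps of reward 1 each: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.9))
```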

However, it occurs to me that using $\gamma^t$ in this manner might be seen as somewhat arbitrary.

Specifically, the objective is to discount for the added uncertainty/variance of the "temporal distance" between the present gamestate and any potential gamestate being evaluated. But that variance would seem to be a function of the branching factor present in a given state, and of the sum of the branching factors in the states leading up to the evaluated state.

  • Are there any defined discount factors based on the branching factor of a given, evaluated node, or on the number of branches in the nodes leading to it?

If not, I'd welcome thoughts on how this might be applied.

An initial thought is to divide 1 by the number of branches and add that value to the goodness of a given state (a technique I'm already using for heuristic tie-breaking with no look-ahead), but that's a "value-add" as opposed to a discount.
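As a rough sketch of that tie-breaking idea (with `heuristic` and `legal_moves` as hypothetical stand-ins for my evaluator and move generator):

```python
# Hypothetical sketch of the tie-breaking "value-add": 1/branches is added
# to the raw heuristic value, so states that tie on the heuristic are
# ranked in favour of narrower (fewer-branch) states.

def tie_broken_score(state, heuristic, legal_moves):
    return heuristic(state) + 1.0 / len(legal_moves(state))
```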


For context, this is for a form of partisan Sudoku, where an expressed position $p_x$ (value, coordinates) typically removes some number of potential positions $p$ from the gameboard. (Without the addition of an element displacement mechanic, the number of branches can never increase.)

On a $(3^2)^2$ Sudoku, the first $p_x$ removes $30$ out of $729$ potential positions $p$, including itself.

With each $p_x$, the number of branches diminishes until the game collapses into a tractable state, allowing for perfect play in endgames. [Even there, a discounting function may have some utility, because outcomes are sets of ratios. Where the macro metric is territorial (controlled regions at the end of play), the most meaningful metric may ultimately be "efficiency" (loosely, "points_expended to regions_controlled"), which acknowledges a benefit to expending the fewest points $p_x$, even in a tractable endgame where the ratio of controlled regions cannot be altered. Additionally, zugzwangs are possible in the endgame, and in that case reversing the discount to maximize branches may have utility.]

$(3^2)^2 = (3 \times 3)(3 \times 3)$, i.e. "9x9", but the exponent form is preferred so as not to restrict the number of dimensions.

nbro
DukeZhou

2 Answers


First, an important note on any form of discounting: adding a discount factor can change what the optimal policy is; the optimal policy with a discount factor present can differ from the optimal policy without one. This means that "artificially" adding a discount factor is harmful if we expect to be capable of learning / finding an optimal policy. Generally, though, we don't expect to be capable of doing that, except in the case where we have an infinite amount of processing time (i.e. never in practice). In the other answer you linked to in your question, I describe that discounting can still be useful, that it may help to find a good (not optimal, just good) policy more quickly, but it does not come "for free".
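As a small worked example of that effect (numbers chosen purely for illustration): suppose one policy collects a single reward of $9$ immediately, and another collects a single reward of $10$ after ten steps. Undiscounted, the second policy is optimal, but with $\gamma = 0.9$ we get

$$9 \cdot \gamma^0 = 9 > 10 \cdot \gamma^{10} \approx 3.49,$$

so the discounted-optimal policy is the first one.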


I'm not aware of any research evaluating ideas like the one in your question. I'm not 100% sure why; I suspect it could be an interesting idea in some situations, but it would have to be investigated carefully due to my point above: if not evaluated properly, it could also unexpectedly be harmful.

One thing to note is that the use of discount factors $\gamma < 1$ is extremely common (basically ubiquitous) in Reinforcement Learning (RL) literature, but rather rare in the literature on tree search algorithms like MCTS (though not non-existent; for example, it's used in the original UCT paper from 2006). For the concept of "branching factors", we have the opposite: in RL literature it is very common to have the same action space regardless of state (a "constant branching factor"), whereas this is very uncommon in the literature on tree search algorithms. So the combination of discount factors + branching factors is actually somewhat rare in the existing literature (which of course doesn't mean that the idea couldn't work or be relevant; it just might explain why the idea doesn't appear to have been properly investigated yet).

One important concern I do have with your idea is that it seems like it could act as an "anti-heuristic" in some situations (by which I mean a heuristic that is detrimental to performance). In many games, it is advantageous to be in game states where you have many available moves (a large branching factor); it can mean that you are in a strong position. Consider, for example, chess, where a player in a convincingly winning position likely has more moves available than their opponent. I suspect your idea, when applied to chess, would simply promote an aggressive playing style in which both players capture as many pieces as possible in an effort to quickly reduce the branching factors across the entire tree.


If not, I'd welcome thoughts on how this might be applied.

(I might divide 1 by the number of branches and add that value to the goodness of a given state, but that's a "value-add" as opposed to a discount.)

Such an additive change would be more closely related to the idea of reward shaping than to discounting (which, again, can significantly alter the task you're optimizing for if not done carefully). Intuitively, I also suspect it might not do anything at all, since you'd always be adding the same constant value regardless of which move you took (whichever move you take, your parent state will always have had the same branching factor). I might be missing some details here, but I think you'd need a multiplicative effect on your regular observed rewards.
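To make that concrete: if every child of state $s$ receives the same additive bonus $\frac{1}{b(s)}$, where $b(s)$ is the branching factor of the parent, then the ranking of the moves is unchanged, since

$$\arg\max_{a} \left[ Q(s, a) + \frac{1}{b(s)} \right] = \arg\max_{a} Q(s, a),$$

because the bonus does not depend on the chosen move $a$.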

I suppose one example that could be worth a try would be maximizing the following return:

$$\sum_{t = 0}^{T} \gamma^t \beta^{b(S_{t}) - 1} R_{t + 1},$$

where:

  • $0 \leq \gamma \leq 1$ is the normal time-based discount factor (can set it to $1$ if you like)
  • $0 \leq \beta \leq 1$ is a new branching-factor-based discount factor
  • $b(S_t)$ is the branching factor (the number of available moves) in state $S_t$
  • $R_{t + 1}$ is the immediate reward for transitioning from state $S_t$ to $S_{t + 1}$

I put the $-1$ in the power of $\beta$ because I suspect you wouldn't want to do any discounting in situations where only one move is available. Intuitively, I do suspect this $\beta$ would require very careful tuning, though. It is already quite common to choose $\gamma = 0.9$ or $\gamma = 0.99$; with $\beta$ you may want to stay even closer to $1$.
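If it helps, here is a minimal sketch of that return as code; the trajectory format (a list of (branching factor, reward) pairs) is just an assumption for illustration, not from any existing library:

```python
# Minimal sketch of the proposed return: each reward R_{t+1} is discounted
# both by gamma^t (time) and by beta^(b(S_t) - 1), where b(S_t) is the
# branching factor of the state the move was taken from.

def branch_discounted_return(trajectory, gamma=1.0, beta=0.99):
    total = 0.0
    for t, (branching_factor, reward) in enumerate(trajectory):
        total += (gamma ** t) * (beta ** (branching_factor - 1)) * reward
    return total

# Example: three moves, taken from states with 4, 2 and 1 legal moves,
# each yielding reward 1: 0.99^3 + 0.99^1 + 0.99^0 = 2.960299
print(branch_discounted_return([(4, 1.0), (2, 1.0), (1, 1.0)]))
```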

Dennis Soemers
  • Thanks for this--context, analysis *and* formula. I'm going to take some time to work through it. *(Reward shaping is also something I'm working on for this model, which involves an array of competing metrics that naturally arise out of the game structure, including efficiency and multiple types of regional stability.)* re: discount; because the game is "natively finite" (unlike Chess and Go, where special conditions are required) I've been trying to think of positions not just in terms of their scoring value, but in terms of what they "take off the board" in terms of potential responses. – DukeZhou Aug 20 '18 at 16:18
  • 1
    @DukeZhou Note that the formula I wrote there (with the $\beta$) is just pretty much the first one I came up with. It's maybe (highly likely) not the best thing possible, just an idea. It only takes into account the "local" branching factor (of the parent node), not all previously encountered branching factors. Maybe a change that would cause all encountered branching factors to... accumulate in some sense may be better. Using large numbers for powers can explode really quickly though, so gotta be careful there too, maybe something where branching factors are not used as powers is safer. – Dennis Soemers Aug 20 '18 at 16:43
  • 1
    I do think you'll want something multiplicative though, not additive. – Dennis Soemers Aug 20 '18 at 16:43
  • 1
    *(you're 100% right about the multiplicative--that was the embarrassing-in-retrospect part of the question revealing my inexperience on the maths side;)* The 1/branches came out of a method for tie-breaking in a purely heuristic evaluation with no lookahead. Where multiple positions yield a benefit of +R controlled regions, a secondary metric may be used for tie-breaking. The idea is to test such methods against other tie-breakers such as efficiency and Monte Carlo. The project goal at this point is not max strength but a set of automata of ascending strengths for vs. human play. – DukeZhou Aug 20 '18 at 16:51
  • Noted that the formula is just off the top of your head. Regardless, it's extraordinarily helpful to have it, even as a starting point. – DukeZhou Aug 20 '18 at 16:53
  • 1
    @DukeZhou "Tie-breaking" is actually a very good way to view what we ideally accomplish with discounting. Ideally, we don't change the optimal policy (too much), ideally we just break ties in favour of shorter/more reliable paths. I suppose it's more common to think of tie-breaking as something we do... "horizontally", among all successors of a single parent. Discounting is more like "vertical tie-breaking", breaking ties among states that are at different depth levels in a tree (in the case of time-based discounting). – Dennis Soemers Aug 20 '18 at 16:56
  • 1
    PS- thanks for continuing to reinforce the fundamental concepts and pitfalls. It occurs to me that in 2-player partisan Sudoku with an even number of regions, where the 2nd player is presumed to be disadvantaged, the 2nd player may want to pursue nodes with more branches because it increases uncertainty, "leveling the playing field" by diminishing the presumed advantage of the starting player (greater likelihood of mistakes). Doesn't get around the potential to be anti-heuristic, but opens up the possibility of pursuing the max, min, or mean number of branches, dependent on perceived status. – DukeZhou Aug 21 '18 at 20:02
  • here's a question I hope is meaningful--is the gamma strictly necessary if we use the beta? (this gets back to the idea that temporal distance alone is potentially arbitrary, whereas branches at least reflect the board state. *One of the benefits of the partisan sudoku model over previous combinatorial games seems to be that every element, state, etc. can be precisely quantified.*) – DukeZhou Aug 22 '18 at 17:55
  • 1
    @DukeZhou Nope, not strictly necessary if you're guaranteed not to have infinite-duration episodes. Setting $\gamma = 1$ is equivalent to leaving it out though (since $1$ raised to any power is still $1$). So, might as well leave it in, treat it as a hyperparameter, set it to $1$ first, maybe try to tune it later if you have time. – Dennis Soemers Aug 22 '18 at 18:08
  • Main query is answered, the issue is clarified and broken down with context, and an example is provided with detailed explanation. Couldn't ask for more! *(PS- [I took a stab at describing a possible approach and rationale for deriving the beta](https://ai.stackexchange.com/a/7692/1671). Please feel free to comment:)* – DukeZhou Aug 23 '18 at 20:56

Thanks to Dennis Soemers for helping to explain and unpack. I'm still in the early stages of approaching the function, so any thoughts on my current line of thinking would be appreciated.

-----------------------------------------

There are two conditions that need to be separated: the number of branches leading to a given node (raw probability) and the number of branches leading from a given node (outdegree, which I like to think of as "chaotictivity";)

Raw probability here is the indegree of all nodes leading to the evaluated node.

Aggregate indegree seems appropriate for a basic time-discounting function because it reflects the unqualified probability of any given future state or position.

My thought is that I can use it to "normalize" the reward distributions. An expressed position $p_{x_1}$ yields an aggregate of $30R$ over 20 potential positions $p$; $p_{x_2}$ yields an aggregate of $15R$ over 10 $p$: $$\frac{30}{20} = \frac{15}{10}$$

This "squeezes" the R sum for a given ply into a single number.

The fractions below represent $\frac{\text{choice}}{\text{choices}}$ at each ply:

$$\left \{\frac{1}{4} \right \}$$ $$\left \{ \frac{1}{2} | \frac{1}{2} | \frac{1}{2} |\frac{1}{2} \right \}$$ $$\left \{ \frac{1}{1} | \frac{1}{1} \right \} | \left \{ \frac{1}{1} | \frac{1}{1} \right \} | \left \{ \frac{1}{1} | \frac{1}{1} \right \} | \left \{ \frac{1}{1} | \frac{1}{1} \right \}$$

The $\beta$ for any given node on the final ply would be: $$\frac{1}{4} \cdot \frac{1}{2} = \frac{1}{8} = .125$$

Fractions are attractive because integers may be utilized until the $\beta$ needs to be expressed to however many digits, and the distinct cardinalities of the factors are maintained for ancillary evaluation.
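A minimal sketch of that computation with exact fractions (where `branching_factors` is just the list of branch counts along the path from the root to the evaluated node):

```python
from fractions import Fraction

# Minimal sketch: the beta for a node as the product of 1/branches along
# the path from the root, kept as an exact fraction so the distinct
# factors survive until a decimal is actually needed.

def path_beta(branching_factors):
    beta = Fraction(1)
    for b in branching_factors:
        beta *= Fraction(1, b)
    return beta

# The worked example above: a root with 4 branches, then a node with 2.
beta = path_beta([4, 2])
print(beta, float(beta))  # 1/8 0.125
```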

This function should allow the automata to do things like make the choice with the greatest certainty of the smallest maximal potential downside.

(If greater uncertainty is desirable, the automata can pursue more chaotic nodes where the aggregate least maximal downside is equivalent. Since "personalities" can be desirable in automata that play games with humans, there might be a maximax "gambler" persona that seeks chaos;)

The function can also be used to evaluate individual target nodes, and modified by any number of additional functions for "tuning".

I'm thinking about how I might use the degree of variance between the most probable and least probable nodes for a given ply. If the least probable node is $\frac{1}{20}$ and the most probable node for a given ply is $\frac{1}{5} = \frac{4}{20}$, the variance (here, simply the spread) is $\frac{3}{20} = .15$.

This is initially designed for partisan sudoku. I'm thinking it is worthwhile to utilize the native $R$, which is an integer value in $\{-m^n, \dots, 0, \dots, m^n\}$ for an $(m^n)^n$ sudoku grid, as it segregates gains and losses.

-------------------------------

EXPERIMENT DESIGN

Partisan sudoku is an attractive model for gauging the relative strength of automata because outcomes are a set of ratios.

In the 2-player game where the number of regions in the sudoku is even, the game is perfectly symmetrical and neither player has a material, positional, or turn number advantage. Because the game is intractable on higher order grids, perfectly symmetrical games will not necessarily result in perfect ties because optimality of a given position is only presumed. (The second player can mirror the starting player for a perfect tie, where all outcome ratios are equal, or employ symmetry breaking if it perceives an opponent's position as less optimal than alternate positions.)

In 2-player games with an odd number of regions, matched sets of games may be employed with the automata alternating as starting player.

My initial plan is to evaluate individual heuristics, where one agent employs the branch discount $\beta$ and the other does not.
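A rough sketch of that evaluation loop, with `play_game(first, second)` as a hypothetical stand-in for the real engine (assumed to return the outcome ratios for the first and second player of one game):

```python
# Rough sketch of the matched-set evaluation: the automata alternate as
# starting player so any first-move advantage is shared equally.
# `play_game(first, second)` is a hypothetical stand-in for the engine.

def matched_set(agent_a, agent_b, play_game, n_pairs=50):
    score_a = score_b = 0.0
    for _ in range(n_pairs):
        first, second = play_game(agent_a, agent_b)  # A starts
        score_a, score_b = score_a + first, score_b + second
        first, second = play_game(agent_b, agent_a)  # B starts
        score_b, score_a = score_b + first, score_a + second
    games = 2 * n_pairs
    return score_a / games, score_b / games
```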

DukeZhou