
I was working with the FrozenLake 4x4 environment from OpenAI Gym. In the slippery case, using a discount factor of 1, my value iteration implementation was giving a success rate of around 75%. It was much worse for the 8x8 grid, with success around 50%. I thought that in the non-slippery case it should definitely give me 100% success, but it turned out to be 0%.

After trying to understand what was happening by going through the algorithm on paper (below), I found that all the values were the same, so greedifying with respect to the value function often just had the agent going around in circles. Reducing the discount factor from 1 to anything below 1 gave me 100% success. But going to anything below 0.9 gave me worse results in the slippery case.

Can someone give me some intuition as to why this happened and how to interpret and choose the discount factor for any use case?

Algorithm used (where $\theta = 0.0000001$):

[Image: value iteration pseudocode]
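For concreteness, here is a minimal sketch of the kind of value iteration loop described above (not my exact code, but the same structure; it assumes the transition model is read from `env.P`, which the FrozenLake environments expose):

```python
import gym
import numpy as np

def value_iteration(env, gamma=1.0, theta=0.0000001):
    """Tabular value iteration using the model stored in env.P."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Backup: V(s) <- max_a sum_{s'} p(s'|s,a) * (r + gamma * V(s'))
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward, done in env.P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(q.max() - V[s]))
            V[s] = q.max()
        if delta < theta:
            break
    return V

env = gym.make("FrozenLake-v1", is_slippery=False)
V = value_iteration(env, gamma=1.0)
print(V.reshape(4, 4))  # with gamma = 1, every safe state ends up with value 1
```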

  • Could you please reference the exact algorithm you used - e.g. tabular Q-learning - plus any other hyperparameters that you set? The learning rate especially is going to interact with the discount rate to determine stability. – Neil Slater May 20 '22 at 07:03
  • Hello @NeilSlater, I have updated the question with the algorithm used, and I only set one additional parameter, theta = 0.0000001 – ketan dhanuka May 20 '22 at 09:03
  • Value iteration requires a model of the environment - i.e. the state transition probabilities. Gym doesn't give you those via the normal API - how are you getting them? – Neil Slater May 20 '22 at 09:09
  • I get them from the API itself. For this particular environment the API gives the state transition probabilities – ketan dhanuka May 20 '22 at 09:20
  • Do you mean the info block returned *after* taking the action in the `.step` method? How are you iterating the states? – Neil Slater May 20 '22 at 09:28
  • No, I am not using the step method. It is an attribute of the environment object itself. Calling env.P[s][a] returns the probability of transitioning to state s_dash from s after taking action a, along with the reward. The states are just indexed from 0 to 15, so I am iterating over them using a simple for loop. – ketan dhanuka May 20 '22 at 09:35
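For context, the attribute described in the comment above can be inspected directly; a minimal example (the exact numbers depend on the map and the `is_slippery` setting):

```python
import gym

env = gym.make("FrozenLake-v1", is_slippery=True)
# env.P[s][a] is a list of (probability, next_state, reward, done) tuples.
# On the slippery 4x4 map, action 0 (LEFT) in state 0 can slip perpendicular
# to the intended direction, giving three equally likely outcomes:
print(env.P[0][0])
# e.g. [(0.333..., 0, 0.0, False), (0.333..., 0, 0.0, False), (0.333..., 4, 0.0, False)]
```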

1 Answer


After trying to understand what was happening by going through the algorithm on paper (below), I found that all the values were the same, so greedifying with respect to the value function often just had the agent going around in circles. Reducing the discount factor from 1 to anything below 1 gave me 100% success.

This is expected (and optimal, as defined) behaviour with a discount factor of 1 in the deterministic case. With a reward of 1 at the end, no discounting, and no negative rewards for taking its time, the agent has infinite time to complete the task. So whatever it does, provided it does not step into a hole at any point, is optimal.
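Concretely, with the only non-zero reward being $+1$ for reaching the goal and $\gamma = 1$, the Bellman optimality backup

$$v_*(s) = \max_a \sum_{s'} p(s' \mid s, a)\,\big[r(s, a, s') + \gamma\, v_*(s')\big]$$

gives $v_*(s) = 1$ for every non-terminal state from which the goal is still reachable, because a detour costs nothing. With every safe state valued identically, the greedy policy has no reason to prefer the action that actually moves toward the goal.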

The caveat is that when following the policy as a test, you need to resolve ties with a random choice (the theory behind value iteration assumes the agent can reach the high-value states). Most simple argmax functions will not do that, so you will need to add a shuffle or some other random selection over the tied maximum values to ensure you are really following the policy that value iteration discovered the value function for.
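As a sketch of what that could look like (assuming a value table `V` indexed by state and the `env.P` model; the helper name is just for illustration):

```python
import numpy as np

def greedy_action(env, V, s, gamma=1.0, rng=np.random.default_rng()):
    """Greedy action with respect to V, breaking ties uniformly at random."""
    q = np.zeros(env.action_space.n)
    for a in range(env.action_space.n):
        for prob, s_next, reward, done in env.P[s][a]:
            q[a] += prob * (reward + gamma * V[s_next])
    tied_best = np.flatnonzero(np.isclose(q, q.max()))  # indices of all tied maxima
    return rng.choice(tied_best)
```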

The last sentence in the pseudocode ignores the possibility of this kind of situation, so it is not strictly true in all cases. That is, you cannot reliably build a working deterministic policy directly from the optimal value function when $\gamma=1$ in environments like Frozen Lake.

But going to anything below 0.9 gave me worse results in the slippery case.

I am less sure about what is happening in this case. Given that you successfully found an optimal value function in the deterministic environment, this may actually be optimal too. Whether or not it is may depend on the map: an optimal agent with a low discount factor may prefer not to "run a gauntlet" early, with its risk of failure, even if the overall probability of success is better. That may cause it to prefer a safer start that leads to a tougher end, where it is forced to take more risk overall than would be optimal for the goal of completing the most episodes.
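As a rough illustration of why maximising discounted return need not maximise success rate (the numbers are made up, not taken from the FrozenLake map): with a goal-only reward of $+1$, a route that succeeds with probability $p$ after roughly $n$ steps is worth about $p\,\gamma^{\,n-1}$ from the start state. With $\gamma = 0.8$, a risky 6-step route with $p = 0.5$ is worth about $0.5 \times 0.8^{5} \approx 0.16$, while a safer 14-step route with $p = 0.9$ is worth only about $0.9 \times 0.8^{13} \approx 0.05$. So a heavily discounted optimum can favour routes that fail more often.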

how to interpret and choose the discount factor for any use case?

The discount factor is not a free choice for the agent, and in value iteration it is not a hyperparameter of the learning method. Instead, it is part of the definition of what is and isn't optimal. So you should select it based on the problem definition.

In an episodic environment where the agent's goal is to reach a certain state or states as fast as possible, there are two "natural" formulations:

  • Zero reward for most transitions, "goal" rewards for reaching certain terminal states, and a discount factor below $1$. I think your choice of $0.9$ would be fine, and it is commonly used in the literature for this kind of toy environment.

  • Negative reward for most transitions, "goal" rewards for reaching certain terminal states. If there is a single "target" state, such as exiting a maze, the "goal" reward can be zero. Usually no discounting, but you could add discounting if it represents something else about the problem.
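To see how both formulations penalise taking the scenic route: with a goal reward of $+1$ and discount $\gamma < 1$, an episode that reaches the goal after $n$ steps returns $G = \gamma^{\,n-1}$, which shrinks as $n$ grows, while with a reward of $-1$ per step and no discounting the same episode returns $G = -n$. Both are maximised by the shortest safe path, but only the second still distinguishes fast from slow routes at $\gamma = 1$, which is why the undiscounted formulation needs the step penalty.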

In more complex systems, the discount factor may also become a hyperparameter of the agent. For instance, it does so when using function approximation with temporal difference learning, because it can dampen positive feedback from bootstrapping. In those cases, where there would "naturally" be no discounting, you may want to use the highest value that prevents the estimator from diverging, e.g. $\gamma = 0.99$.

Neil Slater