If an Atari game's rewards can range between $-100$ and $100$, when can we say an agent has learned to play the game? Should it get a reward very close to $100$ on every instance of the game? Or is it fine if it gets a low score (say $-100$) on some instances? In other words, if we plot the agent's score versus the number of episodes, what should the plot look like? From this plot, when can we say the agent is not stable on this task?
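For concreteness, here is a minimal sketch of the kind of plot I mean, assuming the per-episode total rewards have already been collected in a list (the function name `plot_learning_curve` and the window size of 100 are just placeholder choices of mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curve(episode_rewards, window=100):
    """Plot raw per-episode rewards plus a rolling mean and std band."""
    rewards = np.asarray(episode_rewards, dtype=float)
    episodes = np.arange(len(rewards))

    # Rolling statistics over (up to) the last `window` episodes.
    means = np.array([rewards[max(0, i - window + 1): i + 1].mean()
                      for i in episodes])
    stds = np.array([rewards[max(0, i - window + 1): i + 1].std()
                     for i in episodes])

    plt.plot(episodes, rewards, alpha=0.3, label="per-episode reward")
    plt.plot(episodes, means, label=f"rolling mean ({window} episodes)")
    plt.fill_between(episodes, means - stds, means + stds, alpha=0.2,
                     label="rolling mean ± 1 std")
    plt.xlabel("episode")
    plt.ylabel("total reward")
    plt.legend()
    plt.show()
```

Is it the rolling mean, the spread of the band, or something else entirely that I should be looking at to judge whether the agent is stable?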
- The rewards don't have to define limits of optimality at all (e.g., think of a task with 20 time steps, each of which could earn the maximum reward, so you might be interested in a total reward of up to 2000, but the environment's structure could make the true limit much lower). Any chance you could focus on a specific example where you want to understand optimality? I don't think there is a general answer here. – Neil Slater Jun 06 '18 at 07:19
- @NeilSlater I added more details. I specifically want to know about the case where the task is an Atari game. – user491626 Jun 06 '18 at 20:13