
I am implementing A3C for the CartPole environment, and I want to compare its results with the ones I got from AC1. The problem is that I don't know which process to look at. If I use, say, 11 processes, should I take the first one that reaches an average of 495 points (over the last 100 episodes), the last one, or the mean of all of them?

I don't want to take the first one that reaches 495, since it is using a model that was already updated by the first few processes, which looks like cheating. Is there an established convention I can follow to get valid results?
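To make the three options concrete, here is a minimal sketch of how each summary could be computed from per-process reward logs. The function names `episodes_to_solve` and `summarize_workers` are hypothetical, the list-of-lists input format is an assumption, and so is the choice to require a full 100-episode window before checking the threshold. It only illustrates the candidates; it does not say which one is the accepted norm.

```python
from collections import deque
from statistics import mean


def episodes_to_solve(rewards, threshold=495.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    rewards first reaches `threshold`, or None if it never does.

    Assumption: the window must be full before the check counts.
    """
    recent = deque(maxlen=window)
    for episode, reward in enumerate(rewards, start=1):
        recent.append(reward)
        if len(recent) == window and mean(recent) >= threshold:
            return episode
    return None


def summarize_workers(per_worker_rewards, threshold=495.0, window=100):
    """Summarize episodes-to-solve across processes (hypothetical helper).

    `per_worker_rewards` is assumed to hold one list of episode rewards
    per process. Processes that never reach the threshold are skipped.
    """
    solved = [episodes_to_solve(r, threshold, window) for r in per_worker_rewards]
    solved = [s for s in solved if s is not None]
    if not solved:
        return None
    return {
        "first": min(solved),   # fastest process
        "last": max(solved),    # slowest process
        "mean": mean(solved),   # average over processes
    }


if __name__ == "__main__":
    # Tiny synthetic logs with a short window, purely for illustration.
    logs = [
        [500, 500, 500, 500],
        [0, 0, 500, 500, 500, 500],
        [0, 0, 0, 0, 500, 500, 500, 500],
    ]
    print(summarize_workers(logs, threshold=495.0, window=4))
```

Note that each process counts only its own episodes here, so a fair comparison with a single-process run might instead sum the episodes played by all processes up to that point.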

[Chart: mean reward over the last 100 episodes (x axis, 0 to 500) against number of episodes (y axis), one line per process]

  • Hi Leon, and welcome to AI Stack Exchange! Wow, interesting question! Right now, I'm not 100% sure what your question is asking, and I want to be sure before I try to give an answer. For starters, could you please explain the chart in a little more detail? For instance, what exactly do each of the lines represent? Also, I'm used to Cartpole only yielding rewards up to 200. Could you please explain how your mean_reward dependent variable allows for values over 200? Thank you! – DeepQZero May 24 '21 at 15:46
  • Hello, thank you! I think you are using CartPole-v0, which has a maximum reward of 200; I am using CartPole-v1, which has a maximum reward of 500. The chart shows the mean reward (over the last 100 episodes), from 0 to 500, on the x axis and the number of episodes on the y axis. So the lower the number of episodes, the better the algorithm. – Leon Jovanovic May 24 '21 at 16:02
  • That helps; thank you! I have two more questions. What algorithm is AC1? I haven't seen that initialism before. Could you spell out the name or provide a link to a source? Also, I'm still trying to determine exactly what the A3C lines in the graph represent. From my understanding, A3C uses multiple processes to gather data in separate environments. Is each line the average reward from the corresponding process? – DeepQZero May 24 '21 at 16:27
  • AC1 is vanilla Actor-Critic. I added n-step returns, entropy, and advantage, but it is still a single-process Actor-Critic. Yes, each line is one process and its mean reward over the last 100 games. – Leon Jovanovic May 24 '21 at 16:37
