My objective is to test out a new algorithm that I designed. However, I am not sure whether my methodology for training the networks is correct.
I am just concerned about the training loops:
In the first algorithm (DIAYN, a SAC-based algorithm), training follows this high-level pseudocode (a rough Python sketch of how I read it follows the list):
Run for N-STEPS:
1. Run for around 5000 warm-up steps.
2. Add those transitions to the replay buffer.
3. After the 5K warm-up, at each step choose an action using the policy.
4. Step in the env, collect reward, next_obs, etc., and add the transition to the replay buffer.
5. Update the networks by sampling from the replay buffer with a batch size of 1024.
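To make it concrete, here is a minimal Python sketch of how I read this loop. `ReplayBuffer`, `env`, `policy` and `sac_update` are placeholder names for illustration, not the actual DIAYN/SAC implementation, and I am assuming random actions during the warm-up since the policy is only used after 5k steps:

```python
# Sketch of the algo_1 loop; all names below are placeholders.
WARMUP_STEPS = 5_000
N_STEPS = 1_000_000          # illustrative value
BATCH_SIZE = 1024

buffer = ReplayBuffer(capacity=1_000_000)
obs = env.reset()

for step in range(N_STEPS):
    if step < WARMUP_STEPS:
        action = env.action_space.sample()   # warm-up: e.g. random actions
    else:
        action = policy.act(obs)             # after 5k steps, act with the policy

    next_obs, reward, done = env.step(action)
    buffer.add(obs, action, reward, next_obs, done)
    obs = env.reset() if done else next_obs

    if step >= WARMUP_STEPS:
        batch = buffer.sample(BATCH_SIZE)    # one gradient update per env step
        sac_update(policy, batch)
```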
In the new algorithm, I update the same networks, but in a different manner (which is required for the algorithm to do some other things). A rough sketch follows the list below.
Run for n epochs:
1. Collect 1000 samples of next_obs, reward, etc. by choosing actions from the policy, and add them to the replay buffer.
2. Then run some algorithm (this is the new addition) on the replay buffer.
3. Run a training loop that performs 1000 updates of the networks with a batch size of 128.
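Here is the same kind of sketch for algo_2. Again, `collect_rollout`, `my_new_algo` and `sac_update` are hypothetical placeholder names, not the real code:

```python
# Sketch of the algo_2 loop; all names below are placeholders.
N_EPOCHS = 100               # illustrative value
SAMPLES_PER_EPOCH = 1_000
UPDATES_PER_EPOCH = 1_000
BATCH_SIZE = 128

buffer = ReplayBuffer(capacity=1_000_000)

for epoch in range(N_EPOCHS):
    # 1. Collect 1000 transitions with the current policy (frozen for this epoch).
    for transition in collect_rollout(env, policy, SAMPLES_PER_EPOCH):
        buffer.add(*transition)

    # 2. Run the new algorithm on the replay buffer (the novel part).
    my_new_algo(buffer)

    # 3. 1000 gradient updates sampled from the buffer, batch size 128.
    for _ in range(UPDATES_PER_EPOCH):
        batch = buffer.sample(BATCH_SIZE)
        sac_update(policy, batch)
```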
As you can see, in algo_1 the batch size is 1024 and we collect a new sample after each update, whereas in algo_2 we update the networks 1000 times with replay buffer samples and only then collect new samples.
However, in algo_2 we collect 1000 new samples at once. In algo_1, only one sample is added to the replay buffer after each update, so each new data point is generated by a freshly updated policy. In algo_2, 1000 samples are generated using a policy that was updated 1k times on the old replay buffer contents.
My question is this: if I want to establish a baseline using algo_1 and claim that my algo_2 is better because it does X better, can I do so, provided I make sure that N-STEPS in algo_1 equals epochs * 1k_training_loop in algo_2?
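For reference, this is the bookkeeping I have in mind when matching the two budgets (the numbers are purely illustrative, not my actual settings):

```python
# Hypothetical budget matching between algo_1 and algo_2.
N_EPOCHS = 100
UPDATES_PER_EPOCH = 1_000
SAMPLES_PER_EPOCH = 1_000

algo_2_updates   = N_EPOCHS * UPDATES_PER_EPOCH    # 100_000 gradient updates
algo_2_env_steps = N_EPOCHS * SAMPLES_PER_EPOCH    # 100_000 environment steps

# algo_1 does one update per env step (ignoring the ~5k warm-up steps,
# during which it does not update), so setting N_STEPS this way roughly
# matches gradient updates and env steps, but NOT the batch size
# (1024 vs 128) or the freshness of the data each update sees.
N_STEPS = algo_2_updates
```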
I apologise for not making this post succinct.