
My objective is to test a new algorithm that I designed. However, I am unsure whether my methodology for training the networks is correct.

I am just concerned about the training loops:

In the first algorithm (DIAYN, a SAC-based algorithm), the training loop follows this high-level pseudocode (a rough code sketch follows the list):

Run for N-STEPS:
   1. Run for around 5000 warm-up steps
   2. Add the transitions to the replay buffer
   3. After the 5K warm-up, at each step choose the action using the policy
   4. Step in the env, collect reward, next_obs, ...
   5. Update the networks by sampling from the replay buffer, with a batch size of 1024
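
For concreteness, here is a rough sketch of what I mean. The names `env`, `policy`, `replay_buffer`, and `update_networks` are placeholders for my actual components, and the gym-style API is just for illustration:

```python
# algo_1 (SAC/DIAYN-style): one environment step, then one gradient update.
WARMUP_STEPS = 5_000
N_STEPS = 1_000_000
BATCH_SIZE = 1024

obs = env.reset()
for step in range(N_STEPS):
    if step < WARMUP_STEPS:
        action = env.action_space.sample()      # warm-up: random actions
    else:
        action = policy.select_action(obs)      # after warm-up: use the policy

    next_obs, reward, done, info = env.step(action)
    replay_buffer.add(obs, action, reward, next_obs, done)
    obs = env.reset() if done else next_obs

    # one gradient update per environment step (after warm-up)
    if step >= WARMUP_STEPS:
        batch = replay_buffer.sample(BATCH_SIZE)
        update_networks(batch)
```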

In the new algorithm, I update the same networks, but in a different manner (which is required for the algorithm to do some other things). Its high-level pseudocode is (a rough code sketch follows the list):

Run for n epochs:
   1. Collect 1000 samples of (next_obs, reward, ...) by choosing actions from the policy, and add them to the replay buffer.
   2. Then run some procedure (this is the new addition) on the replay buffer.
   3. Run a training loop of 1000 iterations, each updating the networks with a batch size of 128.
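
Again, a rough sketch for concreteness. Here `run_new_procedure` stands in for the new addition I mentioned, and the other names are placeholders as above:

```python
# algo_2: collect a block of samples, process the buffer, then do many updates.
N_EPOCHS = 1_000
SAMPLES_PER_EPOCH = 1_000
UPDATES_PER_EPOCH = 1_000
BATCH_SIZE = 128

obs = env.reset()
for epoch in range(N_EPOCHS):
    # 1. collect 1000 transitions with the current policy (fixed during collection)
    for _ in range(SAMPLES_PER_EPOCH):
        action = policy.select_action(obs)
        next_obs, reward, done, info = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

    # 2. the new addition: some processing over the replay buffer
    run_new_procedure(replay_buffer)

    # 3. 1000 gradient updates with batch size 128
    for _ in range(UPDATES_PER_EPOCH):
        batch = replay_buffer.sample(BATCH_SIZE)
        update_networks(batch)
```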

As you can see, in algo_1 the batch size is 1024, and we collect a new sample after each update. Whereas in algo_2 we update the networks 1000 times with replay-buffer samples, and only then collect new samples.

However, in algo_2 we then collect 1000 new samples at once. In algo_1, only one sample is added to the replay_buffer after each update, so each new data point is generated from a freshly updated policy. In algo_2, all 1000 samples are generated using the policy that has already been updated 1k times on the old contents of the replay_buffer.

My question is this: if I want to establish a baseline using algo_1 and claim that algo_2 is better because it does X better, can I do so, provided I make sure that N-STEPS in algo_1 equals epochs * 1k (the number of training updates) in algo_2?
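
To be explicit about what I mean by "equal", this is roughly how I would check that the budgets match (the numbers are illustrative, not my actual settings):

```python
# Sanity check: compare environment-step and gradient-update budgets.
N_STEPS = 1_000_000            # algo_1
WARMUP_STEPS = 5_000

N_EPOCHS = 1_000               # algo_2
SAMPLES_PER_EPOCH = 1_000
UPDATES_PER_EPOCH = 1_000

algo1_env_steps = N_STEPS
algo1_updates = N_STEPS - WARMUP_STEPS   # one update per step after warm-up
algo1_batch_size = 1024

algo2_env_steps = N_EPOCHS * SAMPLES_PER_EPOCH
algo2_updates = N_EPOCHS * UPDATES_PER_EPOCH
algo2_batch_size = 128

print("env steps:      ", algo1_env_steps, "vs", algo2_env_steps)
print("gradient updates:", algo1_updates, "vs", algo2_updates)
print("batch sizes:     ", algo1_batch_size, "vs", algo2_batch_size)
```

Note that even with equal environment steps and equal numbers of updates, the batch sizes (1024 vs 128) mean the two runs consume different amounts of data per update, which is part of why I am unsure the comparison is fair.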

I apologise for not making this post succinct.

Yash_Bit
  • Can you please put your **specific question** in the title? "Training Loop for Soft Actor-Critic Algorithms" is not a question and it's also not very specific. – nbro Jul 03 '22 at 20:25
  • Thank you for your input. I hope the details are sufficient now. – Yash_Bit Jul 03 '22 at 22:01

0 Answers