
I am comparing my deep neural network (DNN) performance when using two types of optimizers: gradient-based Adam (properly tuned) and a population-based optimization algorithm (e.g., a genetic algorithm (GA), PSO, etc.). My training dataset contains >=100,000 samples.

Observation: The GA-trained DNN obtains better accuracy than the Adam-trained DNN, but takes more iterations (and hence longer wall-clock time). To perform only slightly better, the GA took 3,000 iterations with a population size of 100, whereas Adam took only 100 epochs! The performance gain does not sit right with me given the much longer training time, but before concluding anything I want to check whether my understanding and implementation, described below, are correct.

Implementation and problem: When running Adam for 100 epochs I use a batch size of 32. With GA training, however, I don't know how to apply the concept of a batch size; I just use the usual initialize-a-population-and-iterate style. Hence, for each candidate in the population, all 100,000 samples are traversed before the loss (objective function) is computed; then the same for the second candidate, and so on until all 100 candidates are done. And that is just one iteration! I am sure this is what inflates the training time. Is there another way to implement this kind of population-based optimizer here?
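For what it's worth, here is a minimal sketch of how a batch size *can* be used with a population-based optimizer: draw one mini-batch per generation and score the whole population on that batch, rather than on the full dataset. Everything below is a hypothetical stand-in (a 5-weight linear model and 1,000 samples instead of the real DNN and 100,000 samples); the GA step is the simplest possible truncation-selection-plus-mutation, not any particular library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real setup: a linear model instead of the DNN,
# and 1,000 samples instead of 100,000. The weight vector plays the role of
# the flattened DNN parameters that each GA candidate encodes.
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def fitness(candidate, idx):
    """MSE of one candidate, evaluated only on the rows in idx (a mini-batch)."""
    return np.mean((X[idx] @ candidate - y[idx]) ** 2)

pop_size, batch_size = 20, 32
population = rng.normal(size=(pop_size, X.shape[1]))
full = np.arange(len(X))
initial_best = min(fitness(c, full) for c in population)

for _ in range(200):
    # One mini-batch per generation, shared by the whole population so the
    # fitness values stay comparable across candidates.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    scores = np.array([fitness(c, idx) for c in population])
    # Minimal GA step: keep the better half, mutate it to refill the population.
    elite = population[np.argsort(scores)[: pop_size // 2]]
    population = np.vstack([elite, elite + 0.05 * rng.normal(size=elite.shape)])

final_best = min(fitness(c, full) for c in population)
```

The per-generation cost drops from `pop_size * N` sample evaluations to `pop_size * batch_size`. The fitness values become noisy estimates, exactly as SGD's loss is a noisy estimate, so selection decisions can occasionally be wrong; sharing one batch across the population and/or averaging fitness over a few recent batches are common ways to keep that noise manageable.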

I feel that if I can find a way to reduce the training time, I can more easily run experiments with these different population-based optimizers, since I see potential for better predictions with them.

  • Have you tried to make your population-based optimizer parallel? I guess you can evaluate multiple candidates in the population at the same time, simulating a batch size > 1 (even if the purpose is radically different). – Luca Anzalone Apr 25 '23 at 19:57
  • I have access to a maximum of 40 cores, so I use Python's multiprocessing module on the objective function that computes the loss over the 100,000-sample set. As you correctly guessed, this did increase my speed compared to my initial code, but the situation is still as I describe in my question. I have seen papers that mention how easily they used population algorithms with comparable overhead and better accuracy, but they didn't provide any code. So I feel there is something wrong with my approach to the implementation, even though it's technically correct. – knowledge_seeker Apr 25 '23 at 20:01
  • I see, maybe you need a more efficient way to improve candidates. Is it feasible for you to do something like what is done in [PBT](https://arxiv.org/pdf/1711.09846.pdf)? See the `explore` and `exploit` strategies, for example. As for code, I have found [this](https://github.com/instadeepai/fastpbrl), have a look. – Luca Anzalone Apr 25 '23 at 20:17
  • @LucaAnzalone Thank you for the links. I read the paper carefully after your suggestion. I see that they propose PBT to jointly optimize the model parameters and hyperparameters. So given a population size, they initialize N models with different hyperparameters, and the 'exploit' and 'explore' steps determine how to use the information from each of the N models in the population. But they still compute the objective function for every population member, i.e. each of the N models. Of course, they do it in parallel. – knowledge_seeker Apr 25 '23 at 22:50
  • @LucaAnzalone However, one idea I get from the paper is that maybe I should use a GPU to compute the objective-function value for each member of the population? I get this idea because the paper mentions, for one case study, 'where each member of the population is a single GPU optimising a Transformer network for 400 × 10^3 steps', and they used a population size of 32. Of course, I do not have access to a GPU currently, and the most I can get later is 4 GPUs, whereas my population matrix will be at least 50×2800, i.e. 50 candidates, each encoding 2,800 DNN weights and biases. – knowledge_seeker Apr 25 '23 at 22:53
  • Training on GPU is usually a good idea; I've also seen another paper (the one related to the repo) that uses [Jax](https://github.com/google/jax) to increase the speedup even more. Anyway, what if you initialize the population with Adam to reach good parameters first and then continue with the GA? This may speed up convergence and reduce total training time. – Luca Anzalone Apr 26 '23 at 17:31
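Two of the ideas from the comments, evaluating the whole population in parallel and warm-starting it from an Adam solution, can be sketched together in a few lines. This is a toy illustration under stated assumptions: an 8-weight linear "network" stands in for the DNN, `w_adam` is a made-up pretrained weight vector, and the "parallelism" is a single NumPy matrix multiply over all candidates (the same pattern vectorizes on a GPU with CuPy or JAX).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: a linear "network" with 8 weights and 256 samples.
X = rng.normal(size=(256, 8))
w_true = rng.normal(size=8)
y = X @ w_true

# Pretend a short Adam run already got us near the optimum (hypothetical).
w_adam = w_true + 0.3 * rng.normal(size=8)

def fitness_all(pop):
    """Score EVERY candidate with one matrix multiply instead of a Python
    loop over the population: pop is (pop_size, n_weights)."""
    preds = pop @ X.T                       # (pop_size, n_samples)
    return np.mean((preds - y) ** 2, axis=1)

# Warm start: the population is a cloud of small perturbations of w_adam.
warm = w_adam + 0.05 * rng.normal(size=(50, 8))
# Cold start, for comparison: 50 random candidates.
cold = rng.normal(size=(50, 8))

warm_best = fitness_all(warm).min()
cold_best = fitness_all(cold).min()
```

The vectorized `fitness_all` replaces the per-candidate loop (and much of the multiprocessing overhead) with one dense operation, which is exactly what a GPU accelerates well; the warm-started population begins the GA search near a good basin instead of from random weights, which is one way to cut the 3,000-iteration budget.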
