I am trying to find estimates of the typical number of FLOPs required to train a modern model. The FLOPs used for inference are widely reported and cluster around $10^6$ - $10^7$ for many large models. However, I cannot find equivalent statistics for training; the best I can do is guess an order of magnitude as (number of training epochs) times (number of training examples) times (inference FLOPs per example).
When I do this, I estimate that AlexNet and its contemporaries would have required about $10^{18}$ FLOPs to train.
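For concreteness, here is a minimal sketch of that back-of-the-envelope calculation in Python. The concrete inputs (roughly $10^9$ forward-pass FLOPs per image, ~1.2M ImageNet training images, ~90 epochs, and a ~3x multiplier to fold in the backward pass) are my own rough assumptions for an AlexNet-scale setup, not measured values:

```python
# Back-of-the-envelope estimate:
#   training FLOPs ~= epochs * training examples * per-example cost,
# where the per-example cost is the forward-pass (inference) FLOPs times a
# small constant accounting for the backward pass.

def estimate_training_flops(forward_flops_per_example: float,
                            num_examples: int,
                            num_epochs: int,
                            backward_multiplier: float = 3.0) -> float:
    """Order-of-magnitude estimate of total training FLOPs.

    backward_multiplier folds the backward pass into the per-example cost;
    ~2-3x the forward pass is a common rule of thumb.
    """
    return forward_flops_per_example * backward_multiplier * num_examples * num_epochs


# Assumed AlexNet-scale inputs (order-of-magnitude guesses):
flops = estimate_training_flops(forward_flops_per_example=1e9,
                                num_examples=1_200_000,
                                num_epochs=90)
print(f"~{flops:.1e} FLOPs")  # ~3.2e+17, i.e. a few times 10^17
```

Since the result scales linearly with every input, the per-example forward-pass figure is the assumption that dominates the final order of magnitude.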
Is this the right order of magnitude? How many FLOPs do modern networks typically require for all stages of training (I'm also interested in estimates that include hyperparameter tuning)?