Many deep learning researchers come up with new CNN architectures.
These architectures are often just combinations of a few existing layer types.
Along with their mathematical intuition, do they generally visualize intermediate steps by running the network, and then use trial and error (essentially brute force) to arrive at state-of-the-art architectures?
By "visualizing intermediate steps" I mean printing or plotting outputs in a format suitable for analysis. Intermediate steps may be feature maps in a CNN, hidden states in an RNN, outputs of hidden layers in an MLP, etc. A concrete sketch of what I mean is below.
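For concreteness, here is a minimal sketch of one way to do this, assuming PyTorch and matplotlib (the toy model, the hooked layer, and all names here are purely illustrative, not anyone's actual method): capture a layer's feature maps with a forward hook and plot a few channels.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Toy CNN; any model works the same way.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

# Capture the output (feature maps) of a chosen layer with a forward hook.
captured = {}

def save_output(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

model[2].register_forward_hook(save_output("conv2"))

# Run a dummy input through the network.
x = torch.randn(1, 3, 32, 32)
model(x)

# Plot the first few feature maps of the hooked layer.
feature_maps = captured["conv2"][0]  # shape: (channels, H, W)
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for i, ax in enumerate(axes):
    ax.imshow(feature_maps[i].numpy(), cmap="gray")
    ax.set_title(f"channel {i}")
    ax.axis("off")
plt.show()
```

The same hook-based pattern would apply to RNN hidden states or MLP hidden-layer outputs; only what you plot changes.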