Is attention useful only in transformer and convolutional layers? Can I add it to linear (fully connected) layers? If so, how, on a conceptual level (not necessarily the code to implement the layers)?
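For concreteness, here is a rough sketch (PyTorch, all names made up by me) of one way I imagine it could look: a linear layer whose output features are re-weighted by learned attention scores. I'm not asking whether this exact design is right, just whether the general idea makes sense.

```python
import torch
import torch.nn as nn

class AttentiveLinear(nn.Module):
    """Hypothetical example: a dense layer followed by feature-wise attention."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # A second branch scores each output feature from the same input.
        self.attn = nn.Linear(in_features, out_features)

    def forward(self, x):
        h = self.linear(x)                              # ordinary dense features
        weights = torch.softmax(self.attn(x), dim=-1)   # attention weights over features
        return h * weights                              # re-weighted ("attended") features

x = torch.randn(4, 16)
layer = AttentiveLinear(16, 32)
print(layer(x).shape)  # torch.Size([4, 32])
```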