In NLP and sequence-modeling problems, Transformer architectures based on the self-attention mechanism (introduced in Attention Is All You Need) have achieved impressive results and are now the default choice for this kind of task.
However, most architectures in the literature are aimed at rather complex language-modeling tasks ([1], [2]), and they are correspondingly large and computationally expensive.
There are several ways to reduce the cost of these models, such as knowledge distillation or methods that address the $O(n^2)$ complexity of self-attention ([3], [4]).
Still, these models target language modeling and retain a large number of parameters.
I wonder whether there are successful applications of Transformers with a very small number of parameters (1k-10k) in signal processing, where inference has to be very fast, so heavy, computationally expensive models are ruled out.
So far the common choices are CNN or RNN architectures, but I wonder whether there are results where lightweight Transformers have reached SOTA at this extremely small scale.
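For concreteness, here is a rough sketch (in PyTorch, with hypothetical layer sizes I chose only to illustrate the parameter budget, not taken from any paper) of the kind of model I have in mind: a two-layer encoder over a 1-D signal that lands at roughly 4.5k parameters.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Illustrative ~4.5k-parameter encoder for 1-D signals (sizes are assumptions)."""
    def __init__(self, d_model=16, nhead=2, dim_ff=32, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(1, d_model)     # embed each scalar sample
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=dim_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)           # e.g. per-step regression output

    def forward(self, x):                           # x: (batch, seq_len, 1)
        return self.head(self.encoder(self.input_proj(x)))

model = TinyTransformer()
print(sum(p.numel() for p in model.parameters())) # ~4.5k parameters
```

Something of this size would easily fit the 1k-10k budget; the question is whether such a model can actually compete with small CNNs/RNNs on signal-processing tasks.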