
Why don't people use nonlinear activation functions after projecting the query, key, and value in attention?

It seems like doing this would lead to much-needed nonlinearity; otherwise, we're just doing linear transformations.

This observation applies to the transformer, additive attention, etc.

nbro
user3180
  • Can you provide an example of someone not using nonlinear activations in their attention? – Philip Raeisghasem May 04 '19 at 21:53
  • I'm not sure if I got your question right; for the attention model, where exactly would you place the non-linearity? Looking at [Graph Attention Networks](https://arxiv.org/pdf/1710.10903.pdf) by Petar Velickovic, they do apply an activation function in eq. 5. – razvanc92 May 03 '19 at 07:21
  • 2
    I think what he means is that the queries, keys and values are computed as linear projections, i.e. the input is simply multiplied by a matrix, `q = x * W_q`, `k = x * W_k` and `v = x * W_v` respectively. We could use a non-linear function on each of them, `q = σ(x * W_q)` etc., but it is redundant because later on we use the softmax function and at the end a MLP which also has non-linearities in it. – Andreas K. Jul 23 '22 at 08:13

1 Answer


> It seems like doing this would lead to much-needed nonlinearity; otherwise, we're just doing linear transformations.

Attention is broadly defined as the following operation ($\text{softmax}$ is sometimes replaced by $\tanh$):

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

where $Q$, $K$, and $V$ are matrices that are themselves functions of the inputs.
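
For reference, here is a minimal NumPy sketch of this operation (the function and variable names are my own, not from any particular library):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of the value vectors
```
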
There are three nonlinear operations there:

  1. The product $QK^T$ is nonlinear in the inputs: it multiplies two functions of the inputs together. For example, in the case of self-attention, $Q = XW_Q$ and $K = XW_K$ are two linear transforms of the same $X$, so $QK^T = X \left(W_Q W_K^T\right) X^T$ is a quadratic function of the inputs.
  2. The $\text{softmax}(x_i) = e^{x_i} /\sum_n e^{x_n}$ function is obviously nonlinear ($\tanh$ is as well).
  3. The final $\text{softmax}(\dots) V$ product is also nonlinear, for the same reason as (1).

I would say it is pretty clear that this is not just a linear transformation: there are quite a few nonlinearities in the attention block, as the quick check below illustrates.
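
As a quick numerical check of point (1), using random placeholder weights: the pre-softmax self-attention scores are quadratic in the input, so doubling the input multiplies the scores by four rather than two.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # a toy input sequence: 4 tokens, dimension 8
W_Q = rng.normal(size=(8, 8))      # hypothetical projection weights
W_K = rng.normal(size=(8, 8))

def scores(X):
    # Pre-softmax self-attention scores: (X W_Q)(X W_K)^T = X (W_Q W_K^T) X^T
    return (X @ W_Q) @ (X @ W_K).T

# A linear map f would satisfy f(2X) = 2 f(X); the scores instead scale quadratically:
print(np.allclose(scores(2 * X), 2 * scores(X)))  # False
print(np.allclose(scores(2 * X), 4 * scores(X)))  # True
```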


> This observation applies to the transformer, additive attention, etc.

Let's see what happens next with the outputs of the attention layers:

In the transformer model, outputs of the multi-head-self-attention are fed into a feed-forward network inside each block:

*[Figure: cutout of the Transformer architecture diagram (Figure 1 of the Transformer paper), showing the attention sub-layer followed by the feed-forward sub-layer in each block.]*

"Feed-forward" means that the inputs are multiplied by a weight matrix and then a nonlinear activation function is applied.

The additive attention approach directly applies another $\text{softmax}$ to the outputs of what one would call the attention block:

$$e_{ij} = v_a^T \tanh\left(W_as_{i-1} + U_a h_j\right)$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$
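
A minimal NumPy sketch of these two equations (the variable names and shapes are my own convention, transposed relative to the paper's column-vector notation):

```python
import numpy as np

def additive_attention_weights(s_prev, H, W_a, U_a, v_a):
    # s_prev: previous decoder state s_{i-1}, shape (d_s,)
    # H:      encoder states h_1..h_T stacked as rows, shape (T, d_h)
    # W_a: (d_s, d_a), U_a: (d_h, d_a), v_a: (d_a,) -- hypothetical shapes
    e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    e = e - e.max()                             # for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()         # alpha_ij = softmax_j(e_ij)
    return alpha
```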


To summarize: I don't think the premise of the question is correct. Various nonlinearities are present inside the attention block itself, and more are typically applied right after the attention output is computed.

Kostya
  • Self-attention does not mean that $Q = K = X$. It only means that $Q = XW_Q$ and $K = XW_K$, or in other words that $K$ and $Q$ are obtained from the same $X$, as opposed to cross-attention, where keys and queries come from different sequences. – hans Mar 11 '23 at 23:51
  • @hans I stand corrected, thank you. Edited the answer to reflect that – Kostya Mar 12 '23 at 13:33