Suppose that we are doing machine translation. We have a conditional language model with attention, where we are trying to predict a sequence $y_1, y_2, \dots, y_J$ from $x_1, x_2, \dots, x_I$: $$P(y_1, y_2, \dots, y_{J}|x_1, x_2, \dots, x_I) = \prod_{j=1}^{J} p(y_j|v_j, y_1, \dots, y_{j-1})$$ where $v_j$ is a context vector that is different for each $y_j$. Using an RNN with an encoder-decoder structure, each element $x_i$ of the input sequence and $y_j$ of the output sequence is converted into a hidden state $h_i$ and $s_j$, respectively: $$h_i = f(h_{i-1}, x_i) \\ s_j = g(s_{j-1},[y_{j-1}, v_j])$$ where $f$ is some function of the previous encoder state $h_{i-1}$ and the current input word $x_i$, and $g$ is some function of the previous decoder state $s_{j-1}$, the previous output word $y_{j-1}$ and the context vector $v_j$.
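For concreteness, here is a minimal NumPy sketch of these recurrences. The cell functions, dimensions, and parameter names are my own assumptions for illustration; $f$ and $g$ could just as well be GRU or LSTM cells:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (arbitrary, for illustration only)

# Toy parameters for the encoder cell f and decoder cell g
# (simple tanh RNN cells assumed here).
W_hh, W_hx = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_ss, W_sy = rng.normal(size=(d, d)), rng.normal(size=(d, 2 * d))

def f(h_prev, x_i):
    """Encoder step: h_i = f(h_{i-1}, x_i)."""
    return np.tanh(W_hh @ h_prev + W_hx @ x_i)

def g(s_prev, y_prev, v_j):
    """Decoder step: s_j = g(s_{j-1}, [y_{j-1}, v_j])."""
    return np.tanh(W_ss @ s_prev + W_sy @ np.concatenate([y_prev, v_j]))

# Encode a toy source sequence x_1, ..., x_I into states h_1, ..., h_I.
xs = [rng.normal(size=d) for _ in range(5)]
h = np.zeros(d)
encoder_states = []
for x_i in xs:
    h = f(h, x_i)
    encoder_states.append(h)

# One decoder step, with a dummy context vector for now;
# attention (below) is what actually produces v_j.
y_prev = rng.normal(size=d)
v_j = encoder_states[-1]   # placeholder until v_j is defined via attention
s_j = g(np.zeros(d), y_prev, v_j)
```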
Now, we want the process of predicting $s_j$ to "pay attention" to the correct parts of the encoder states via the context vector $v_j$. So: $$v_j = \sum_{i=1}^{I} \alpha_{ij} h_i$$ where $\alpha_{ij}$ tells us how much weight to put on the $i^{th}$ encoder state when predicting the $j^{th}$ word of the output sequence. Since we want the $\alpha_{ij}$s to be probabilities, we apply a softmax over the similarities between the encoder states and the decoder state: $$\alpha_{ij} = \frac{\exp(\text{sim}(h_i, s_{j-1}))}{\sum_{i'=1}^{I} \exp(\text{sim}(h_{i'}, s_{j-1}))}$$
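In code, the attention weights and context vector might look like the following sketch. The `sim` argument is a placeholder; a plain dot product is used here only to make the example runnable, and the additive scoring function defined next would replace it:

```python
import numpy as np

def attention_context(encoder_states, s_prev, sim):
    """v_j = sum_i alpha_{ij} h_i, with alpha_{ij} = softmax_i(sim(h_i, s_{j-1}))."""
    scores = np.array([sim(h_i, s_prev) for h_i in encoder_states])
    scores -= scores.max()                          # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()  # softmax over source positions i
    v_j = sum(a * h_i for a, h_i in zip(alphas, encoder_states))
    return v_j, alphas

# Example with a dot-product similarity (placeholder choice).
rng = np.random.default_rng(1)
H = [rng.normal(size=8) for _ in range(5)]   # encoder states h_1, ..., h_I
s_prev = rng.normal(size=8)                  # previous decoder state s_{j-1}
v_j, alphas = attention_context(H, s_prev, sim=np.dot)
```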
Now, in additive attention, the similarities of the encoder and decoder states are computed as: $$\text{sim}(h_i, s_{j}) = \textbf{w}^{T} \text{tanh}(\textbf{W}_{h}h_{i} +\textbf{W}_{s}s_{j})$$
where $\textbf{w}$, $\textbf{W}_{h}$ and $\textbf{W}_{s}$ are learned attention parameters; in other words, the similarity is computed by a one-hidden-layer feed-forward network.
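A minimal sketch of this scoring function, with randomly initialised parameters standing in for the learned ones (the dimensions are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_att = 8, 16  # hidden size and attention hidden size (assumed)

# Attention parameters; in practice these are trained jointly with the network.
W_h = rng.normal(size=(d_att, d))
W_s = rng.normal(size=(d_att, d))
w = rng.normal(size=d_att)

def additive_sim(h_i, s_j):
    """sim(h_i, s_j) = w^T tanh(W_h h_i + W_s s_j): a one-hidden-layer MLP score."""
    return w @ np.tanh(W_h @ h_i + W_s @ s_j)

# Plugging the additive score into the softmax / context-vector computation above:
H = [rng.normal(size=d) for _ in range(5)]
s_prev = rng.normal(size=d)
scores = np.array([additive_sim(h_i, s_prev) for h_i in H])
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()
v_j = sum(a * h_i for a, h_i in zip(alphas, H))
```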
What is the intuition behind this definition? Why use the $\text{tanh}$ function? I know that the idea is to use one layer of a neural network to predict the similarities.
Added: this description of machine translation and attention is based on the Coursera course Natural Language Processing.