I was reading the ELMo paper, and it describes task-specific representations of words (or tokens, more generally) using the following equation: $ELMo_{k}^{task} = \gamma^{task}\sum_{j=0}^{L}{s_{j}^{task}h_{k,j}^{LM}}$, where $ELMo_{k}^{task}$ is the representation of the $k$-th token in an example of a given $task$, $\gamma^{task}$ is a scaling factor that helps with the optimization process, $s_{j}^{task}$ are the softmax-normalized weights, and $h_{k,j}^{LM}$ are the hidden states of the pretrained language model.
The $s_{j}^{task}$ are supposed to be learnable parameters, but I don't see how they are learned. Should there be a "dense" layer with a softmax activation function that takes the $h_{k,j}^{LM}$ as input and outputs $ELMo_{k}^{task}$?
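Concretely, is it something like the following? This is only a minimal PyTorch sketch of what I imagine; the class name `ScalarMix`, the tensor shapes, and the initialization are my own assumptions, with one free scalar per layer pushed through a softmax rather than a full dense layer:

```python
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    """Combine L+1 layers of hidden states with softmax-normalized scalar weights.

    Sketch of my reading of the equation: one learnable scalar per layer
    (not a dense layer over the hidden dimension), plus a global gamma.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # One raw (unnormalized) scalar per layer; softmax is applied in forward().
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        # The task-specific scaling factor gamma^{task}.
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, dim) -- the h_{k,j}^{LM} for all tokens k.
        s = torch.softmax(self.scalar_weights, dim=0)            # s_j^{task}
        mixed = (s.view(-1, 1, 1, 1) * hidden_states).sum(dim=0) # weighted sum over layers j
        return self.gamma * mixed                                # ELMo_k^{task} for every token k


# Hypothetical usage: 2 biLM layers plus the token embedding layer -> 3 sets of hidden states.
layers = torch.randn(3, 8, 20, 1024)   # (num_layers, batch, seq_len, dim), dummy values
mix = ScalarMix(num_layers=3)
elmo_reprs = mix(layers)                # (8, 20, 1024)
```

If this is right, then the $s_{j}^{task}$ and $\gamma^{task}$ would simply be trained end-to-end by backpropagating the downstream task loss, with no separate dense layer involved. Is that the intended mechanism?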