I'm reading the paper Global-Locally Self-Attentive Dialogue State Tracker and following along with the published GLAD implementation.
I was wondering if someone could clarify which variable or score is used to compute the global and local self-attention scores shown in Figure 4 (the heatmap).
It is not clear to me how to derive these scores. The only quantity whose dimensions would match is $p_{utt} = \mathrm{softmax}(a_{utt})$ from the scoring module. However, I do not see the implementation doing anything further with this value.
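For context, here is my reading of the `attend` helper that my snippet below calls (a paraphrase, not the verbatim source; the signature matches how it is called, but please double-check against the GLAD code):

```python
import torch
import torch.nn.functional as F

def attend(seq, cond, lens):
    """Paraphrase of GLAD's attend helper (not the verbatim source).

    seq:  (batch, time, hidden) encoded sequence, e.g. H_utt
    cond: (batch, hidden) condition vector, e.g. a value encoding c_val
    lens: true sequence lengths, used to mask padding
    """
    # dot-product score between the condition and every time step
    scores = seq.bmm(cond.unsqueeze(2)).squeeze(2)     # (batch, time)
    for i, l in enumerate(lens):
        scores.data[i, l:] = float('-inf')             # ignore padded positions
    scores = F.softmax(scores, dim=1)                  # normalized weights
    # attention-weighted summary of the sequence
    context = torch.einsum('bt,bth->bh', scores, seq)  # (batch, hidden)
    return context, scores
```

If the second return value is indeed already softmax-normalized, then the `a_utt` in my snippet below would correspond to the paper's $p_{utt}$ rather than the pre-softmax $a_{utt}$, which may be part of my confusion.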
So, what I did was the following:
q_utts = []
a_utts = []
for c_val in C_vals:
    # attend over the encoded utterance, conditioned on one value encoding
    q_utt, a_utt = attend(
        H_utt,
        c_val.unsqueeze(0).expand(len(batch), *c_val.size()),
        lens=utterance_len,
    )
    q_utts.append(q_utt)
    a_utts.append(a_utt)
# average the per-value attention weights across all values of the slot
attention_score = torch.mean(torch.stack(a_utts, dim=1), dim=1)
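For reference, these are the shapes I believe are involved (assuming hidden size `d` and padded utterance length `T`; the names are from my snippet, not necessarily the repo's):

```python
# H_utt:           (batch, T, d)           encoded utterance
# c_val:           (d,)                    one value encoding from C_vals
# a_utt:           (batch, T)              attention weights for that value
# stacked a_utts:  (batch, num_values, T)  one row per value of the slot
# attention_score: (batch, T)              mean over the value dimension
```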
But the resulting attention scores differ substantially from what I expect based on the heatmaps in Figure 4.