After going through both the "Illustrated Transformer" and "Annotated Transformer" blog posts, I still don't understand how the sinusoidal encodings represent the position of elements in the input sequence.
Is it that, because each row of the encoding matrix (one position in the input sequence) gets a unique waveform, and the encoding at any position can be expressed as a linear function of the encoding at any other position, the transformer can learn relations between positions via those linear functions?
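For concreteness, here is a minimal NumPy sketch of how I understand the encodings are built (the function name `sinusoidal_encoding` is my own; the formulas follow the "Attention Is All You Need" paper), plus a check of the linear-function property I'm asking about:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    """PE[pos, 2i]   = sin(pos / base**(2i / d_model))
       PE[pos, 2i+1] = cos(pos / base**(2i / d_model))"""
    positions = np.arange(seq_len)[:, None]       # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # shape (1, d_model // 2)
    angles = positions / base ** (dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even columns: sine
    pe[:, 1::2] = np.cos(angles)                  # odd columns: cosine
    return pe

# The "linear function" property for one (sin, cos) frequency pair:
# [sin(w*(p+k)), cos(w*(p+k))] = R(w*k) @ [sin(w*p), cos(w*p)],
# where the rotation-like matrix R depends only on the offset k,
# not on the absolute position p.
pe = sinusoidal_encoding(seq_len=50, d_model=16)
w = 1.0 / 10000.0 ** (0 / 16)   # frequency of the first dimension pair
p, k = 10, 3
R = np.array([[np.cos(w * k),  np.sin(w * k)],
              [-np.sin(w * k), np.cos(w * k)]])
assert np.allclose(pe[p + k, :2], R @ pe[p, :2])
```

If my understanding above is right, the assert should hold for every frequency pair and every offset k, which would be what lets attention pick up relative positions. Is that the intended mechanism?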