The popular implementations of ViTs by Ross Wightman and Phil Wang add the position embedding to the class token as well as to the patch tokens.
Is there any point in doing so?
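For concreteness, here is a minimal sketch of the pattern I mean (a simplification of the usual PyTorch token preparation, not the actual timm / vit-pytorch code):

```python
import torch
import torch.nn as nn

class TokensWithCLS(nn.Module):
    """Simplified ViT token preparation: positions cover the [CLS] slot too."""
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # One positional embedding per token, *including* the [CLS] slot.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)        # (b, 1, dim)
        x = torch.cat([cls, patch_tokens], dim=1)     # (b, num_patches + 1, dim)
        return x + self.pos_embed                     # [CLS] also receives a positional vector
```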
The purpose of introducing positional embeddings into the Transformer is clear: in its original formulation the Transformer is equivariant to permutations of tokens, while the task does not respect this symmetry, so one needs to break it in some way (ideally leaving only translational symmetry), and this is exactly what the positional embedding, learned or fixed, achieves.
However, the class token is fundamentally different from the patch tokens: there is no sense in which it is located at, say, the [16:32, 48:64] slice of the image.
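The variant I would naively expect instead adds positions to the patch tokens only, e.g. (again only a sketch, not code from either library):

```python
class TokensWithCLSNoPos(nn.Module):
    """Same as above, but the positional embedding covers only the patches."""
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.shape[0]
        x = patch_tokens + self.pos_embed             # positions added to patches only
        cls = self.cls_token.expand(b, -1, -1)
        return torch.cat([cls, x], dim=1)             # [CLS] stays position-free
```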
Or is this choice simply a matter of convenience? The additional parameter does have a negligible cost, and adding a positional embedding to the [CLS] (or any other special) token brings neither benefit nor harm?