I read the FaceNet paper, and one thing I am not sure about (it might be trivial and I missed it) is how training of the network gets kick-started.
At the start of training the embeddings are random, so picking hard (or semi-hard) negatives based on Euclidean distance would effectively select random images.
Do we hope that over time the selection will converge to the actual desired hard images? Is there any reason to expect this convergence to happen?
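
To make the question concrete, here is a minimal sketch of the selection step as I understand it. It is not the paper's implementation: the function names, the embedding dimension of 8, the pool of 1000 negatives, and the margin of 0.2 are all my own assumptions for illustration, and the "embeddings" are just random unit vectors standing in for the output of an untrained network.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (embeddings assumed to live on a hypersphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def semi_hard_negatives(anchor, positive, negatives, margin=0.2):
    """Indices of negatives n with d(a, p) < d(a, n) < d(a, p) + margin.

    The margin value is an assumption made for this sketch.
    """
    d_ap = sq_dist(anchor, positive)
    return [
        i
        for i, n in enumerate(negatives)
        if d_ap < sq_dist(anchor, n) < d_ap + margin
    ]


def random_embedding(rng, dim=8):
    """Hypothetical stand-in for an untrained network: a random unit vector."""
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


rng = random.Random(0)
anchor = random_embedding(rng)
positive = random_embedding(rng)
negatives = [random_embedding(rng) for _ in range(1000)]

picked = semi_hard_negatives(anchor, positive, negatives)
print(len(picked), "of", len(negatives), "negatives pass the semi-hard test")
```

With random embeddings, whichever negatives pass this test do so only because of where they happened to land, not because the images actually look similar to the anchor, which is exactly the situation I am asking about.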