Whilst identity recognition can be viewed strictly as a classification problem, at scale it is typically solved as a regression problem onto a description vector, followed by a lookup of the nearest vectors outside the AI (e.g. with a database search optimised for nearest-neighbour queries over multi-dimensional vectors). To support this approach, the regression system is trained to minimise the distance between pictures of objects with the same identity and maximise the distance between pictures of objects with different identities.
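The lookup step can be sketched as follows. This is a minimal illustration using NumPy with a made-up gallery of 128-dimensional description vectors and hypothetical identity names; a production system would use an indexed vector database rather than a brute-force scan.

```python
import numpy as np

# Hypothetical gallery: one 128-dim description vector per enrolled identity.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 128))     # 5 enrolled identities
names = ["alice", "bob", "carol", "dan", "eve"]

def identify(query, gallery, names):
    """Return the enrolled name whose vector is nearest (Euclidean) to the query."""
    dists = np.linalg.norm(gallery - query, axis=1)
    return names[int(np.argmin(dists))]

# A query vector slightly perturbed from carol's should still resolve to "carol".
query = gallery[2] + 0.01 * rng.normal(size=128)
print(identify(query, gallery, names))  # → carol
```

The same nearest-vector search is what a purpose-built vector index accelerates at scale; the network's only job is to place same-identity images close together in this space.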
Here's a reference to the triplet loss training approach used to do this.
In brief, the training database requires at least two images of each subject. Triplets of images are selected (an anchor, a second image of the same identity, and an image of a different identity), with a bias towards scoring "difficult" combinations. Each image is converted to a description vector by the neural network, and the distances between these vectors form the loss function (so unlike many supervised learning schemes, the loss is not computed per function call, but over sets of three function calls).
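The loss over one such triplet can be sketched as below. This is an illustrative NumPy version of the standard triplet hinge loss; the margin value is an assumption, and in practice the three vectors would come from the network's three forward passes.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss over one (anchor, positive, negative) triplet of description vectors.

    Penalises the network unless the same-identity pair is closer than the
    different-identity pair by at least `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance, same identity
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance, different identity
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])  # same identity, already close
n = np.array([1.0, 1.0])  # different identity, already far
print(triplet_loss(a, p, n))  # → 0.0, the triplet is already satisfied
```

Note that an "easy" triplet like the one above contributes zero loss and hence zero gradient, which is why the selection step biases towards difficult combinations: triplets where the different-identity image is nearly as close as (or closer than) the same-identity one.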
Prior to the use of the triplet loss function, it was quite common to have neural networks learn to predict actual biometrics (e.g. the distance between the eyes) from images. This had the advantage that the description vector was meaningful, but the disadvantages of requiring a lot of difficult data collection, with no guarantee that the chosen biometrics would separate identities well in practice for a given population.