To my understanding, this is basically a supervised learning problem: self-play provides games labelled with their eventual winners, and the network is trained to map game states to the likelihood of winning.
Yes, although the data for this supervised learning problem was generated by self-play. As AlphaZero learned, its evaluations of the same positions needed to change, so this is a non-stationary problem, requiring that the learner forget older training examples over time.
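One simple way to get that forgetting (AlphaGo Zero reportedly trained on a window of the most recent self-play games) is a fixed-capacity buffer that silently drops the oldest positions as new ones arrive. A minimal sketch; the capacity, batch size and label convention here are illustrative:

```python
import random
from collections import deque

class SelfPlayBuffer:
    """Keeps only the most recent self-play positions, so the training
    data always reflects the current strength of the network and stale
    evaluations age out automatically."""

    def __init__(self, capacity=500_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off the left

    def add_game(self, positions, winner):
        # Label every position in a finished game with the eventual outcome,
        # seen from the side to move: winner and player_to_move are +1/-1
        # (winner is 0 for a draw), giving a target of +1 win / -1 loss / 0 draw.
        for state, player_to_move in positions:
            self.buffer.append((state, winner * player_to_move))

    def sample_batch(self, batch_size=256):
        # Uniform sampling over the recent window for one training step.
        return random.sample(self.buffer, batch_size)
```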
What part of the game is the network trained to predict a winner on?
Potentially all of it, including the starting empty board. I am not sure whether the empty board was evaluated in this way, but it is not only feasible, it can be done exactly for simpler games (Tic Tac Toe and Connect 4, for example), given known player policies.
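As an illustration, in Tic Tac Toe the exact value of any position, the empty board included, can be computed by exhaustive search. The sketch below assumes optimal play on both sides (one particular choice of "known player policies") and confirms that the empty board is a forced draw:

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def value(board, player):
    """Exact value for the side to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1   # previous player just won
    if all(cell is not None for cell in board):
        return 0                          # board full: draw
    opponent = 'O' if player == 'X' else 'X'
    # Negamax: my value is the best of the negated values my opponent
    # faces after each of my legal moves.
    return max(-value(board[:i] + [player] + board[i+1:], opponent)
               for i in range(9) if board[i] is None)

print(value([None] * 9, 'X'))  # -> 0: perfect play from the empty board draws
```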
Obviously after only five moves the winner is not yet clear, and a network trained to predict the game's eventual winner from such an early position would surely learn a meaningless function.
Not at all. This is purely a matter of complexity and difficulty. In practice, at such an early stage the value network will output something non-committal, such as a $p=0.51$ win chance for player 1. It will have learned to do this because, in its self-play experience, similar positions at the start of the game led to player 1 and player 2 winning in almost equal numbers.
Nor is the function meaningless: it can be used to assess the results of look-ahead searches without playing to the end of the game, and it completely replaces the position-evaluation heuristics used in more traditional game tree searches. In practice, very early position data in something as complex as chess or Go will not be as useful as later position evaluations, because the predictions are so equivocal. For consistency, however, it can still be learned and used in the game algorithms.
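As a sketch of that replacement, here is a depth-limited negamax search that consults the value network at the horizon instead of playing lines out to the end. The `state` interface (`is_terminal`, `result`, `legal_moves`, `play`) and `value_net` are hypothetical stand-ins, and note that AlphaZero itself uses Monte Carlo tree search rather than fixed-depth negamax:

```python
def negamax(state, depth, value_net):
    """Depth-limited look-ahead in which the learned value network sits
    exactly where a hand-crafted evaluation heuristic traditionally would."""
    if state.is_terminal():
        return state.result()       # exact outcome: +1 win, 0 draw, -1 loss
    if depth == 0:
        # Search horizon reached: instead of playing the game out,
        # ask the value network to estimate the current position.
        return value_net(state)     # in [-1, 1], from the side to move
    # Otherwise, the best move is the one whose resulting position is
    # worst for the opponent (hence the sign flip on the recursive call).
    return max(-negamax(state.play(move), depth - 1, value_net)
               for move in state.legal_moves())
```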
How is the network trained to understand that, if all it is told is who eventually won?
If a supervised learning technique is given the same input $X$, which across different examples carries the labels $A, B, B, B, A, A, B, B$, then it should learn $p(B \mid X) = 5/8 = 0.625$. That value minimises the cross-entropy loss, and it is exactly what is going on here.
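This is easy to verify numerically: fitting a single probability to those eight labels by gradient descent on the cross-entropy loss converges to the empirical frequency. A small self-contained sketch:

```python
import math

labels = [0, 1, 1, 1, 0, 0, 1, 1]   # A = 0, B = 1: five Bs out of eight

logit = 0.0                         # one learnable parameter, p = sigmoid(logit)
lr = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + math.exp(-logit))
    # For a sigmoid output, the gradient of the mean cross-entropy
    # with respect to the logit is simply mean(p - y).
    grad = sum(p - y for y in labels) / len(labels)
    logit -= lr * grad

print(round(1.0 / (1.0 + math.exp(-logit)), 3))  # -> 0.625, the empirical p(B|X)
```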