Short answer: Multimedia applications implement their own protocols on top of UDP. These protocols provide the functionality the applications need. In fact, they offer more functionality, and are more complex, than TCP.
Less short answer: see the answer of @Zac67.
Long answer: I hope that the text below will give you a better understanding of this issue.
Transmitting Multimedia over IP
First, we need to differentiate between different types of multimedia. It matters whether we are dealing with a pre-recorded video stream (e.g., YouTube) or a live interactive conversation (e.g., Skype). I call the latter real-time multimedia. There is also live broadcast (e.g., Twitch), but I am not familiar with it.
TCP is perfectly suitable for transmitting YouTube videos. YouTube does some things to make sure that you do not really see if there was a retransmission.
Real-Time Multimedia
The Problem
Basically, the problem is that we have to work around the following fact of life:
A multimedia system should ideally have a mouth-to-ear delay under 200 ms for humans to be comfortable. With a mouth-to-ear delay over 400 ms, the human brain no longer considers a conversation interactive.
This applies to every system that people use to talk: phones, cell phones, etc.
Note that mouth-to-ear means the time from when you speak until your words are played on the speaker on the other side. This includes more than just network delay.
So, basically, one of the goals of multimedia protocols is to work around this limitation.
Why not TCP
Let's see what happens when one packet is lost but subsequent packets still arrive. TCP must deliver data to the application in order. Thus TCP buffers the subsequent packets internally while it signals the source that a packet is missing. Only once the missing packet has been retransmitted and received are it and all the buffered packets delivered. This is called head-of-line blocking.
The problem for real-time multimedia is that this retransmission can cause a delay of more than 400 ms. In addition, the way we deal with missing packets requires access to those subsequent packets, which TCP withholds until the gap is filled. Thus VoIP cannot use TCP.
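To make head-of-line blocking concrete, here is a minimal sketch of a TCP-like receiver that only releases packets in order. It is not real TCP: the function name `deliver_in_order`, the sequence numbers, and the timeline (packet 2 lost, retransmission arriving at 300 ms) are all made up for illustration.

```python
def deliver_in_order(arrivals):
    """Simulate a receiver that hands packets to the application in order.

    arrivals: list of (arrival_time_ms, sequence_number) in arrival order.
    Returns a list of (delivery_time_ms, sequence_number).
    """
    buffered = {}        # out-of-order packets waiting for the gap to fill
    next_expected = 1    # next sequence number the application may receive
    delivered = []
    for time_ms, seq in arrivals:
        buffered[seq] = time_ms
        # Release every packet that is now contiguous with what was delivered.
        while next_expected in buffered:
            del buffered[next_expected]
            delivered.append((time_ms, next_expected))
            next_expected += 1
    return delivered


# Hypothetical timeline: packet 2 is lost and its retransmission
# only arrives at t = 300 ms, long after packets 3, 4 and 5.
arrivals = [(0, 1), (40, 3), (60, 4), (80, 5), (300, 2)]
for time_ms, seq in deliver_in_order(arrivals):
    print(f"t={time_ms:3d} ms: packet {seq} delivered to the application")
```

In this sketch packets 3, 4 and 5 arrive on time but reach the application only at 300 ms, together with packet 2. A protocol on top of UDP would see them as soon as they arrive.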
Why UDP
Well, we are not using UDP by itself; we are designing protocols that work on top of UDP. UDP has ports. Other than ports (and a checksum), UDP provides the same service as IP. I think we could have a different transport protocol instead of UDP, rather than one on top of it. I do not know of any strong reasons for or against this, except that in terms of packet header overhead it does not really matter.
How to deal with missing packets
OK, retransmissions are not an option. But we have to somehow deal with missing fragments.
Basically, we have two options. Option 1 is to include "redundancy" in the stream (send extra packets). Option 2 is to use the properties of voice/video streams to approximately reconstruct the data ("interpolation") or, more precisely, to present the human with something plausible in place of the missing packet. AFAIK multimedia streams use a combination of both.
Sending redundant data
The idea is that the sender sends extra packets and, if a packet is lost, these extra packets can be used to reconstruct the missing packet.
One example of such functionality is a class of methods called forward error correction (FEC). The most basic FEC scheme works as follows. A sender has to send p1, p2, p3, and p4. It then constructs and sends a FEC packet p5 = p1 xor p2 xor p3 xor p4. If the receiver receives any 4 of these 5 packets, it can reconstruct p1 to p4. There are more complicated schemes than this that handle more than one packet loss.
Note that FEC only allows you to deal with a limited number of missing packets. There is no guarantee that, at some point, more packets will not go missing than was anticipated. Another method is still needed if this happens.
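The XOR scheme above can be sketched in a few lines. This is only an illustration of the p5 = p1 xor p2 xor p3 xor p4 idea, not the FEC format of any real protocol. The helper names (`xor_bytes`, `make_parity`, `recover`) are hypothetical, and the sketch assumes all payloads have the same length and that the receiver knows which slot is missing.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Build the FEC packet p5 = p1 xor p2 xor p3 xor p4."""
    return reduce(xor_bytes, packets)


def recover(received):
    """Rebuild at most one missing packet.

    received: the 5 slots (p1..p4 plus parity), with None for a lost packet.
    Returns the 4 data packets, or None if more than one packet is missing.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        return None  # this simple scheme cannot repair two losses
    received = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # XOR of the four surviving packets equals the lost one.
        received[missing[0]] = reduce(xor_bytes, present)
    return received[:4]


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # made-up, equal-length payloads
group = data + [make_parity(data)]

one_lost = [group[0], None, group[2], group[3], group[4]]
print(recover(one_lost) == data)   # p2 is rebuilt from the other four

two_lost = [None, None, group[2], group[3], group[4]]
print(recover(two_lost))           # two losses: recovery is not possible
```

The second call shows the limitation mentioned above: with two packets missing from the same group, this scheme has nothing to offer and another mechanism has to take over.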
Interpolation
Voice and video data usually have a lot of similarity between neighbouring packets. Or, better said, the human eye and ear are not that picky. For example, if you say hllo instead of hello, most people will still understand. (I think one packet holds less than one sound.) If you are watching a video at 25 fps and one frame is missing, you probably won't notice it at all. So there are ways for receivers to deal with missing packets, for example just replaying the previous packet (video), or doing some interpolation between the previous and the next packet.
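As a rough sketch of the two receiver-side tricks just mentioned (replay the previous packet, or interpolate between the previous and the next one), consider the following. Real codecs use far more sophisticated concealment. The function `conceal` and the representation of a frame as a plain list of numbers are assumptions made only for this example.

```python
def conceal(frames):
    """Fill in lost frames (None) in a list of received frames.

    Each frame is a list of sample values. A lost frame is replaced by the
    average of its neighbours when both are available, otherwise by a copy
    of whichever neighbour exists.
    """
    out = list(frames)
    for i, frame in enumerate(frames):
        if frame is not None:
            continue
        prev_frame = out[i - 1] if i > 0 else None
        next_frame = frames[i + 1] if i + 1 < len(frames) else None
        if prev_frame is not None and next_frame is not None:
            # Interpolate: element-wise average of the two neighbours.
            out[i] = [(a + b) / 2 for a, b in zip(prev_frame, next_frame)]
        elif prev_frame is not None:
            out[i] = list(prev_frame)   # replay the previous frame
        elif next_frame is not None:
            out[i] = list(next_frame)   # nothing before it: reuse the next one
    return out


received = [[0.0, 2.0], None, [4.0, 6.0]]   # the middle frame was lost
print(conceal(received))
```

Note that interpolation needs the frame after the gap. A receiver can only do this if packets that arrive after a loss are handed to it immediately.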
Yes, this means that the stream on the receiver may not be the same as the stream on the sender. Multimedia streams can tolerate this.
Note that even if interpolation fails, the human can still request a retransmission, aka "could you repeat this", "I didn't hear what you said" and so on. Of course, if this happens all the time the system will not be usable, but if it happens rarely it is OK.
What is different in non real time multimedia?
Or why can YouTube use TCP?
The difference is that the 200 ms/400 ms delay requirement is not there. There is no strict bound on the delay between sending a packet, receiving this packet, and playing the audio/video segment it contains. That is why we just use TCP for convenience and design the application to hide this delay.
YouTube videos are pre-recorded. You can transmit them with the speed that the network allows you. Ideally this happens faster than playback speed.
Also, the video can be (and, I think, is) pre-buffered. The player can wait until it has, for example, 5 seconds of video in the buffer before playing the first frame. So, if there is a packet loss, the player still has 5 seconds of video to show, and by the time that runs out the packet will have been retransmitted. If 5 seconds are too few, the buffer can be made larger.
Note: real-time multimedia also uses similar buffers because of so-called jitter (variation in the packet interarrival interval), which is not covered here. But because of the 200 ms delay bound, this buffer can only hold a few tens of milliseconds worth of data before it goes to the speaker/display.
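The effect of pre-buffering can be illustrated with a toy simulation. This is not how YouTube's player works; it is only a sketch under simple assumptions: data arrives at a constant rate, there is one outage during which nothing arrives (standing in for waiting on a retransmission), and all names and numbers (`simulate_player`, the 3-second outage, and so on) are made up.

```python
def simulate_player(prebuffer_ms, outage_start_ms, outage_len_ms,
                    video_len_ms, download_speedup=1, step_ms=10):
    """Return how long (ms) playback is stalled after it has started.

    download_speedup: ms of video received per ms of wall-clock time
                      (must be positive).
    outage_*:         a window during which no data arrives.
    """
    downloaded = 0     # ms of video received so far
    played = 0         # ms of video already shown
    started = False
    stalled_ms = 0
    now = 0
    while played < video_len_ms:
        in_outage = outage_start_ms <= now < outage_start_ms + outage_len_ms
        if not in_outage:
            downloaded = min(video_len_ms,
                             downloaded + download_speedup * step_ms)
        # Start playing once the pre-buffer is full (or the video is complete).
        if not started and (downloaded >= prebuffer_ms
                            or downloaded == video_len_ms):
            started = True
        if started:
            available = downloaded - played
            if available > 0:
                played += min(step_ms, available)
            else:
                stalled_ms += step_ms   # buffer is empty: playback freezes
        now += step_ms
    return stalled_ms


# Same 3 s outage starting at t = 6 s, two different pre-buffer sizes.
for prebuffer in (5000, 500):
    stall = simulate_player(prebuffer_ms=prebuffer, outage_start_ms=6000,
                            outage_len_ms=3000, video_len_ms=60000)
    print(f"pre-buffer {prebuffer} ms -> stalled for {stall} ms")
```

With the larger pre-buffer the outage is absorbed and the viewer sees nothing, while the small pre-buffer runs dry and playback freezes. The price is start-up delay, which is acceptable for pre-recorded video but not for a conversation.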
Summary
- Real-time multimedia (interactive! conversations) cannot tolerate retransmissions.
- Because of this we need different ways to deal with missing packets.
- It so happens that these ways cannot tolerate head-of-line blocking, i.e., if a packet is missing, the packets received after it still need to be delivered and processed right away.
- Thus we need special protocols that are suited for real-time multimedia.
Not covered issues
- Multimedia transport protocols do much more than just deal with missing packets.
- Jitter (the interval between the arrival of different packets varies) is an issue too.
- TCP congestion control, i.e., the specific way TCP does it, is also not suitable for multimedia. But this is an advanced topic and would require another wall of text like this one.