I think you underestimate the size of YOLO. This is the size of one segment of YOLO tiny (YOLOv3-tiny) according to the Darknet .cfg file:
Convolutional Neural Network structure:
416x416x3 Input image
416x416x16 Convolutional layer: 3x3x16, stride = 1, padding = 1
208x208x16 Max pooling layer: 2x2, stride = 2
208x208x32 Convolutional layer: 3x3x32, stride = 1, padding = 1
104x104x32 Max pooling layer: 2x2, stride = 2
104x104x64 Convolutional layer: 3x3x64, stride = 1, padding = 1
52x52x64 Max pooling layer: 2x2, stride = 2
52x52x128 Convolutional layer: 3x3x128, stride = 1, padding = 1
26x26x128 Max pooling layer: 2x2, stride = 2
26x26x256 Convolutional layer: 3x3x256, stride = 1, padding = 1
13x13x256 Max pooling layer: 2x2, stride = 2
13x13x512 Convolutional layer: 3x3x512, stride = 1, padding = 1
13x13x512 Max pooling layer: 2x2, stride = 1 (Darknet pads this layer, so the spatial size stays 13x13)
13x13x1024 Convolutional layer: 3x3x1024, stride = 1, padding = 1
.cfg file found here: https://github.com/pjreddie/darknet/blob/master/cfg/yolov3-tiny.cfg
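To put a rough number on that size, here is a minimal sketch that counts only the kernel weights of the seven 3x3 convolutions listed above. It assumes each convolution takes its input channels from the layer before it (max pooling leaves the channel count unchanged) and ignores biases and batch-norm parameters, so the real figure is slightly higher. The function and variable names are made up for illustration.

```python
# Count the 3x3 kernel weights of the convolutional layers listed above.
# Assumption: biases and batch-norm parameters are ignored.

IN_CHANNELS = 3  # RGB input image
FILTERS = [16, 32, 64, 128, 256, 512, 1024]  # output channels per conv layer
KERNEL = 3  # every convolution in the list is 3x3


def conv_weight_counts(in_channels, filters, kernel=3):
    """Return the number of kernel weights for each conv layer in order."""
    counts = []
    c_in = in_channels
    for c_out in filters:
        counts.append(kernel * kernel * c_in * c_out)
        c_in = c_out  # max pooling does not change the channel count
    return counts


counts = conv_weight_counts(IN_CHANNELS, FILTERS, KERNEL)
total = sum(counts)

for c_out, n in zip(FILTERS, counts):
    print(f"3x3x{c_out:<5d} {n:>10,d} weights")
print(f"total     {total:>10,d} weights")
```

Under these assumptions this segment alone comes to about 6.3 million weights, and roughly three quarters of them sit in the final 3x3x1024 layer.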
EDIT: These networks generally aren't designed to train fast; they're designed to run fast at test time, where it matters.