EfficientDet outputs classes and bounding boxes. My question is about both but specifically I am interested in the class prediction net part. In the paper's diagram it shows 2 conv layers. I don't understand the code and how it works. And what's the difference between the 2 conv layers of classification and box prediction?
1 Answers
The classification head works as follows.
After the stack of BiFPN
we have a feature map of size B x C x H x W
.
For EfficientDet H
and W
are 1/8
of the input image size.
Then for each pixel in this feature map one applies one convolution to get the bounding boxes. The model predicts n_anchors
- rescaled and shifted versions of reference boxes. The number of output convolution channels is n_anchors x 4
. 4 channels are for the location and scale of each box.
Another convolution predicts the probabilities of a class for each particular location on the grid. The number of output convolution channels is the number of pixels and the outputs are just the logits as for the classification problem.
You have not mentioned the code you're looking at. But I recommend to have a look at rwightman's implementation (if you are familiar with PyTorch).

- 2,454
- 8
- 16