In computer vision, what is panoptic segmentation about? How does it relate to semantic segmentation, and how does it compare to instance segmentation?
1 Answer
Panoptic Segmentation (PS) is a computer vision (CV) task that aims at unifying both semantic and instance segmentation.
- Semantic segmentation: each pixel is labeled with one of $C$ categories, giving a dense prediction over the image. The output has the same resolution as the input, but contains only the predicted category of each pixel. For a specific category $c$, the pixels labeled $c$ form a segmentation mask that ideally covers the area occupied by all objects of that category, with no distinction between individual objects. The categories often include stuff: amorphous regions of similar texture and/or material, like roads and walls. Popular datasets are PASCAL VOC, COCO, ADE20K, and Cityscapes. Segmentation models are usually evaluated with the mean IoU (mIoU), i.e. the intersection over union between predicted and ground-truth masks, averaged over categories. Unlike object detection, each pixel receives exactly one label, so predictions cannot overlap and there is no need for non-maximum suppression to reject duplicate predictions. Common neural network architectures are based on FCN, U-Net, and DeepLab.
- Instance segmentation: separates the pixels of a category into the individual object instances of that class. Compared to object detection, it delineates each object with a pixel-accurate mask instead of a rectangular bounding box. Each detected instance of a category is assigned its own ID. Mask R-CNN is a notable approach that builds on the Faster R-CNN object detection model, adding a segmentation branch to the per-region network that predicts the mask of each instance. The same method can also be extended to pose estimation. Instance segmentation is concerned with things, i.e. countable objects like cars and people.
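To make the IoU-based evaluation of semantic segmentation concrete, here is a minimal sketch of a mean IoU computation on two tiny label maps. It uses plain Python lists to stay self-contained; the function name `mean_iou` and the toy class ids (0 = road, 1 = car) are assumptions for illustration, and real benchmarks typically accumulate the counts over a whole dataset before averaging.

```python
def mean_iou(pred, gt, num_classes):
    """Mean IoU over the classes that appear in the prediction or the ground truth."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                # Pixel counts once for both intersection and union of class p.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel belongs to the union of both the predicted and the true class.
                union[p] += 1
                union[g] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x3 label maps (hypothetical ids: 0 = road, 1 = car).
gt = [[0, 0, 1],
      [0, 1, 1]]
pred = [[0, 0, 1],
        [0, 0, 1]]
score = mean_iou(pred, gt, num_classes=2)
print(round(score, 4))  # class 0: 3/4, class 1: 2/3
```

Note that the two car pixels in the right column could belong to one car or to two: a semantic label map cannot tell, which is exactly the gap instance segmentation fills.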
Panoptic Segmentation
Panoptic segmentation covers both stuff and things categories, and also distinguishes the instances of the latter. The categories are divided into two groups: stuff (uncountable and/or amorphous regions, like the sky, walls, and road), and things (countable objects, like cars and people). Each pixel is assigned a category, and each pixel predicted as a thing is also assigned an instance ID. Ambiguous or out-of-class pixels are assigned a special void label. The Panoptic FPN builds on Mask R-CNN: on top of a shared Feature Pyramid Network (FPN) backbone, it keeps the Mask R-CNN branch that detects things instances and predicts their masks, and adds a lightweight fully convolutional branch that predicts the semantic segmentation. The FPN was not introduced by Panoptic FPN: it was originally proposed for object detection, as a backbone for detectors such as Faster R-CNN and Mask R-CNN. It extracts features at multiple levels (resolutions), which Panoptic FPN rescales and merges to provide enough information for the semantic segmentation predictions.
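As an illustration of the output format only (not of how Panoptic FPN merges its branches), the following sketch combines a per-pixel category map and a per-pixel instance map into panoptic labels. The names `VOID`, `THING_CLASSES`, and `to_panoptic`, as well as the toy class ids, are hypothetical choices made for this example.

```python
VOID = 255           # hypothetical id for ambiguous / out-of-class pixels
THING_CLASSES = {1}  # hypothetical: class 1 = car (a "thing"); class 0 = road ("stuff")


def to_panoptic(semantic, instance):
    """Combine category ids and instance ids into (category, instance_id) pairs."""
    panoptic = []
    for sem_row, inst_row in zip(semantic, instance):
        row = []
        for category, inst_id in zip(sem_row, inst_row):
            if category == VOID:
                row.append((VOID, 0))           # void pixels carry no instance
            elif category in THING_CLASSES:
                row.append((category, inst_id)) # things keep their instance id
            else:
                row.append((category, 0))       # stuff has no instances
        panoptic.append(row)
    return panoptic


semantic = [[0, 1, 1],
            [0, 1, VOID]]
instance = [[0, 1, 2],
            [0, 1, 0]]
panoptic = to_panoptic(semantic, instance)

# Distinct non-void segments: one road region and two separate cars.
segments = {label for row in panoptic for label in row if label[0] != VOID}
print(sorted(segments))
```

Every pixel ends up with exactly one (category, instance) label, so panoptic segments never overlap, unlike the masks produced by a typical instance segmentation model.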
Panoptic segmentation predictions are evaluated with a single unified metric, the Panoptic Quality (PQ), which involves two steps:
- Segment matching: pairs the predicted segments with the ground-truth (GT) segments. A match occurs only if the IoU between the prediction and the GT is strictly greater than 0.5. Since segments do not overlap, this threshold guarantees that each GT segment has at most one matching prediction.
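- PQ computation: matched pairs are the true positives (TP), unmatched predicted segments are the false positives (FP), and unmatched GT segments are the false negatives (FN). These sets are used in the following formula, which is computed per category (and then averaged over categories) and is not evaluated for void pixels: $$PQ = \frac{\sum_{(p,g)\in TP}IoU(p,g)}{|TP|+\frac12|FP|+\frac12|FN|}$$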
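The two steps above can be sketched for a single category as follows. This is a simplified illustration, not a reference implementation: segments are represented as sets of pixel indices, void handling is assumed to have happened beforehand, and the function names are hypothetical. The sketch also returns the two factors PQ can be split into, a segmentation quality (average IoU of the matches) and a recognition quality (an F1-like score), whose product equals PQ.

```python
def iou(a, b):
    """Intersection over union of two segments given as sets of pixel indices."""
    return len(a & b) / len(a | b)


def panoptic_quality(pred_segments, gt_segments, threshold=0.5):
    """Return (PQ, SQ, RQ) for one category.

    Assumes non-empty, non-overlapping segments within each list and that
    void pixels were already removed (assumptions of this sketch).
    """
    matched_pred = set()
    iou_sum = 0.0
    for g in gt_segments:
        for idx, p in enumerate(pred_segments):
            if idx in matched_pred:
                continue
            score = iou(p, g)
            if score > threshold:  # strict inequality keeps the match unique
                matched_pred.add(idx)
                iou_sum += score
                break
    tp = len(matched_pred)
    fp = len(pred_segments) - tp  # predictions without a GT match
    fn = len(gt_segments) - tp    # GT segments without a prediction
    denom = tp + 0.5 * fp + 0.5 * fn
    pq = iou_sum / denom if denom > 0 else 0.0
    sq = iou_sum / tp if tp > 0 else 0.0
    rq = tp / denom if denom > 0 else 0.0
    return pq, sq, rq


gt_segments = [{0, 1, 2, 3}, {4, 5, 6, 7}]
pred_segments = [{0, 1, 2}, {6, 7, 8}, {9}]
pq, sq, rq = panoptic_quality(pred_segments, gt_segments)
print(pq, sq, rq)
```

In this toy case only the first prediction matches (IoU 3/4), the second falls below the threshold (IoU 2/5), and the third overlaps nothing, giving |TP| = 1, |FP| = 2, |FN| = 1.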
To train a panoptic segmentation model, it is possible to reuse datasets that provide both semantic and instance segmentation annotations, like Cityscapes, ADE20K, and Mapillary Vistas.

- Are you sure that this Panoptic FPN introduces the idea of "Feature Pyramid Network"? Wasn't this already present in some previous model? It's been many months since I've done anything related to computer vision but I remember these FPNs and they were not introduced, as far as I remember, with this Panoptic FPN (because I don't remember this Panoptic FPN). – nbro May 25 '23 at 22:41
- @nbro You're totally right, thanks for pointing that out! FPN were first introduced in Faster R-CNN. The Panoptic FPN "reuses" that concept. I'll edit that and provide the reference to Faster R-CNN, thanks again. – Luca Anzalone May 26 '23 at 06:51