
I have the following question about the You Only Look Once (YOLO) algorithm for object detection.

I have to develop a neural network to recognize web components in web applications - for example, login forms, text boxes, and so on. In this context, I have to consider that the position of the objects on the page may vary, for example, when you scroll up or down.

The question is: would YOLO be able to detect objects in "different" positions? Would those changes affect the recognition precision? In other words, how can translation invariance be achieved? Also, what about partial occlusions?

My guess is that it depends on the relevance of the examples in the dataset: if enough translated / partially occluded examples are present, it should work fine.

If possible, I would appreciate papers or references on this matter.

(PS: if anyone knows about a labeled dataset for this task, I would really be grateful if you let me know.)

nbro
giada
  • Think of any classification and localization task: objects are never at the same position, but the network is still able to find them. Why should it be different for your task? :) – Jérémy Blain Aug 20 '18 at 13:57
  • You can check this playlist for the needed information: https://www.youtube.com/playlist?list=PLKHYJbyeQ1a3tMm-Wm6YLRzfW1UmwdUIN – Amr Khaled Apr 22 '21 at 13:23
  • Some authors distinguish translational invariance (the network's output isn't affected by the object appearing in a different part of the image, as in image classification) from translational equivariance (the identified bounding box moves along with the moved image). – NikoNyrh Jan 17 '22 at 18:43

3 Answers


As far as I know, the YOLO algorithm splits the whole picture into a grid of small cells and performs classification and bounding-box prediction for every cell at once, so the location of the object does not matter.
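To make that concrete, here is a minimal sketch (assuming PyTorch and YOLOv1-style shapes; the image size and object coordinates are hypothetical) of how the output grid is organised and how an object's centre maps to a cell:

```python
import torch

# A minimal sketch (assuming PyTorch) of a YOLOv1-style output: an S x S grid
# where every cell predicts B boxes (x, y, w, h, confidence) and C class
# probabilities, so an object is handled by whichever cell its centre falls in.
S, B, C = 7, 2, 20
prediction = torch.rand(S, S, B * 5 + C)   # stand-in for the network output

# Hypothetical object centred at (x, y) in a 448x448 input image:
x, y = 300, 120
cell = prediction[y * S // 448, x * S // 448]   # the responsible grid cell
boxes, class_probs = cell[:B * 5], cell[B * 5:]
print(boxes.shape, class_probs.shape)           # torch.Size([10]) torch.Size([20])
```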

Shayan Shafiq

As you said, a CNN will be able to detect objects in different positions if the dataset contains enough examples of such cases. The network can also generalize, so it should detect objects in slightly changed positions and orientations.

The term "translation invariance" does not mean that translating an object within the image yields the same output for that object, but that translating the whole image yields the same result. So the relative position of an object IS important: modern CNNs make decisions based on the whole image (with strong local cues, of course).
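As a quick illustration, here is a minimal sketch (assuming PyTorch) showing that a convolution is translation-equivariant: shifting the input shifts the feature map by the same amount, which is what lets a convolutional detector find the same object at a new position:

```python
import torch
import torch.nn as nn

# Minimal sketch (assuming PyTorch): a convolution is translation-equivariant,
# i.e. shifting the input shifts the feature map by the same amount.
torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 16, 16)
x[0, 0, 4, 4] = 1.0                                     # a single "object" pixel
x_shifted = torch.roll(x, shifts=(3, 3), dims=(2, 3))   # translate it

y = conv(x)
y_shifted = conv(x_shifted)

# The response moves with the object (exactly, as long as it stays away from the border):
print(torch.allclose(torch.roll(y, shifts=(3, 3), dims=(2, 3)), y_shifted, atol=1e-6))
```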

To maximize the ability of your CNN to detect multiple orientations, you can train with data augmentation that rotates (and translates) the images.
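For example, a minimal sketch assuming torchvision ("screenshot.png" is just a placeholder file name; for detection you would also need box-aware transforms, e.g. albumentations or torchvision's transforms.v2, so the bounding boxes move with the image):

```python
import torchvision.transforms as T
from PIL import Image

# Minimal sketch (assuming torchvision): geometric and photometric augmentation.
augment = T.Compose([
    T.RandomAffine(degrees=10, translate=(0.1, 0.3)),  # small rotations plus horizontal/vertical shifts
    T.ColorJitter(brightness=0.3, saturation=0.3),     # exposure/saturation jitter, as in YOLO
    T.ToTensor(),
])

img = Image.open("screenshot.png").convert("RGB")      # placeholder input image
augmented = augment(img)
```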

The same reasoning can be applied to partial occlusions: if there are enough samples with occlusions in the training set, the network should be able to detect those objects too. The network's ability to generalize should also help a little when occlusions are small, so it can still detect the object.
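A simple way to simulate this during training is a cutout-style augmentation; a minimal sketch assuming torchvision:

```python
import torch
import torchvision.transforms as T

# Minimal sketch (assuming torchvision): erase a random rectangle of the image
# during training to simulate partial occlusions (cutout-style augmentation).
occlude = T.RandomErasing(p=0.5, scale=(0.02, 0.2), value=0.5)

img_tensor = torch.rand(3, 448, 448)   # stand-in for a preprocessed screenshot
occluded = occlude(img_tensor)
```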

Some papers have run experiments to demonstrate robustness to occlusion and translation, for instance by looking at the network activations while artificially occluding a portion of the image with a gray rectangle (the occlusion-sensitivity experiments in Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks", are one example).
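Such an experiment is easy to reproduce; a minimal sketch assuming PyTorch and a trained classifier called `model` (a hypothetical name) that returns class scores:

```python
import torch

# Minimal sketch (assuming PyTorch): slide a gray patch over the image and record
# how the score for the target class changes (occlusion sensitivity).
def occlusion_sensitivity(model, img, target_class, patch=32, stride=16, fill=0.5):
    model.eval()
    _, h, w = img.shape
    heatmap = torch.zeros((h - patch) // stride + 1, (w - patch) // stride + 1)
    with torch.no_grad():
        for i, y in enumerate(range(0, h - patch + 1, stride)):
            for j, x in enumerate(range(0, w - patch + 1, stride)):
                occluded = img.clone()
                occluded[:, y:y + patch, x:x + patch] = fill   # gray rectangle
                scores = model(occluded.unsqueeze(0))
                heatmap[i, j] = scores[0, target_class]
    return heatmap  # low values = regions the prediction depends on
```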

Louis Lac

This is a good question because, unlike Faster R-CNN's RPN, YOLO's "proposals" arise from a fully connected layer instead of a convolutional one.

The parameters behind each position of the 7x7 feature map in the 7x7x30 YOLO detection layer are not shared, which implies that an image would have to be translated so that each of its objects is presented at each of the 7x7 positions to ensure translation invariance.

This is related to YOLO's extensive requirement for data augmentation, including translations, scaling, exposure and saturation, whereas the R-CNN family of algorithms just uses random horizontal flips.

This was changed in Darknet-19 (YOLOv2), in which the last layer is convolutional when the network is modified for detection.
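To illustrate the difference, a minimal sketch assuming PyTorch (the layer sizes are illustrative, not the exact original architectures):

```python
import torch.nn as nn

# YOLOv1-style head: fully connected, so each of the 7x7 grid positions gets its
# own weights (no parameter sharing across positions).
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(1024 * 7 * 7, 4096),
    nn.LeakyReLU(0.1),
    nn.Linear(4096, 7 * 7 * 30),    # 30 = 2 boxes * 5 + 20 classes per cell
)

# YOLOv2-style head: a 1x1 convolution, so the same weights are applied at every
# grid position (translation-equivariant by construction).
conv_head = nn.Conv2d(1024, 5 * (5 + 20), kernel_size=1)  # 5 anchors * (4 box + 1 objectness + 20 classes)
```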