I have a question about the You Only Look Once (YOLO) algorithm for object detection.
I need to develop a neural network to recognize web components in web applications - for example, login forms, text boxes, and so on. In this context, the position of an object on the page can vary, for example when the user scrolls up or down.
My question: can YOLO detect objects that appear at different positions, and do such positional changes hurt recognition precision? In other words, how can translation invariance be achieved? And how does YOLO cope with partial occlusions?
My guess is that it depends on the coverage of the training set: if it contains enough translated and partially occluded examples, detection should work fine - for instance, by generating such examples through data augmentation, as in the sketch below.
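To make my guess concrete, here is a rough sketch of the kind of augmentation I have in mind (this assumes the albumentations library; the dummy image, the `login_form` label, and all parameter values are placeholders I made up):

```python
import numpy as np
import albumentations as A

# Dummy screenshot and one YOLO-format box (x_center, y_center, w, h, normalized).
image = np.zeros((480, 640, 3), dtype=np.uint8)
bboxes = [(0.5, 0.3, 0.4, 0.1)]          # e.g. a login form
class_labels = ["login_form"]

transform = A.Compose(
    [
        # Random horizontal/vertical shifts simulate scrolling the page.
        A.ShiftScaleRotate(shift_limit=0.2, scale_limit=0.0, rotate_limit=0, p=0.7),
        # Random rectangular dropout patches simulate partial occlusions.
        A.CoarseDropout(max_holes=4, max_height=48, max_width=48, p=0.5),
    ],
    # Boxes are shifted along with the image; boxes that end up mostly
    # hidden (visibility below 30%) are dropped from the labels.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"],
                             min_visibility=0.3),
)

augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)
aug_image, aug_bboxes = augmented["image"], augmented["bboxes"]
```

The idea is that, over many epochs, the network would then see each web component at varying positions and with random patches missing. Is this the right approach, or is it insufficient?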
If possible, I would appreciate papers or references on this matter.
(PS: if anyone knows of a labeled dataset for this task, I would be really grateful if you could let me know.)