8

I'm trying to detect the visual attention area in a given image and crop the image into that area. For instance, given an image of any size and a rectangle of say $L \times W$ dimension as an input, I would like to crop the image to the most important visual attention area.

What are the state-of-the-art approaches for doing that?

(By the way, do you know of any tools to implement that? Any piece of code or algorithm would really help.)

Note that I want to find the attention area within a "single" object, so object detection might not be the best fit. I am open to any approach, provided it is state of the art, though deep learning might be the better choice.

nbro
Tina J

3 Answers

2

You can search for the following paper titles:

  1. A Deep Multi-Level Network for Saliency Prediction.
  2. Beyond Universal Saliency: Personalized Saliency Prediction with Multi-task CNN.

You can implement these in Python using the PyTorch framework.
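Saliency models of this kind output a per-pixel saliency map rather than a crop, so one more step is needed to get the $L \times W$ rectangle asked for in the question. A minimal sketch of that step, assuming the saliency map is already available as a 2D NumPy array (the function name `best_crop_window` and the toy map are hypothetical, and picking the window with the largest total saliency is only one plausible criterion):

```python
import numpy as np

def best_crop_window(saliency, crop_h, crop_w):
    """Return (top, left) of the crop_h x crop_w window with the largest saliency sum."""
    h, w = saliency.shape
    if not (1 <= crop_h <= h and 1 <= crop_w <= w):
        raise ValueError("crop size must fit inside the saliency map")
    # Integral image with a zero row and column prepended, so that the sum
    # of any window can be read off with four lookups.
    integral = np.zeros((h + 1, w + 1), dtype=np.float64)
    integral[1:, 1:] = saliency.cumsum(axis=0).cumsum(axis=1)
    # sums[t, l] is the total saliency of the window whose top-left pixel is (t, l).
    sums = (integral[crop_h:, crop_w:]
            - integral[:-crop_h, crop_w:]
            - integral[crop_h:, :-crop_w]
            + integral[:-crop_h, :-crop_w])
    top, left = np.unravel_index(np.argmax(sums), sums.shape)
    return int(top), int(left)

# Toy saliency map (assumption: in practice this comes from a trained model).
demo_saliency = np.zeros((6, 8))
demo_saliency[2:4, 5:7] = 1.0
demo_top, demo_left = best_crop_window(demo_saliency, 2, 2)
demo_crop = demo_saliency[demo_top:demo_top + 2, demo_left:demo_left + 2]
```

The same `(top, left)` offsets can then be applied to the original image, provided the saliency map has the same resolution as the image or the offsets are rescaled accordingly.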

varsh
0

"Attention" in a (visual) neural network is the area of the image where the network finds the most features for classifying the image with high confidence. Based on your description, you are talking about "soft attention".

Do we have any tools or SDK to implement that? I don't think there are ready-made SDKs available. It is much better to train a model with attention on your own dataset. Once you have your base model ready, it is easy to add an attention mechanism to it. I suggest you check https://arxiv.org/pdf/1502.03044.pdf.
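To make "soft attention" concrete, here is a minimal NumPy sketch of its core computation: a softmax over per-location scores, followed by a weighted average of the feature vectors. The names `soft_attention`, `features`, and `scores` are hypothetical; in a real model the features come from a CNN and the scores from a small learned network, both of which are replaced by toy values here.

```python
import numpy as np

def soft_attention(features, scores):
    """Soft attention over spatial locations.

    features: array of shape (n_locations, d), one feature vector per location.
    scores:   array of shape (n_locations,), unnormalized relevance scores.
    Returns (weights, context): weights sum to 1, context has shape (d,).
    """
    shifted = scores - scores.max()          # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    weights = exp_scores / exp_scores.sum()  # softmax over locations
    context = weights @ features             # attention-weighted average feature
    return weights, context

# Toy example on a 4 x 4 grid of locations (assumption: values are made up).
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(16, 8))
demo_scores = np.zeros(16)
demo_scores[5] = 4.0                         # location 5 gets the highest score
demo_weights, demo_context = soft_attention(demo_features, demo_scores)
demo_attention_map = demo_weights.reshape(4, 4)
```

Reshaping the weights back to the spatial grid, as in `demo_attention_map`, gives a coarse attention map that can be upsampled to the image size and then used to choose a crop.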

0

To get a computer to detect and provide the bounding box or circle around a visual attention area in an image, the basis for attention must be determined. Then the method of getting the computer system to make choices based on that basis can be selected. First things first.

Is it a face or body or game character that is to be the object of interest? Will it be the most dynamic object in the frame in terms of movement? If it is a person, is it always the same person? In either case, will their face be exposed to the angle of the camera? Are there only still shots, or will the images be frames in a movie?

Once you know how YOU would distinguish the object requiring attention from other objects and the background, you can begin to see how a computer might simulate that recognition. When training a deep network that involves convolution kernels (a convolutional neural network, or CNN) and possibly long short-term memory (LSTM) cells, there are stages to the recognition.

Usually, edges are detected first. In movies, the movement of edges is tracked as a feature of the image. Elements that identify what kind of object each object is come second. For instance, a toy might be detected by the way plastic reflects light and by the colors and shapes common to toys. A face might first be recognized by identifying eyes, nose, mouth, chin, and ears.

After parts are identified, then entire objects can be identified through another stage of feature extraction. Vision systems follow the same basic principles of recognition that our human visual system uses.

There are many frameworks and libraries to help with these tasks, but to use them, it is important to get a general picture of the process and to clarify what sets the objects of importance apart from other objects, which may be similar or completely different, so that attention can be focused the way you want.

Once you have $(h_{min}, v_{min}); (h_{max}, v_{max})$, the coordinates of the two corners of your cropping operation, which would be the goal of your network training, then any image manipulation library could handle the crop.
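As a small illustration of that final step, here is a sketch of the crop using plain NumPy slicing. It assumes the image is an array of shape (height, width, channels), that $h$ is the horizontal (column) coordinate and $v$ the vertical (row) coordinate, and that the maximum corner is exclusive; the function name `crop_to_box` is hypothetical.

```python
import numpy as np

def crop_to_box(image, h_min, v_min, h_max, v_max):
    """Crop image to the box with corners (h_min, v_min) and (h_max, v_max)."""
    height, width = image.shape[:2]
    # Order the corners and clamp them to the image bounds.
    left = max(0, min(h_min, h_max))
    top = max(0, min(v_min, v_max))
    right = min(width, max(h_min, h_max))
    bottom = min(height, max(v_min, v_max))
    if right <= left or bottom <= top:
        raise ValueError("box does not overlap the image")
    # Rows are indexed by v (vertical), columns by h (horizontal).
    return image[top:bottom, left:right]

# Toy 5 x 6 RGB image (assumption: a real image would be loaded from disk).
demo_image = np.arange(5 * 6 * 3).reshape(5, 6, 3)
demo_box_crop = crop_to_box(demo_image, 1, 2, 4, 5)
```

With Pillow, the equivalent call is `Image.crop((left, upper, right, lower))`, which takes the same four values as a single tuple.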

That's the state of the art. There is no high-level SDK that lets you command the computer to find the most important item in the frame without first clarifying what that means and then training the software to find what you have decided is important based on some criteria. Not yet anyway.

Douglas Daseeco