To get a computer to detect and draw a bounding box or circle around the visual attention area in an image, you first have to decide what the basis for attention is. Only then can you choose a method for getting the computer to make choices based on that basis. First things first.
Is it a face, a body, or a game character that is to be the object of interest? Will it be the most dynamic object in the frame in terms of movement? If it is a person, is it always the same person? In either case, will the face be visible from the camera's angle? Are there only still shots, or will the images be frames in a movie?
Once you know how YOU would distinguish the object requiring attention from other objects and from the background, you can begin to see how a computer might simulate that recognition. When you train a deep network built from convolution kernels (a CNN, or convolutional neural network), possibly combined with long short-term memory cells (LSTMs), the recognition happens in stages.
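To make that concrete, here is a minimal sketch of such a network, assuming PyTorch; the layer sizes, the 128x128 input, and the `BoxRegressor` name are illustrative choices, not a recipe for any particular dataset.

```python
# Minimal sketch of a CNN that regresses a bounding box (assumes PyTorch).
# Layer sizes and the fake batch below are placeholders for illustration only.
import torch
import torch.nn as nn

class BoxRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layers tend to respond to edges
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # later layers combine edges into parts
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # and parts into whole objects
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 4)  # outputs (h_min, v_min, h_max, v_max)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.head(x)

# One training step against known boxes, using a smooth L1 loss:
model = BoxRegressor()
images = torch.rand(8, 3, 128, 128)   # fake batch of RGB images
targets = torch.rand(8, 4)            # normalized corner coordinates
loss = nn.functional.smooth_l1_loss(model(images), targets)
loss.backward()
```

The head regresses exactly the four corner coordinates discussed further down; everything before it is the staged feature extraction described next.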
Usually the edges of things are detected first. In movies, the movement of those edges is tracked as a feature of the image. The second stage picks out elements that indicate what kind of object each object is. For instance, a toy might be detected by the way plastic reflects light and by the colors and shapes common to toys. A face might first be recognized by identifying eyes, nose, mouth, chin, and ears.
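As a rough illustration of that first, edge-oriented stage, assuming OpenCV is installed and `photo.jpg` stands in for one of your images:

```python
# Edge detection as a first stage, using OpenCV's Canny detector.
# "photo.jpg" and the two thresholds (100, 200) are placeholders to tune
# for your own images.
import cv2

gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 100, 200)
cv2.imwrite("edges.png", edges)
```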
After parts are identified, entire objects can be identified through another stage of feature extraction. Vision systems follow the same basic principles of recognition that our own visual system uses.
There are many frameworks and libraries to help with these tasks, but to use them it is important to get a general picture of the process and to clarify what sets the objects of importance apart from other objects, whether those are similar or completely different, so that attention can be focused the way you want.
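For example, if the object of interest is a frontal face, OpenCV ships a pretrained Haar cascade; a sketch of using it to get a face bounding box (assuming the `opencv-python` package and a placeholder `photo.jpg`) might look like this:

```python
# Face bounding boxes from OpenCV's pretrained frontal-face Haar cascade.
# "photo.jpg" is a placeholder; scaleFactor and minNeighbors are common
# starting values, not tuned settings.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
image = cv2.imread("photo.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)  # draw the box

cv2.imwrite("faces.png", image)
```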
Once you have $(h_{min}, v_{min})$ and $(h_{max}, v_{max})$, the coordinates of the two corners of your cropping region, which is what the network training aims to produce, any image manipulation library can handle the crop itself.
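With Pillow, for instance, the crop is a single call; the four coordinates below are placeholders standing in for whatever your trained network predicts:

```python
# Cropping to the predicted attention area with Pillow.
# The four coordinates are placeholder values.
from PIL import Image

h_min, v_min, h_max, v_max = 40, 60, 280, 320
image = Image.open("photo.jpg")
crop = image.crop((h_min, v_min, h_max, v_max))  # (left, upper, right, lower)
crop.save("attention_area.png")
```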
That's the state of the art. There is no high-level SDK that lets you simply command the computer to find the most important item in the frame without clarifying what you mean by that, and without training operations to teach the software to find what you've decided is important based on some criteria. Not yet, anyway.