In Section 3.1 of this paper, the authors state:
Scaling the filter instead of the image allows the generation of saliency maps of the same size and resolution as the input image.
How is this possible?
From what I have understood, filtering the image works much like a convolution operation: the filter slides over the image, and each position produces one output value.
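To make this concrete, my assumption is standard "valid" convolution: an $n \times n$ image convolved with a $k \times k$ filter (stride 1, no padding) yields an output of size

$$(n - k + 1) \times (n - k + 1).$$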
However, if this is true, shouldn't we get differently sized outputs (i.e., saliency maps) for different filter sizes?
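Here is a minimal sketch of what I mean, using `scipy.signal.convolve2d` as a stand-in for the paper's filtering step (the box filter and the sizes are just placeholders I picked):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(64, 64)  # placeholder for the input image

for k in (3, 9, 21):  # three different filter scales
    kernel = np.ones((k, k)) / k**2  # simple box filter at scale k
    out_valid = convolve2d(image, kernel, mode="valid")  # no padding
    out_same = convolve2d(image, kernel, mode="same")    # zero-padded to input size
    print(f"k={k}: valid -> {out_valid.shape}, same -> {out_same.shape}")
```

Running this, `mode="valid"` gives shrinking outputs (62×62, 56×56, 44×44), exactly as I would expect, while `mode="same"` always returns 64×64. I don't know which of these corresponds to what the authors actually do.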
I think I am misunderstanding how the filtering process here really works, and how it differs from the convolutions in a CNN. I would greatly appreciate any insight on the above.
Note: This is a follow-up to this question.