
Are there any examples of people performing multiple convolutions at a single depth and then aggregating the feature maps as a convex combination, as a form of "dynamic convolutions"?

To be more precise: Say you have an input x, and you generate

import torch
import torch.nn as nn

# conv_1, conv_2, conv_3 are separate convolution layers applied at the same depth
y_1 = conv_1(x)
y_2 = conv_2(x)
y_3 = conv_3(x)

y = torch.stack([y_1, y_2, y_3], dim=-1)             # (N, C, H, W, 3)
weights = nn.Parameter(torch.rand(1, 3))
weights_normalized = torch.softmax(weights, dim=-1)  # convex combination over the 3 branches
attended_features = torch.matmul(y, weights_normalized.t()).squeeze(-1)

So, essentially, you are learning a weighting of the feature maps through this averaging procedure.

Some of you may be familiar with the "Dynamic Convolutions" paper. I'm just curious whether you would consider this dynamic convolution, or attention over feature maps. Have you seen it before?

If the code isn't clear: this is just taking a learned convex combination of the convolutional feature maps.

  • Is the paper you're referring to [Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference](https://arxiv.org/abs/1912.03203)? – nbro Mar 09 '22 at 09:53
  • @nbro exactly the paper I am referencing but they are aggregating on the weights of the conv layer. – ADA Mar 09 '22 at 18:26

2 Answers


I wouldn't call it either attention or dynamic convolution.

The reason is that everything here is static. If each conv_i(x) is a standard convolution, that implies static kernels, so nothing fancy is going on: it's just a classic multichannel CNN, and adding 3 learnable parameters is basically just adding a linear layer (not a dense one) on top of those features. At inference time one of those parameters will be higher than the others, and the feature maps coming from the convolution associated with that parameter will dominate the others, which doesn't really sound like attention.
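
To make the "everything is static" point concrete, here is a minimal sketch (channel counts and layer names below are made up for illustration): because convolution is linear and the mixing weights do not depend on the input, the three-branch convex combination collapses into a single ordinary convolution whose kernel is the weighted sum of the three kernels.

import torch
import torch.nn as nn

# three parallel convolutions with identical shapes (hypothetical sizes)
convs = nn.ModuleList([nn.Conv2d(16, 32, 3, padding=1) for _ in range(3)])
mix = torch.softmax(torch.rand(3), dim=0)  # static convex weights, no input dependence

x = torch.randn(1, 16, 8, 8)
branch_sum = sum(w * c(x) for w, c in zip(mix, convs))

# fold the same weights into one kernel and bias: identical output, a single conv
fused = nn.Conv2d(16, 32, 3, padding=1)
with torch.no_grad():
    fused.weight.copy_(sum(w * c.weight for w, c in zip(mix, convs)))
    fused.bias.copy_(sum(w * c.bias for w, c in zip(mix, convs)))
print(torch.allclose(branch_sum, fused(x), atol=1e-5))  # True

So once training is done nothing input-dependent remains: the whole construction amounts to a reparameterization of a single convolution rather than attention.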

The closest paper I can think of to what you're suggesting is: U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation.

You can see in the image below that, in their generator, the authors apply individual weights to each feature map. The crucial difference is that these weights are not just extra initialized parameters: they come from an auxiliary classifier trained precisely to generate attention masks, an idea taken from Learning Deep Features for Discriminative Localization (from which I took the second image).

[Figure: the U-GAT-IT generator architecture]

[Figure: class activation mapping, from Learning Deep Features for Discriminative Localization]
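
To illustrate just the CAM idea (the module name and shapes below are my own, and this is only the bare mechanism, not the full U-GAT-IT generator): the per-feature-map weights are read off an auxiliary classifier's weight vector instead of being free-standing parameters.

import torch
import torch.nn as nn

class CAMAttention(nn.Module):
    # sketch: per-feature-map weights come from an auxiliary classifier
    def __init__(self, channels):
        super().__init__()
        self.aux_fc = nn.Linear(channels, 1)      # auxiliary classifier head

    def forward(self, feat):                      # feat: (N, C, H, W)
        pooled = feat.mean(dim=(2, 3))            # global average pooling -> (N, C)
        logit = self.aux_fc(pooled)               # trained with a classification loss
        w = self.aux_fc.weight.view(1, -1, 1, 1)  # (1, C, 1, 1) classifier weights
        return feat * w, logit                    # reweight each feature map

attended, logit = CAMAttention(64)(torch.randn(2, 64, 16, 16))

The weighting is therefore driven by whatever the auxiliary classifier finds discriminative, which is what makes it attention rather than a fixed learned mixture.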

Edoardo Guerriero

See Dynamic Convolution: Attention over Convolution Kernels by Yinpeng Chen et al.

The convolution kernels are generated by taking a weighted average of K=4 kernels. The weights are determined non-linearly via channel attention (i.e. the "excitation" step in SE networks), using global average pooling followed by a small dense network.

[Figure 2 from Dynamic Convolution: Attention over Convolution Kernels]

The paper also discusses some "training tricks", namely limiting the space of possible attention weights via a softmax with a "temperature" of T=30 to soften the max further.
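
If it helps, here is a rough sketch of that mechanism (the module name, layer sizes and initialization are my own, not the authors' reference code): K candidate kernels are mixed per sample with attention weights computed from global average pooling, a small dense network and a temperature-softened softmax, and the aggregated kernel is then applied, here via a grouped convolution.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    # sketch: K candidate kernels, mixed per sample by input-dependent attention
    def __init__(self, in_ch, out_ch, k=3, K=4, temperature=30.0):
        super().__init__()
        self.temperature, self.padding = temperature, k // 2
        self.weight = nn.Parameter(torch.randn(K, out_ch, in_ch, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        self.attn = nn.Sequential(                     # SE-style channel attention
            nn.Linear(in_ch, in_ch // 4), nn.ReLU(),
            nn.Linear(in_ch // 4, K))

    def forward(self, x):                              # x: (N, C_in, H, W)
        pi = self.attn(x.mean(dim=(2, 3)))             # global average pooling -> (N, K)
        pi = F.softmax(pi / self.temperature, dim=-1)  # temperature softens the softmax
        n, c_in, h, w = x.shape
        # aggregate one kernel and bias per sample, then apply them as a grouped conv
        kernel = torch.einsum('nk,koihw->noihw', pi, self.weight).flatten(0, 1)
        bias = torch.einsum('nk,ko->no', pi, self.bias).flatten()
        out = F.conv2d(x.reshape(1, n * c_in, h, w), kernel, bias,
                       padding=self.padding, groups=n)
        return out.reshape(n, -1, h, w)

y = DynamicConv2d(16, 32)(torch.randn(2, 16, 8, 8))    # -> (2, 32, 8, 8)

The key difference from the construction in the question is that the mixing weights depend on the input x, so the effective kernel changes from sample to sample.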

See also: an implementation in PyTorch.

Mateen Ulhaq