Introduction
I am considering using a convolutional neural network to implement Monte Carlo control with function approximation. I am using the Monte Carlo return because it is an unbiased value estimate with nice convergence properties, although it does have high variance and converges more slowly.
The goal is to control a game, so this is an episodic reinforcement learning setting. Specifically, the game is the McKinsey plant-defense game. Below is an image of the game:
Problem specification
I have written a rough simulator of the game, along with a function that splits the board into ten $(14,14)$ layers. More specifically, the layers contain the following:

- Layer 1 is a $(14,14)$ array of zeros with a 1 at the location of the plant.
- Layers 2-4 are $(14,14)$ arrays of zeros with ones placed at coordinates containing terrain. For example, layer 2 contains only the locations of cliffs.
- Layers 5-7 are like layers 2-4 but contain the locations of defenders, e.g., the snake.
- Layers 8 and 9 are also $(14,14)$ arrays of zeros. However, if an attacker is present at $(i,j)$ in the grid, a value between zero and one is placed there, so both the remaining health and the location are represented in one layer. Layer 8 is for the foxes and layer 9 is for the groundhogs.
- The grid of the actual game varies: sometimes it is a $(10,10)$ grid, sometimes $(14,10)$, $(12,12)$, etc. As such, I cast these grids onto a $(14,14)$ grid. The last layer (layer 10) is therefore a $(14,14)$ array of zeros with ones placed in locations that do not exist; in other words, the ones indicate where I have padded the original grid. For example, a $(12,12)$ grid would have a border of ones.
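For concreteness, here is a minimal sketch of how such a state tensor could be assembled. The helper structure (dictionaries mapping a type index to coordinates) is purely illustrative, not my actual simulator code:

```python
import numpy as np

def build_state(board_size, plant_pos, terrain, defenders, attackers, grid=14):
    """Stack the ten (14, 14) layers described above into one array.

    Illustrative inputs: `terrain` and `defenders` map a type index (0-2)
    to a list of (row, col) coordinates; `attackers` maps a type index
    (0: fox, 1: groundhog) to a list of (row, col, health_fraction) tuples.
    """
    state = np.zeros((10, grid, grid), dtype=np.float32)
    rows, cols = board_size

    # Layer 1: plant location.
    state[0, plant_pos[0], plant_pos[1]] = 1.0

    # Layers 2-4: terrain types (e.g. cliffs in layer 2).
    for t, coords in terrain.items():
        for (i, j) in coords:
            state[1 + t, i, j] = 1.0

    # Layers 5-7: defender types (e.g. the snake).
    for d, coords in defenders.items():
        for (i, j) in coords:
            state[4 + d, i, j] = 1.0

    # Layers 8-9: attacker location with remaining health in (0, 1].
    for a, units in attackers.items():
        for (i, j, health) in units:
            state[7 + a, i, j] = health

    # Layer 10: padding mask -- ones wherever the real board does not exist.
    state[9, :, :] = 1.0
    state[9, :rows, :cols] = 0.0

    return state
```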
Naturally, there is spatial interaction within each layer but also interaction between layers. I am considering two approaches to a convolutional architecture. Note that I am using PyTorch, so I apologise for any PyTorch-specific terminology.
Approaches: 2D (with channels) vs 3D (no channels) convolution
Approach 1
Approach 1 uses 2D convolution with channels.
$$
y_{h,i,j} = \sum_{k=1}^{C_{in}}\sum_{l=1}^{K_H}\sum_{m=1}^{K_W} w_{h,k,l,m}\, x_{k,\,i+l-1,\,j+m-1}
$$
where $h=1,2,\dots,C_{out}$ indexes the output channels, $k=1,2,\dots,C_{in}$ indexes the input channels, $l=1,2,\dots,K_H$ and $m=1,2,\dots,K_W$ index the kernel height and width, $x$ is the stack of board layers and $y$ is the output of the convolution. In PyTorch's `Conv2d` module, if I set `groups=1`, then every output channel is computed from all input channels, as in the formula. Does this capture the interaction between channels?
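For reference, a minimal sketch of what I mean (the output channel count and kernel size are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Ten input channels (the board layers), e.g. 32 output channels, 3x3 kernel.
conv2d = nn.Conv2d(in_channels=10, out_channels=32,
                   kernel_size=3, padding=1, groups=1)

x = torch.randn(1, 10, 14, 14)   # (batch, C_in, H, W)
y = conv2d(x)

print(conv2d.weight.shape)       # torch.Size([32, 10, 3, 3]): each output
                                 # channel has a filter over all 10 input layers
print(y.shape)                   # torch.Size([1, 32, 14, 14])
```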
Approach 2
The second approach would instead treat the 10 stacked layers as a 3D volume of shape $(10,14,14)$ with a single channel. The formula for this would read:
$$
y_{i,j,k} = \sum_{m=1}^{K_D}\sum_{n=1}^{K_H}\sum_{o=1}^{K_W} w_{m,n,o}\, x_{i+m-1,\,j+n-1,\,k+o-1}
$$
where $m$, $n$ and $o$ index the kernel depth, height and width. It would seem that this captures more of the dependencies between layers? I would imagine this is the approach used in medical imaging, where the image has slices that strongly interact.
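Again, a minimal sketch (channel count and kernel size are placeholders); the input is reshaped so the ten layers become the depth dimension of a single channel:

```python
import torch
import torch.nn as nn

# One input channel; the ten board layers form the depth dimension.
conv3d = nn.Conv3d(in_channels=1, out_channels=32,
                   kernel_size=3, padding=1)

x = torch.randn(1, 1, 10, 14, 14)   # (batch, C_in, D, H, W)
y = conv3d(x)

print(conv3d.weight.shape)          # torch.Size([32, 1, 3, 3, 3]): the kernel
                                    # also slides along the layer (depth) axis
print(y.shape)                      # torch.Size([1, 32, 10, 14, 14])
```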
Question
Which approach is more suitable for modelling dependencies within layers and between layers? Could you give me some intuition as to why? Pointers to relevant literature would also be welcome. Thank you for your time.