I thought a 112x112x192 tensor convolved with 3x3x192 kernels would give 56x56x(192x192).
But this is different. How do you get from a depth of 192 in the first tensor to 256 in the second tensor?
I suppose that 3x3x192 in the Conv. Layer refers to the "kernel" size - you may think of it as a "scanner" that sweeps over the input tensor. This kernel (scanner) has a height and width of 3, and a depth of 192, matching the input depth. However, the result of one kernel application - one scan at one position - is just a single scalar number. So one kernel convolved over the whole input produces a flat result tensor with a depth of 1. In your case: 56x56x1.
So what about the depth of 256 in the output dimensions? I think this architecture uses 256 different filters, i.e. 256 different kernels/scanners. It is not stated explicitly in your question, but I would suppose that's the case, as it is an often-used technique.
That basically means the convolution is performed on your left-side tensor 256 times with 256 different kernels, and the 256 resulting 56x56x1 maps are stacked into one output tensor of depth 256.
This is used, for example, when you want to train your model to recognize many (in this case 256) different features in the input tensor.
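To make the shape arithmetic concrete, here is a minimal sketch of that stacking idea in NumPy. The `conv2d` function below is a naive (slow) illustration I wrote for this answer, not the architecture's actual implementation; it uses a small toy input rather than the full 112x112x192 tensor, but the same arithmetic (stride 2, padding 1) halves the spatial size while the number of kernels sets the output depth:

```python
import numpy as np

def conv2d(x, kernels, stride=1, pad=0):
    """Naive convolution: x is (H, W, C_in), kernels is (K, K, C_in, C_out)."""
    K = kernels.shape[0]
    c_out = kernels.shape[3]
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w, c_out))
    for f in range(c_out):  # one 2-D feature map per kernel, stacked along depth
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i*stride:i*stride+K, j*stride:j*stride+K, :]
                # one kernel application at one position -> a single scalar
                out[i, j, f] = np.sum(patch * kernels[:, :, :, f])
    return out

# Toy analogue of 112x112x192 -> 56x56x256: an 8x8x4 input with six 3x3x4
# kernels, stride 2, padding 1 gives a 4x4x6 output - depth equals the
# number of kernels, not anything about the input depth.
x = np.random.rand(8, 8, 4)
w = np.random.rand(3, 3, 4, 6)
print(conv2d(x, w, stride=2, pad=1).shape)  # (4, 4, 6)
```

The same formula applied to the real case, (112 + 2*1 - 3) // 2 + 1 = 56, gives the 56x56 spatial size, and 256 kernels give the depth of 256.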
You may find the picture below helpful; note that the depth of the result (H) is not related to the input depth - it's just the number of filters/features we want to detect.