
I am trying to classify whether or not a specific object is in panoramic photos. The issue is, a panoramic photo can be any width, so the input to my neural network can't be fixed in that dimension.

I've been using RNNs (QRNNs to be specific, as I am trying to reduce the number of parameters as much as possible), but they always learn where the object usually is in the image, and then have a really hard time classifying an image with the object in a different place.

I'm looking for something similar to CNNs, which don't have a spatial dependence (or in this case, a temporal dependence?), but without a fixed input width.

Any ideas?

  • pretty sure you can just use a bunch of Conv layers, then do something like global average pooling and a dense layer for the classification – Alberto Aug 31 '23 at 14:34
  • As I said, I don't know the input size of my network, so conv layers are out. –  Aug 31 '23 at 14:46
  • You are probably missing something about how convolution works, as it does not rely on input size at all (width- and height-wise); you can definitely train convolutional layers without any knowledge of each image's size – Alberto Aug 31 '23 at 16:14
  • CNNs cannot be trained with variable input sizes unless you use padding, which won't work in my case. –  Aug 31 '23 at 16:19
  • Again, yes, you can: convolution does not depend on the width and height of your images. Please try it and you will see that this is indeed the case, or if you want to be more theoretical, check the definition of convolution and you will see that it can indeed handle any size. – Alberto Aug 31 '23 at 17:11
  • The output size of a convolutional layer is defined as follows: o = ((i + 2p - k - (k - 1)(d - 1)) / s) + 1, where o is the output size, i is the input size, p is the padding, k is the kernel size, s is the stride and d is the dilation. I don't know who told you that convolutions don't depend on your input size, but they are simply incorrect. –  Aug 31 '23 at 17:17
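The output-size formula quoted in the last comment is easy to check numerically, and doing so reconciles both sides of this exchange: the *output size* does depend on the input size, yet the same kernel weights apply to an input of any size. A minimal sketch (the helper name `conv_output_size` is hypothetical, not part of any library):

```python
def conv_output_size(i, k, p=0, s=1, d=1):
    """Output size along one dimension: o = floor((i + 2p - k - (k-1)(d-1)) / s) + 1."""
    return (i + 2 * p - k - (k - 1) * (d - 1)) // s + 1

# A 3x3 kernel (k=3), no padding, stride 1, no dilation:
print(conv_output_size(100, 3))       # 98  -> output shrinks by 2
print(conv_output_size(57, 3))        # 55  -> a different input size works just as well
print(conv_output_size(100, 3, p=1))  # 100 -> "same" padding preserves the size
```

The kernel itself has a fixed number of weights either way; only the output's spatial extent varies with the input.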

1 Answer


Listen, this is not an answer to your question, but it seems that you are missing the whole point of convolution.

Simplified explanation: convolution is just a weighted sum of the neighbors of a pixel.

You see how this is not dependent on the size of the image?
Take a 3x3 filter, apply it to an NxM image (with no padding), and you will get an (N-2)x(M-2) image as output.

Now take a TxS image and apply the same filter over it: what you get is (T-2)x(S-2).

You see now that you can apply a convolutional layer to an image of any size?

You still don't believe me? Take this code, and you will see that you can input two images of different sizes to this neural network and it won't complain:

import tensorflow as tf

NUM_CLASSES = 10  # set this to your number of classes

network = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, activation=tf.nn.leaky_relu, padding="SAME"),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation=tf.nn.leaky_relu, padding="SAME"),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation=tf.nn.leaky_relu, padding="SAME"),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation=tf.nn.leaky_relu, padding="SAME"),
    tf.keras.layers.GlobalMaxPooling2D(),  # collapses any spatial size to a fixed-length vector
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
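The same point can be checked without TensorFlow. The sketch below, in plain NumPy, applies one fixed 3x3 filter to two images of different sizes; the helper `conv2d_valid` is hypothetical (a naive "valid"-padding convolution, so each output dimension shrinks by 2):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution: weighted sum of each pixel's kxk neighborhood."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

kernel = np.random.rand(3, 3)  # one fixed 3x3 filter: 9 weights, regardless of input size
a = conv2d_valid(np.random.rand(10, 20), kernel)
b = conv2d_valid(np.random.rand(37, 5), kernel)
print(a.shape)  # (8, 18)  -> (N-2) x (M-2)
print(b.shape)  # (35, 3)  -> same filter, different input size, no complaint
```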

you're welcome :-)

Alberto
  • You just proved me right. Applying a 3x3 kernel to a TxS image gives you a (T - 2)x(S - 2) output, which is the MATHEMATICAL DEFINITION of dependence. Regardless, this is not an answer to the question. Please stop. –  Sep 01 '23 at 01:16
  • This actually is an answer to the question (if not, it's not entirely clear what the real question is: indeed the title of the question doesn't seem to match the body of the question too clearly). @Ghull Yes, conv layers *depend* on the input size, in the sense that their *output size* depends on the input size. Bigger input size --> bigger output size. But that's great: it means they **can** handle different input sizes, which is exactly what you need. Just make sure to have something after the conv layers (like pooling) that reduces everything to a single, fixed dim (num. classes). – Dennis Soemers Sep 01 '23 at 09:49
  • @Ghull It seems somebody agrees with me: the number of parameters of a convolution does not depend on the input size, which is why it's said not to be "dependent on the input", unlike a fully connected layer, for example. You can use the code I provided to solve your problem without any issue – Alberto Sep 01 '23 at 13:43
  • @DennisSoemers thank you – Alberto Sep 01 '23 at 13:44
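As a footnote to the pooling point raised in the comments: global max pooling is what closes the loop, collapsing any spatial extent to a fixed-length vector so the final dense layer always sees the same input size. A minimal NumPy sketch (the helper name is hypothetical):

```python
import numpy as np

def global_max_pool(feature_map):
    """Collapse the spatial dims of an (H, W, C) feature map to a fixed (C,) vector."""
    return feature_map.max(axis=(0, 1))

# Feature maps from two different-sized inputs, same channel count:
v1 = global_max_pool(np.random.rand(98, 198, 32))
v2 = global_max_pool(np.random.rand(35, 3, 32))
print(v1.shape, v2.shape)  # (32,) (32,) -> fixed size either way
```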