Feeding CNN FFT of an image, a dumb idea?

Question

My dataset consists of about 40,000 grayscale images (200x200 px) of centered blobs surrounded by noise and occasional artifacts in their neighborhood, such as stripes, other blobs of different shapes and sizes, and fuzzy speckles. They are used in a binary classification problem, with an emphasis on recall.

I read that taking the FFT of the image and the FFT of the convolutional kernel and multiplying the two produces a result similar to what a convolution would, but at a much lower computational cost. [This article](https://medium.com/analytics-vidhya/fast-cnn-substitution-of-convolution-layers-with-fft-layers-a9ed3bfdc99a) is probably the most straightforward one I found, if you need a more detailed description.
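To make the claim above concrete, here is a minimal NumPy sketch of the convolution theorem. It compares a circular (wrap-around) convolution computed directly with the same result obtained by multiplying FFTs. The function names are made up for illustration. Note the assumptions: the equivalence is exact only for circular convolution, and deep learning frameworks usually compute cross-correlation rather than a flipped-kernel convolution, so this is not a drop-in replacement for a CNN layer.

```python
import numpy as np

def circular_conv2d_direct(image, kernel):
    """Circular 2-D convolution computed in the spatial domain (hypothetical helper)."""
    out = np.zeros(image.shape, dtype=float)
    kh, kw = kernel.shape
    for i in range(kh):
        for j in range(kw):
            # np.roll shifts with wrap-around, so rolled[n, m] == image[n - i, m - j]
            out += kernel[i, j] * np.roll(image, shift=(i, j), axis=(0, 1))
    return out

def circular_conv2d_fft(image, kernel):
    """Same result via the convolution theorem: multiply the two spectra."""
    f_image = np.fft.fft2(image)
    # s=image.shape zero-pads the kernel to the image size before the transform
    f_kernel = np.fft.fft2(kernel, s=image.shape)
    return np.real(np.fft.ifft2(f_image * f_kernel))

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))
print(np.allclose(circular_conv2d_direct(image, kernel),
                  circular_conv2d_fft(image, kernel)))
```

The speed advantage of the FFT route generally shows up for large kernels; for the small 3x3 kernels typical of CNNs, direct convolution is often competitive.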

What I want to do, however, is simply feed the FFT of the images to a standard CNN. The reasoning is that it might be easier for the network to pick up on features that it would otherwise miss or tend to weigh less. In other words, FFT as a feature engineering technique.

Is this idea worth pursuing? If so, any suggestions on which FFT components to extract (amplitude/phase or real/imaginary)?

You may be interested in checking out the discussions in this [Kaggle competition](https://www.kaggle.com/c/rfcx-species-audio-detection/overview). It's an audio classification challenge actually, but there the approach is to use FT to produce a spectrogram. You may get some useful pointers to help with what you're trying to do. — Alexander Soare, Jun 15 '21 at 08:50
Lots of people do this for dimensionality reduction for somewhat mixed results. — FourierFlux, Jun 14 '21 at 22:45
I don't think that is a dumb idea. [This paper](https://arxiv.org/pdf/1905.13545.pdf) fed High-frequency components of the image and found interesting results. — Minh-Long Luu, Mar 08 '23 at 02:37

score 1 · Answer 1 · answered Jun 15 '21 at 11:20

The FFT is in essence a linear transformation of the input image. It can be represented by applying convolutional filters of the same size as the image to the input, one filter per frequency component.
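As a rough illustration of this point, the sketch below builds the DFT as explicit matrix products and also computes a single frequency coefficient as an inner product between the image and one full-size complex filter. The helper names are hypothetical and exist only for this example; the check against `np.fft.fft2` is the only library behaviour relied on.

```python
import numpy as np

def dft_matrix(n):
    """n x n DFT matrix with entries exp(-2*pi*i*k*m/n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

def dft2_via_matrices(image):
    """2-D DFT written as two matrix products, which makes the linearity explicit."""
    h, w = image.shape
    return dft_matrix(h) @ image @ dft_matrix(w)

def dft_coefficient_as_filter(image, k, l):
    """One DFT coefficient as the sum of the image times an image-sized filter."""
    h, w = image.shape
    m = np.arange(h)[:, None]
    n = np.arange(w)[None, :]
    full_size_filter = np.exp(-2j * np.pi * (k * m / h + l * n / w))
    return np.sum(image * full_size_filter)

rng = np.random.default_rng(1)
image = rng.standard_normal((6, 5))
print(np.allclose(dft2_via_matrices(image), np.fft.fft2(image)))
print(np.allclose(dft_coefficient_as_filter(image, 2, 3), np.fft.fft2(image)[2, 3]))
```

Each filter here is complex valued, so a real-valued network would need two real filters (cosine and sine parts) per frequency component.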

Provided the convolutional neural network is deep enough, has a sufficient number of parameters, and has skip connections (so that there is a path of purely linear transformations on the input), the FFT can be represented by the learned filters. If the FFT of the image is relevant for the classification problem, the network would most probably learn to produce it in some form.

For image classification problems, where the goal is to identify an instance of something, local information is crucial, and such problems are better solved in the spatial domain than in the frequency domain.

However, in your case it seems that the semantics are rather trivial and the goal is to get rid of some frequencies. Hence, working in the frequency domain is a sensible option. Possibly, you could combine the spatial and frequency representations in some way.
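For illustration only, "getting rid of some frequencies" could look like the sketch below: transform the image, zero out components above a radial cutoff, and transform back. The function name, the choice of a low-pass mask, and the cutoff value are all assumptions made for this example; which frequencies actually carry the noise or stripes in the dataset would have to be checked on real images.

```python
import numpy as np

def radial_lowpass(image, cutoff):
    """Zero every frequency component whose radial frequency (cycles/pixel) exceeds cutoff."""
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies in cycles per pixel
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies in cycles per pixel
    mask = np.sqrt(fy**2 + fx**2) <= cutoff
    spectrum = np.fft.fft2(image)
    # The mask is symmetric in frequency, so the result of a real image stays real
    return np.real(np.fft.ifft2(spectrum * mask))

# Toy input: a constant background plus a pixel-level checkerboard pattern
rows, cols = np.indices((16, 16))
checkerboard = (-1.0) ** (rows + cols)
image = 3.0 + checkerboard
filtered = radial_lowpass(image, cutoff=0.25)
print(np.allclose(filtered, 3.0))
```

The filtered image could then be fed to the CNN on its own or stacked with the original image as an extra channel, which is one way to combine the two representations.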

I think it would be simpler to work with the real and imaginary parts than with the complex absolute value and phase, since with the latter you need to account for the periodicity of the phase in some way, and in the end transform the phase to $e^{i \phi}$.
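A minimal sketch of the two input encodings, assuming NumPy and channels-first arrays (both are choices made for this example, as are the function names). The amplitude/phase variant stores the phase as $\cos\phi$ and $\sin\phi$, i.e. the real and imaginary parts of $e^{i \phi}$, to avoid the jump at $\pm\pi$. The `log1p` compression of the amplitude is an extra assumption, a common way to tame the dynamic range, and is not something the answer prescribes.

```python
import numpy as np

def fft_real_imag_channels(image):
    """Two channels: real and imaginary parts of the centered 2-D spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    return np.stack([spectrum.real, spectrum.imag], axis=0)

def fft_amp_phase_channels(image):
    """Three channels: log-amplitude, cos(phase), sin(phase)."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)  # wrapped to (-pi, pi], hence the cos/sin encoding
    return np.stack([np.log1p(amplitude), np.cos(phase), np.sin(phase)], axis=0)

rng = np.random.default_rng(2)
image = rng.standard_normal((200, 200))
print(fft_real_imag_channels(image).shape)  # (2, 200, 200)
print(fft_amp_phase_channels(image).shape)  # (3, 200, 200)
```

Both encodings keep all the information in the spectrum, so the choice mainly affects how easy the input is for the network to use.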
