
I'm reading the ImageNet Classification with Deep Convolutional Neural Networks paper by Krizhevsky et al, and came across these lines in the Intro paragraph:

Their (convolutional neural networks') capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.

What's meant by "stationarity of statistics" and "locality of pixel dependencies"? Also, what's the basis of saying that CNN's theoretically best performance is only slightly worse than that of feedforward NN?

nbro
Shirish Kulhari

1 Answer


"Locality of pixel dependencies" probably means that neighboring pixels tend to be correlated, while faraway pixels usually are not. This assumption is made in several image processing techniques (e.g. filters). Of course, the size and shape of the neighborhood could vary depending on the region of the image, but, in practice, it is usually chosen to be fixed and rectangular (or square).
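To make this concrete, here is a quick sketch (my own illustration, not from the paper) that measures pixel correlation as a function of distance on a synthetic smooth image; the image, the noise level, and the shift distances are all arbitrary choices for the demonstration:

```python
import numpy as np

# Synthetic smooth image: a low-frequency pattern plus a little noise,
# so nearby pixels share structure while distant ones mostly do not.
rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 256)
img = np.sin(x)[None, :] * np.cos(x)[:, None] + 0.1 * rng.standard_normal((256, 256))

def shifted_correlation(im, dx):
    """Pearson correlation between each pixel and the pixel dx columns away."""
    a = im[:, :-dx].ravel()
    b = im[:, dx:].ravel()
    return np.corrcoef(a, b)[0, 1]

near = shifted_correlation(img, 1)    # neighboring pixels: strongly correlated
far = shifted_correlation(img, 100)   # faraway pixels: much weaker correlation

print(f"correlation at distance 1:   {near:.3f}")
print(f"correlation at distance 100: {far:.3f}")
```

On an image like this, the distance-1 correlation comes out close to 1 while the distance-100 correlation is much smaller, which is exactly the regularity a small local filter exploits.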

"Stationarity of statistics" might mean that the values of the pixels do not change over time, which would relate it to diffusion techniques in image processing. It might also mean that the pixel values do not change much within a spatial neighborhood, in which case it is related to the locality of pixel dependencies property. Note, though, that stationarity (e.g. in reinforcement learning) usually means that something does not change over time, so, if the spatial reading is the intended one, the terminology "stationarity" is at least misleading and confusing in this context. Possibly, stationarity of statistics could also indirectly mean that you can use the same filter to detect the same feature in different regions of the image.
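The last reading, that one filter is reused everywhere, is easy to demonstrate. Here is a sketch (my own illustration; the edge image and the Sobel-style kernel are arbitrary choices) where a single shared kernel fires on the same feature, a vertical edge, in two different regions of the image:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation with a single shared kernel."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Image with two vertical edges: one on the left, one on the right.
img = np.zeros((8, 16))
img[:, 4:] = 1.0   # step between columns 3 and 4
img[:, 12:] = 2.0  # step between columns 11 and 12

# A vertical-edge detector (Sobel-style kernel).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

response = conv2d_valid(img, sobel_x)
# The same kernel responds at both edge locations:
print(np.argwhere(np.abs(response[0]) > 1e-9).ravel())
```

Because the statistics are assumed stationary across the image, the network does not need a separate detector per location; sliding one kernel suffices, which is precisely the weight sharing that makes CNNs parameter-efficient.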

With "while their theoretically-best performance is likely to be only slightly worse", the authors probably meant that, in theory, CNNs are not quite as powerful as feedforward neural networks. However, both CNNs and FFNNs are universal function approximators (although, at the time, the theoretical power of CNNs had probably not yet been seriously investigated).
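The "much fewer connections and parameters" claim from the quoted passage is easy to verify with a back-of-the-envelope count. This sketch (my own illustration) uses sizes chosen to mirror AlexNet's first layer (224x224x3 input, 96 feature maps, 11x11 kernels) and compares it against a fully connected layer with the same number of output units per spatial position removed, i.e. just 96 fully connected units:

```python
# Parameter count: fully connected layer vs. convolutional layer.
H, W, C_in = 224, 224, 3  # input height, width, channels
C_out = 96                # output units / feature maps
k = 11                    # kernel size

# Fully connected: every output unit connects to every input value.
fc_params = (H * W * C_in) * C_out

# Convolutional: C_out kernels of size k x k x C_in, shared across all positions.
conv_params = (k * k * C_in) * C_out

print(f"fully connected: {fc_params:,} weights")   # 14,450,688
print(f"convolutional:   {conv_params:,} weights") # 34,848
```

The convolutional layer here has roughly 400 times fewer weights, which is what makes CNNs "easier to train" despite the restricted connectivity.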

nbro
  • I really think that in this case stationarity isn't about time but about spatial location (given that no time exists -- I'm assuming you meant over the dataset). I think they're saying that a lot of local statistics are common across multiple areas of the image (therefore a single convolved filter can be a useful featurization of multiple fields rather than just one). – mshlis Jul 31 '19 at 01:19
  • @mshlis You might be right. This was my interpretation, given the usual meaning of stationarity (e.g. in reinforcement learning). However, if your interpretation is correct, then their terminology is highly confusing or misleading. – nbro Jul 31 '19 at 01:36
  • I completely agree. Generally, stationarity when referring to images draws on the distributions of the pixels themselves, but here it seems to be used based on the dependencies, treating the moving indices of the convolution as parametrizing some form of process. – mshlis Jul 31 '19 at 02:06
  • Thanks for the answer! Could you edit the part about stationarity of statistics? I don't think there's any temporal aspect to image processing at least in the context of the question. I can then accept the answer once that part of edited. Thanks again! – Shirish Kulhari Jul 31 '19 at 09:19
  • @ShirishKulhari You can view an image or a sequence of images as a flow over time, so this interpretation is not completely wrong. Have a look at certain image processing techniques that are based on diffusion. – nbro Jul 31 '19 at 10:12
  • @nbro: Oh, my bad. I can understand that a video could definitely be interpreted as a sequence of images as a flow over time, but I'm not clear on how a single, static image can be interpreted as such. Could you elaborate on that in the answer, if possible? – Shirish Kulhari Jul 31 '19 at 10:19
  • @ShirishKulhari Essentially, a noisy image $\hat{f}$ of an original image $f$ might be viewed as one image in a sequence of images, $f, f_1, f_2, \dots, \hat{f}, \hat{f}_1, \dots$, that starts at the original image and passes through $\hat{f}$, so this sequence of images is part of a "flow". This is just an assumption or interpretation made by some image processing techniques, like anisotropic diffusion, which is used to denoise an image. If you're interested in this, have a look at the details of that technique. – nbro Jul 31 '19 at 10:21
  • @nbro: So denoising involves figuring out the transformation $f \to \hat f$ and inverting it? – Shirish Kulhari Jul 31 '19 at 10:23
  • @ShirishKulhari Yes, that's basically the idea. – nbro Jul 31 '19 at 10:24