
I uploaded a picture of graffiti on a wall to Wolfram's ImageIdentify, but it recognized it as 'monocle'. The secondary guesses were 'primate', 'hominid', and 'person', so not even close to 'graffiti' or 'painting'.

Is this by design, or are there methods to teach a convolutional neural network (CNN) to reason about the bigger-picture context (such as the graffiti mentioned above)? Currently it seems to detect literally what is depicted in the image, not what the image actually is.

[Image: Wolfram's ImageIdentify: monocle/graffiti]

This could be the same problem as mentioned here, that DNNs are:

"learning to detect jaguars by matching the unique spots on their fur while ignoring the fact that they have four legs." (2015)

If it's by design, is there some better variant of a CNN that can handle this task?

kenorb

3 Answers


You seem to want some description of the 'style' of an image.

To make that work in general, I'd guess it would actually require quite a lot of pre-processing to present 'texture elements' (rather than raw pixels) as the basic features.

This is quite speculative, but one approach might be to use Iterated Function Systems as a means of extracting these.

Whether 'spatial adjacency' (and hence a CNN) is then the best approach for making higher-level decisions about these elements is (AFAIK) a matter for experiment.
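
For illustration, here is a minimal sketch of the pre-processing idea, using local binary patterns from scikit-image as a simpler stand-in for IFS-derived texture elements (the input file name is hypothetical):

```python
# A minimal sketch of extracting 'texture elements' as features instead of
# raw pixels. Local binary patterns stand in for the IFS-based extraction
# suggested above; "graffiti.jpg" is a hypothetical input image.
import numpy as np
from skimage import io, color
from skimage.feature import local_binary_pattern

image = color.rgb2gray(io.imread("graffiti.jpg"))

# Encode each pixel's neighbourhood as a texture code (8 neighbours, radius 1).
P, R = 8, 1
lbp = local_binary_pattern(image, P, R, method="uniform")

# Summarise the image as a histogram over texture codes; this vector, rather
# than the pixel grid, would be the input to a downstream classifier.
n_bins = P + 2  # 'uniform' LBP yields P + 2 distinct codes
hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
print(hist)
```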

NietzscheanAI

Wolfram's image ID system is specifically meant to figure out what the image is depicting, not the medium.

To get what you want, you'd have to create your own system where the training data is labeled by the medium rather than the content, and probably tune it to pay more attention to texture and the like. The neural net doesn't care which of these we want; it has no inherent bias. It just knows what it's been trained for.

That's really all there is to it. It all comes down to the training labels and the focus of the system (e.g. a system that looks for edge patterns that form shapes, versus one that checks whether the lines in the image are perfectly straight and clean, as if computer-generated, or show imperfect brush strokes or spray paint).
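
For illustration only, a minimal sketch of that retraining idea in PyTorch, assuming a hypothetical dataset whose sub-directories are named by medium ('graffiti', 'oil_painting', 'photograph', ...):

```python
# A minimal sketch of a CNN trained on medium labels instead of content
# labels. The "media_dataset" directory layout and its class names are
# hypothetical.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])
# ImageFolder infers labels from sub-directory names, so labelling by medium
# is just a matter of how the training images are organised on disk.
train_set = datasets.ImageFolder("media_dataset/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, len(train_set.classes)),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:  # labels are medium classes, not content
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```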

Now, if you want me to tell you how to build that full system, I'm not the right person to ask, haha.

Avik Mohan

If I look at the image, I can kind of see a monocle as part of it. So one part of the problem is that the classifier is ignoring much of the image. This could be called a lack of "completeness", in the sense used here (a computer vision paper on image summarization).

One way to discover these sorts of failure modes is through adversarial images, which are optimized to fool a given image classifier. Building on this, the idea of adversarial training is to simultaneously train competing "machines": one trying to synthesize data, the other trying to find weaknesses in the first.
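
As a concrete illustration, here is a minimal sketch of the fast gradient sign method (FGSM), one standard way to optimize such adversarial images; `model`, `image`, and `label` are assumed to exist elsewhere (e.g. a trained classifier and a batched input):

```python
# A minimal sketch of FGSM: perturb the input in the direction that most
# increases the classifier's loss, producing an adversarial image.
import torch
import torch.nn.functional as F

def fgsm(model, image, label, epsilon=0.03):
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step each pixel along the sign of its gradient; the perturbation is
    # tiny per pixel but can be enough to flip the model's prediction.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```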

Also check this page: A path to unsupervised learning through adversarial networks, for further information about adversarial training.

GeoMatt22