
Given a video, I'm trying to classify whether it shows a graphical (computer-generated) scene or a realistic scene. For instance, if it contains computer-generated graphics, credits, moving bugs (on-screen logo overlays), a blue screen, etc., it is computer-generated graphics; if it is a real scene captured by a camera, it is a realistic scene.

How can we achieve that with AI? Are there any working solutions available?

Some examples of graphical scenes:

(example images omitted)

asked by Tina J
  • Since neural networks are very good at finding patterns, you can try to just train a neural network (e.g. a CNN) on your examples, given labels for whether each is real or CGI. If you then want to understand what your NN has learned, you can [visualize its layers](https://cs231n.github.io/understanding-cnn/) – Aray Karjauv Oct 23 '20 at 10:34
  • Should it be CNN only, or LSTM with multiple frames?! I need it to be lightweight. – Tina J Oct 23 '20 at 14:10
  • 1
    It depends. If you have a series of images (i.e. a video), you can use LSTM to improve stability. But I would suggest starting simple. To make it lightweight, you can also downscale the images to smaller resolution and make them b&w. You can also use maxpooling layers/strides in your CNN – Aray Karjauv Oct 23 '20 at 14:25
  • Umm I'm not yet much into architecting deep layers. If you could post a sample in python using Keras, it would be great. – Tina J Oct 23 '20 at 19:15
  • Why don't you try ELA (Error Level Analysis)? Using ELA you can identify edited and non-edited scenes. – Hiren Namera Oct 23 '20 at 06:56
  • I posted my answer. Feel free to ask me if anything is unclear. If the answer is helpful, don't forget to accept it. – Aray Karjauv Oct 24 '20 at 12:33

1 Answer


As per your requirements, I would suggest that you start with a simple CNN.

CNNs take advantage of the hierarchical pattern in data and assemble more complex patterns using smaller and simpler patterns. Therefore, on the scale of connectedness and complexity, CNNs are on the lower extreme.

Here is a Keras example:

from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=image_shape))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
# output layer: a single logit for binary classification
model.add(layers.Dense(1))

where image_shape is the resolution and number of channels of the images (e.g. (128, 128, 3) for RGB). I also suggest downscaling the images to a lower resolution. You will also have to crop or resize them, as they must all share the same image_shape.
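The cropping and downscaling can be done with any image library; as a rough sketch in plain NumPy (the helper names here are my own, not from any library), center-cropping a frame and block-averaging it down to 128x128 might look like:

```python
import numpy as np

def center_crop(img, size):
    """Crop the central size x size region of an H x W x C image."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

def downscale(img, factor):
    """Downscale by averaging non-overlapping factor x factor pixel blocks."""
    h, w, c = img.shape
    h2, w2 = h // factor, w // factor
    img = img[:h2 * factor, :w2 * factor]
    return img.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))

frame = np.random.rand(256, 320, 3)              # a fake 256x320 RGB video frame
prepped = downscale(center_crop(frame, 256), 2)  # -> shape (128, 128, 3)
```

In practice you would apply the same preprocessing to every frame so the whole dataset matches image_shape.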

Also take a look at the MaxPooling2D and BatchNormalization layers.

Since you only have real and CGI images, this is a binary classification problem. Therefore you can use a single output (0 = CGI, 1 = real), and such problems can be solved with the BinaryCrossentropy loss.

from tensorflow.keras import losses

model.compile(loss=losses.BinaryCrossentropy(from_logits=True), optimizer='adam')

Finally, you can fit your model:

history = model.fit(train_images, train_labels, epochs=1000, validation_data=(test_images, test_labels))

You can find a complete example here.
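One detail worth spelling out: because the last Dense layer has no activation and the loss uses from_logits=True, model.predict returns raw logits, so you need a sigmoid plus a threshold to turn them into labels. A minimal sketch in NumPy:

```python
import numpy as np

def logits_to_labels(logits, threshold=0.5):
    """Map raw logits to class labels: 1 = real, 0 = CGI (per the scheme above)."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits)))  # sigmoid
    return (probs >= threshold).astype(int)

# e.g. logits as returned by model.predict(test_images).squeeze()
logits = [-2.3, 0.1, 4.0, -0.5]
print(logits_to_labels(logits))   # [0 1 1 0]
```

Thresholding the probability at 0.5 is equivalent to checking whether the logit is positive.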

Please note that depending on your data, the model can become biased if your dataset is unbalanced. That is, if all of your CGI images have text, and only a small fraction of the real images also have text, they might be misclassified. Therefore, I recommend that you visualize your model to better understand what it has learned. Here is an example of such a problem we faced at our university.
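If the dataset is unbalanced, one standard mitigation (my addition, not from the answer above) is to pass inverse-frequency class weights to Keras's model.fit via its class_weight argument. A small sketch:

```python
import numpy as np

def balanced_class_weights(labels):
    """Inverse-frequency class weights, normalized so they average to 1."""
    labels = np.asarray(labels)
    n = len(labels)
    classes, counts = np.unique(labels, return_counts=True)
    return {int(c): n / (len(classes) * cnt) for c, cnt in zip(classes, counts)}

# 80 CGI (0) vs. 20 real (1) examples: the minority class gets a 2.5x weight
train_labels = np.array([0] * 80 + [1] * 20)
weights = balanced_class_weights(train_labels)
print(weights)   # {0: 0.625, 1: 2.5}
# model.fit(train_images, train_labels, ..., class_weight=weights)
```

This makes errors on the rare class cost more during training, which helps but does not replace collecting a more balanced dataset.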

There are also more advanced CNN architectures such as ResNet, VGG, or YOLO. You can also extend your model to time series (i.e. video) using an LSTM or GRU architecture.
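If a recurrent model is too heavy for your use case, a simpler video-level baseline (my suggestion, not part of the answer above) is to run the frame classifier on a few sampled frames and average the sigmoid probabilities:

```python
import numpy as np

def video_label_from_frames(frame_logits, threshold=0.5):
    """Aggregate per-frame CNN logits into one video-level label
    by mean-pooling the sigmoid probabilities over frames."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(frame_logits)))
    return int(probs.mean() >= threshold)

# e.g. logits from model.predict on 5 frames sampled from one video
print(video_label_from_frames([2.0, 1.5, -0.2, 3.1, 0.7]))   # 1 (real)
```

This keeps inference cheap (one CNN pass per sampled frame, no recurrent state) at the cost of ignoring temporal order.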

answered by Aray Karjauv
  • Thanks. I guess now i need to collect a few hundreds of samples for dataset. One qq: what's the rationale behind the node size 32,64,64? If I add another layer, it should go 128? – Tina J Oct 24 '20 at 17:01
  • And is there any relationship between the number of layers and the dataset size? Just some intuition? – Tina J Oct 24 '20 at 17:02
  • As far as I know, there is no fixed relationship between the number of layers and dataset size. More layers mean more hidden features. VGG networks use up to 19 layers, and ResNet over 100. – Aray Karjauv Oct 25 '20 at 11:28
  • 1
    32, 64 and 64 are the number of filters. Filters present hidden features of the images. The first layer will usually be activated by straight lines (so the layer should be small), whereas the last layers will be activated by some abstract features and colors. Basically, you decide how many filter layers to contain. – Aray Karjauv Oct 25 '20 at 11:49
  • Thanks. One last question: Do you know of any open dataset for this?! – Tina J Oct 26 '20 at 00:48
  • I haven't come across such a dataset – Aray Karjauv Oct 26 '20 at 11:02