How can I do video classification while taking into account the temporal dependencies of the frames?

Question

I need to solve a video classification problem. While looking for solutions, I only found solutions that transform this problem into a series of simpler image classification tasks. However, this method has a downside: we ignore the temporal relationship between the frames.

So, how can I perform video classification, with a CNN, by taking into account the temporal relationships between frames?

I edited this post to simplify and clarify it, and to make it more consistent with the answer that you accepted. Please, make sure that I did not change the meaning of it. If I did, you can just edit your post again and clarify what you're asking. If not, please, flag this comment for deletion. — nbro, Dec 29 '20 at 11:13

Brian O'Donnell · Accepted Answer · 2020-12-27T17:04:14.813

Look at spatio-temporal CNNs which extend the image-based CNN in 2D to 3D to handle time. These are commonly used to detect or classify action in a video. People have used them to identify specific actions in various sports such as kicking a soccer ball, throwing a baseball or dribbling a basketball. They have been used to identify fire, smoke, deep fakes, and violence in videos. Below are some articles to help get started.

Source code:

How can I do video classification while taking into account the temporal dependencies of the frames?

1 Answers1