I have a video dataset as follows.
Dataset size: 1k videos
Frames per video: 4k (average) and 8k (maximum)
Labels: Each video has one label.
So the size of my input would be (N, 8000, 64, 64, 3), where 64 × 64 is the height and width of each frame. I use Keras. I am not really sure how to do end-to-end training with this kind of dataset. I was thinking of dividing each input into blocks of frames, i.e. a shape of (N, 80, 100, 64, 64, 3) for training, but even that does not solve the problem of training the network end-to-end.
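To make the block idea concrete, here is a rough sketch of what I have in mind (the block length of 100 frames, the zero-padding to the 8000-frame maximum, and the helper names `video_to_blocks` / `BlockSequence` are just my own choices for illustration, not anything fixed):

```python
import numpy as np
from tensorflow import keras  # or `import keras` for standalone Keras

BLOCK_LEN = 100    # frames per block -- my assumption, not a requirement
MAX_FRAMES = 8000  # pad every video up to the maximum length

def video_to_blocks(frames, block_len=BLOCK_LEN, max_frames=MAX_FRAMES):
    """Zero-pad a (T, 64, 64, 3) video to max_frames and split it into
    non-overlapping blocks of shape (max_frames // block_len, block_len, 64, 64, 3)."""
    t, h, w, c = frames.shape
    padded = np.zeros((max_frames, h, w, c), dtype=frames.dtype)
    n = min(t, max_frames)
    padded[:n] = frames[:n]
    return padded.reshape(max_frames // block_len, block_len, h, w, c)

class BlockSequence(keras.utils.Sequence):
    """Yields one video at a time as (1, n_blocks, block_len, 64, 64, 3)
    so the block representation can be fed to a Keras model."""
    def __init__(self, videos, labels):
        self.videos = videos  # list of (T, 64, 64, 3) arrays
        self.labels = labels  # one label per video

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        x = video_to_blocks(self.videos[idx])[np.newaxis, ...]  # add batch dim
        y = np.array([self.labels[idx]])
        return x, y

# quick check with dummy data
dummy_video = np.random.rand(4000, 64, 64, 3).astype("float32")
print(video_to_blocks(dummy_video).shape)  # (80, 100, 64, 64, 3)
```

With this, each video becomes an (80, 100, 64, 64, 3) tensor, matching the block shape above, but I still don't see how to backpropagate through the whole video at once.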
I would rather not drop frames; that would be my last resort.
Any help will be appreciated. Thanks in advance.