
I have videos that are each about 30-40 minutes long. The first 5-10 minutes (recorded at 60 fps, which can be down-sampled to 5 fps) show one type of activity that should be categorized as label-1, and the rest of the video as label-2. I started off by using a CNN-LSTM to do this prediction (ResNet-50 + LSTM + FC classifier).

I am using PyTorch.
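Roughly, the model looks like this (a minimal sketch; the hidden size and classifier head are illustrative, not my exact settings):

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTM(nn.Module):
    """ResNet-50 encoder -> LSTM -> FC classifier (sizes are illustrative)."""
    def __init__(self, hidden_size=256, num_classes=2):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # Keep everything up to (and including) the global average pool -> 2048-d features
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.lstm = nn.LSTM(input_size=2048, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                       # x: (batch, seq_len, 3, 224, 224)
        b, t, c, h, w = x.shape
        feats = self.encoder(x.view(b * t, c, h, w)).view(b, t, -1)  # (batch, seq_len, 2048)
        out, _ = self.lstm(feats)
        return self.classifier(out[:, -1])      # classify the segment from the last time step
```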

For training, my initial approach was to treat this as an activity-classification task, so I split my videos into smaller segments, each with its own label.

video1.mp4 (5 mins) --> label-1

Split into 30 seconds -->

video1_0001.mp4 --> label-1

...

video1_0010.mp4 --> label-1

But with this strategy, the network does not learn even after 100 epochs. I can fit at most about 40 frames on the 2 GPUs, but a 30-second segment of video @5 fps has about 150 frames. Subsampling any further does not seem to capture the essence of the video segment.
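To be concrete, the further subsampling I mean is essentially uniform frame selection within a segment, roughly like this sketch (decoding the clip into a tensor is assumed to happen elsewhere):

```python
import torch

def subsample_clip(frames: torch.Tensor, num_frames: int = 40) -> torch.Tensor:
    """Uniformly pick num_frames from a clip of shape (T, C, H, W).
    For a 150-frame clip (30 s @ 5 fps) and num_frames=40, this keeps roughly every 4th frame."""
    idx = torch.linspace(0, frames.shape[0] - 1, steps=num_frames).long()
    return frames[idx]
```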

I also tried training without shuffling and with a single worker thread, so that a single stream is loaded continuously, but perhaps that is not the right strategy.

I wanted to request some help on how to tackle this problem; I would really appreciate some insights into a training strategy for it.

  • Is using a CNN-LSTM the right strategy here?

=== Update ===

After reading a few other posts on similar topics, I feel that to get the network to see a larger part of the sequence, I will have to either use more GPUs or resize the images. However, since the pretrained ResNet expects 224x224 inputs, resizing is not really an option, which leaves more GPUs. But I am curious: is there another strategy? Because the question could also be about the ideal segment length that would enable the network to learn.

From my perception, a 30-second segment sampled @5 fps is the bare minimum that captures the context; from what I have observed so far, going below this hasn't allowed the network to learn.

ekmungi

1 Answer


Feeding a CNN into an LSTM is definitely a valid option for your task. There are also papers on integrating the LSTM mechanism directly into the convolutional layers (like RCNN), which would be an alternative to try out. As you already identified, these architectures require a lot of memory, because your classifier depends on the full sequence of images and you have to store the gradient for each of them.

Ways to attenuate the memory problem are:

  • Resizing the images
  • Choosing a smaller sequence length
  • Smaller networks
  • Pre-encoding the images (only works if the convolutional encoder isn't trained further)
  • ...

Smaller networks might be an option for you. ResNet-50 is quite large, so you might want to explore smaller convolutional encoders like EfficientNet. Below is a figure plotting the number of parameters vs. ImageNet top-1 accuracy for multiple architectures. Complementarily, you can save even more parameters by substituting the LSTM with a GRU. The number of parameters in an LSTM/GRU scales quadratically with the layer size, so reducing that size is also worth experimenting with.
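To get a rough sense of the LSTM vs. GRU savings, here is a small sketch with illustrative sizes (a GRU has 3 gate matrices where an LSTM has 4):

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Illustrative sizes: 2048-d ResNet features, 512-d hidden state
lstm = nn.LSTM(input_size=2048, hidden_size=512, batch_first=True)
gru = nn.GRU(input_size=2048, hidden_size=512, batch_first=True)
print(n_params(lstm), n_params(gru))  # ~5.2M vs. ~3.9M parameters, i.e. about 25% fewer for the GRU
```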

Another option when you are using a pretrained model is to freeze its weights and pre-encode the images in your dataset: first apply the ConvNet to each frame of your videos and store the embeddings. This gives you a dataset of just the embeddings, on which you can then train the LSTM independently.
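A minimal sketch of that pre-encoding step (the clip iterator and file layout are placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
resnet = models.resnet50(pretrained=True)
encoder = nn.Sequential(*list(resnet.children())[:-1]).to(device).eval()  # frozen feature extractor

@torch.no_grad()
def encode_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 224, 224), already normalized -> (T, 2048) embeddings."""
    return encoder(frames.to(device)).flatten(1).cpu()

# One pass over the dataset, storing the embeddings next to the clips:
# for clip_path, frames in iterate_clips():    # hypothetical iterator over your segments
#     torch.save(encode_clip(frames), clip_path + ".pt")
```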

Having solved the memory problem, you may be able to work with longer sequences and experiment a bit to find out what works. For more specific help, you may want to include a bit more information about your data. It sounds like you are doing human action classification? If so, you can explore paperswithcode a little to get inspiration.

[Figure: number of parameters vs. ImageNet top-1 accuracy for multiple architectures]

Chillston
  • Thank you @Chillston, I appreciate the insight. I agree with you regarding the options. Resizing the images - depends on the pretrained network. Choosing a smaller sequence length - this option hasn't yielded much benefit. Smaller networks - will try it, I hadn't thought about that. Pre-encoding the images - this option crossed my mind, but I was not sure it would provide much benefit, since my image domain is not natural images (such as ImageNet). Using a GRU - I will read up on it; I was focused on LSTMs because they are the state of the art. – ekmungi Mar 19 '22 at 16:39
  • My domain is not exactly action recognition; I would like to apply this to medical images. Hence the intention to continue training the backbone feature extractor. – ekmungi Mar 19 '22 at 16:46
  • Do you think ResNet + LSTM is able to capture the temporal context sufficiently? From the link you suggested (thanks again for that suggestion), I came across spatio-temporal ResNets that include additional temporal context from optical flow. My original consideration was to use the output of FlowNet as an input to the LSTM network for further classification. What do you think? – ekmungi Mar 19 '22 at 17:16
  • Regarding GRU vs. LSTM: GRUs can outperform LSTMs, depending on the data (e.g. [this paper](https://arxiv.org/pdf/1412.3555.pdf)). If you need SOTA temporal processing performance, then self-attention is the way to go (e.g. [Transformer](https://jalammar.github.io/illustrated-transformer/)). A general piece of advice is to keep things as simple as possible until you have something that works :) To give you my opinion on the ResNet + LSTM architecture, a few more details about the problem would be very helpful. What exactly is depicted in the images, and what is their temporal relation? – Chillston Mar 19 '22 at 18:07
  • Thanks. The data I am looking at are endoscopic images, and I am trying to determine the direction of motion in these videos. – ekmungi Mar 24 '22 at 17:05
  • In that case, [Optical Flow estimation](https://paperswithcode.com/task/optical-flow-estimation) might be a nice keyword for you to look into. Such models output a vector for every pixel, indicating where it will be in the next frame. Here is a [model that uses self-attention](https://arxiv.org/pdf/2104.02409v3.pdf) for that :) CNN+LSTM might work as well though. What might be a pitfall in this task is the sampling rate. I would experiment with higher sampling rates ~20 fps and rather reduce the number of images in the time series, so that the motion happens somewhat _slower_. What do you think? – Chillston Mar 26 '22 at 10:44
  • Agreed, my problem has been with the sampling rate. There is no (_trivial and_) meaningful way to sample the videos. What you are suggesting is to keep the fps high and sparsely sample frames from it. Is that correct? The issue I foresee with this is that (_in the worst case_) the batch will end up with a bunch of blurred or water-filled frames. – ekmungi Mar 28 '22 at 14:16
  • Regarding the backbone feature detector, I am wondering if using optical flow might be overkill, as optical flow has to find a match for every pixel between frames. Perhaps I could get away with using a shallow CNN (untrained). What do you think? Of course, I have to try it, but I'm thinking out loud. :) – ekmungi Mar 28 '22 at 15:21
  • I agree, your use case is a slimmed-down version of the optical flow problem. So I think a CNN+LSTM or CNN+Self-Attention approach should generally work. However, what might be difficult is that visual features move in multiple directions from one image to the next, right? E.g. if you move the endoscope forward you get a radial zoom effect. I'd actually try to increase the sampling rate and sample subsequent frames (not skipping frames), because this increases the resolution of the movement, thus making it smoother. – Chillston Mar 28 '22 at 17:27
  • In this case ResNet-50 really seems like overkill, but you have to find out empirically. You may want to start with a really small network to get a baseline and see if performance increases when you add more layers. I'd start with only a few CNN layers (3 to 5 maybe) - it will train much faster and provide you with a lower bound for performance. And from there it is probably a lot of trial and error to find a nice architecture :) Is there any dataset on the internet that compares to your task? – Chillston Mar 28 '22 at 17:31
  • So yes, I think this is a great idea :) – Chillston Mar 28 '22 at 17:39
  • Unfortunately not, there isn't a dataset to try this out on :(. I will try the shallow network idea and let you know how it works. – ekmungi Mar 30 '22 at 16:38
  • Sure, I'm interested - I wish you much success with the project – Chillston Mar 31 '22 at 09:43
  • :) Thanks!! I have another question regarding training with an LSTM. During training, does it make sense to label every frame in the segment, or to have a single annotation for the whole segment? The motivation for this question is to understand how to treat the output of an LSTM. In other words, should only the output for the complete segment be passed to the loss, or should the losses from the individual frames in the segment be accumulated and backpropagated? When I write it down the former seems obvious, but I would like to have a second opinion :). Thank you! – ekmungi Apr 03 '22 at 05:16
  • That's a good question. Without knowing what the videos look like, I would assume that only experiments will show what works best. I'd go for the approach where you have labels for each time step, so you don't have to worry about sequence chunks where both classes occur. Intuitively, having a label for every time step should also yield a better gradient, but I might be wrong about that. – Chillston Apr 04 '22 at 15:34
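For illustration, the two labeling strategies discussed above differ only in which LSTM outputs are fed to the loss (a sketch with random stand-in tensors, assuming the classifier head is applied at every time step):

```python
import torch
import torch.nn as nn

batch, seq_len, num_classes = 4, 40, 2
criterion = nn.CrossEntropyLoss()

# Stand-in for per-time-step logits, i.e. the classifier applied to every LSTM output
logits = torch.randn(batch, seq_len, num_classes)

# Option A: a single label per segment -> use only the last time step
segment_labels = torch.randint(0, num_classes, (batch,))
loss_segment = criterion(logits[:, -1], segment_labels)

# Option B: a label for every frame -> fold time into the batch dimension
frame_labels = torch.randint(0, num_classes, (batch, seq_len))
loss_frames = criterion(logits.reshape(-1, num_classes), frame_labels.reshape(-1))
```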