
How can I classify a given sequence of images (video) as either moving or staying still from the perspective of the person inside the car?

Below is an example of the sequence of 12 images, animated:

  1. Moving, from the point of view of the person inside the car (class 0: moving).

  2. Staying still, from the point of view of the person inside the car (class 1: still).

Methods I tried to achieve this:

  1. A simple CNN (with 2D convolutions) with the 12 greyscale images stacked in the channel dimension (like DeepMind's DQN); see the sketch after this list. The input to the CNN is (batch_size, 200, 200, 12).

  2. A CNN with 3D convolutions. The input to the CNN is (batch_size, 12, 200, 200, 1).

  3. A CNN+LSTM (a time-distributed CNN with 2D convolutions, followed by an LSTM). The input to the network is (batch_size, 12, 200, 200, 1).

  4. The late-fusion method: take 2 frames from the sequence that are some time steps apart, pass them through 2 CNNs with shared weights, and concatenate the outputs in a dense layer, as described in this paper. This is like the CNN+LSTM without the LSTM part. The input to this network is (batch_size, 2, 200, 200, 1), where the 2 images are the first and last frames of the sequence.
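
For reference, here is a minimal Keras sketch of method 1; the filter counts, kernel sizes, and strides are illustrative (loosely following the DQN architecture), not the exact values I tuned:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Method 1: 12 greyscale frames stacked in the channel dimension,
# two output classes (0 = moving, 1 = still), no pooling layers.
model = models.Sequential([
    layers.Input(shape=(200, 200, 12)),
    layers.Conv2D(32, 8, strides=4, activation="relu"),
    layers.Conv2D(64, 4, strides=2, activation="relu"),
    layers.Conv2D(64, 3, strides=1, activation="relu"),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```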

All the methods I tried failed to achieve my objective. I tried tuning various hyperparameters, such as the learning rate and the number of filters in the CNN layers, but nothing worked.

All the methods used a batch_size of 8 (due to memory constraints), and all images are greyscale. I used ReLU activations and a softmax in the last layer. No pooling layers were used.

Any help on why my methods are failing, or any pointers to related work, would be appreciated.


1 Answer


CNNs are translation invariant.

You are overcomplicating the problem. The easiest thing you can do is define a region of interest (ROI) over the hood. When the car is moving, the reflections on the hood are dynamic; when it is standing still, they are static. Just do frame-to-frame image subtraction of the hood ROI: if the vehicle is moving, you will see lots of 'edge energy'; if it is not, the difference will be just noise.
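
A minimal sketch of that idea, assuming a list of greyscale uint8 frames; the hood ROI and the threshold are placeholders you would pick by inspecting your own footage:

```python
import cv2
import numpy as np

# Hypothetical ROI over the hood in a 200x200 frame (rows, cols).
HOOD_ROI = (slice(150, 200), slice(40, 160))

def edge_energy(frames):
    """Mean absolute frame-to-frame difference inside the hood ROI."""
    diffs = [cv2.absdiff(cur[HOOD_ROI], prev[HOOD_ROI]).mean()
             for prev, cur in zip(frames[:-1], frames[1:])]
    return float(np.mean(diffs))

def classify(frames, threshold=2.0):
    """Class 0 = moving, class 1 = still; calibrate the threshold
    on a few labelled sequences."""
    return 0 if edge_energy(frames) > threshold else 1
```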

You can apply the same method to the whole image, too, but there the subtraction can get messy in the static case, because clouds, other vehicles, and pedestrians keep moving even when your car is still. In that case, use the frame differences as the input to your network instead of the raw frames.
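
For example, a small helper (hypothetical, assuming a (12, 200, 200) uint8 array) that turns the raw sequence into a difference stack for the network:

```python
import numpy as np

def difference_stack(frames):
    """Turn a (12, 200, 200) uint8 sequence into the 11 frame-to-frame
    absolute differences, scaled to [0, 1], shaped for a CNN input."""
    diffs = np.abs(np.diff(frames.astype(np.int16), axis=0))
    return (diffs.astype(np.float32) / 255.0)[..., np.newaxis]  # (11, 200, 200, 1)
```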

Another approach is to run an image-stabilization algorithm; OpenCV has one. Look at the estimated transformation parameters (translation, rotation, scale, under a rigid, similarity, or affine model). If you can't build a simple rule on top of them to separate the two cases, train a classifier on them.
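
A rough sketch of that idea using standard OpenCV calls (feature tracking plus a similarity-transform fit); the parameter values are guesses to tune, not tested defaults:

```python
import cv2
import numpy as np

def interframe_translation(prev, cur):
    """Estimate a similarity transform between two consecutive greyscale
    frames and return the magnitude of its translation component."""
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                  qualityLevel=0.01, minDistance=10)
    if pts is None:
        return 0.0
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev, cur, pts, None)
    good = status.ravel() == 1
    if good.sum() < 3:
        return 0.0
    M, _inliers = cv2.estimateAffinePartial2D(pts[good], nxt[good])
    if M is None:
        return 0.0
    return float(np.hypot(M[0, 2], M[1, 2]))

# One simple rule: if the median translation across the 12-frame sequence
# is near zero, call it "still"; otherwise "moving". If no threshold
# separates the classes, feed the per-frame transform parameters
# (translation, rotation, scale) to a classifier instead.
```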

  • Do you mean translation equivariant? Pooling layers are responsible for the invariance, right? And I didn't use pooling in my methods. – Naveen Apr 16 '18 at 06:12