If we look at state-of-the-art accuracy on the UCF101 data set, it is around 93%, whereas for the HMDB51 data set it is around 66%. I looked at both data sets, and they contain videos of similar lengths. I was wondering if anyone could give an intuition as to why the HMDB51 data set has proven harder.

ksh
1 Answer


It is true that at first glance, one could expect classification among 101 categories to be harder than classification among 51 categories. However, many aspects play a role in action recognition applications.

For instance, HMDB51 contains several categories of facial movements, such as smiling, laughing, and chewing, as well as related categories like eating and drinking. Such categories are not present in UCF101 and are obviously among the most difficult to deal with. The HMDB51 authors also claim that the data set includes some low-quality, challenging videos.

It is hard to predict in advance how difficult a data set will be to classify. We can imagine that once state-of-the-art methods reach accuracy beyond 90%, it is time to build a data set that makes these methods fail, in order to push for even more robust solutions. I don't know these data sets well, but the videos in the harder data set most probably present more variability in terms of viewpoint, camera motion, illumination changes, image quality, etc.
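As a minimal sketch of how one might probe this variability hypothesis (this is my own illustration, not something from either data set's tooling), one could compute per-video statistics with OpenCV: the mean dense optical-flow magnitude as a rough proxy for camera/object motion, and the standard deviation of per-frame brightness as a rough proxy for illumination change. The video path below is hypothetical.

```python
import cv2
import numpy as np

def video_variability(path, max_frames=200):
    """Crude per-video proxies for motion and illumination variability."""
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    if not ok:
        return None
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flow_mags, brightness = [], [prev_gray.mean()]
    for _ in range(max_frames):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames (Farneback).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flow_mags.append(np.linalg.norm(flow, axis=2).mean())
        brightness.append(gray.mean())
        prev_gray = gray
    cap.release()
    return {"mean_flow": float(np.mean(flow_mags)),
            "illumination_std": float(np.std(brightness))}

# Hypothetical path; comparing the distributions of these statistics over
# the two data sets would give a crude first check of the hypothesis.
print(video_variability("HMDB51/some_clip.avi"))
```

If HMDB51's distributions of these statistics were noticeably wider or shifted compared to UCF101's, that would support the intuition above, though real variability (viewpoint, occlusion, etc.) goes well beyond these two proxies.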

Also, check the results announced for the UCF101 data set on this page. I don't know where you found your accuracy value, because the official website announces less than 43.9%. Some publications do not use the complete data set, and use only part of it to show the performance of the approach they designed.

Finally, the official website of the HMDB51 data set reports the following: "The UCF group has also been collecting action datasets, mostly from YouTube. There are UCF Sports featuring 9 types of sports and a total of 182 clips, UCF YouTube containing 11 action classes, and UCF50 containing 50 action classes. We will show in the paper that videos from YouTube could be very biased by low-level features, meaning low-level features (i.e., color and gist) are more discriminative than mid-level features (i.e., motion and shape)." This could also explain why better results can be achieved on UCF101.
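That bias claim is testable with a simple baseline comparison, sketched below (my own illustration, not the quoted paper's protocol): train the same linear classifier on a purely low-level feature and on a crude motion feature, and compare accuracies. Here `X_color`, `X_motion`, and `y` are placeholder arrays; in practice `X_color` could hold per-video RGB histograms and `X_motion` averaged optical-flow histograms.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_videos, n_classes = 500, 51
# Placeholder features standing in for real per-video descriptors.
X_color = rng.random((n_videos, 96))    # e.g., mean RGB histogram per video
X_motion = rng.random((n_videos, 64))   # e.g., averaged optical-flow histogram
y = rng.integers(0, n_classes, size=n_videos)

for name, X in [("color (low-level)", X_color),
                ("motion (mid-level)", X_motion)]:
    # Same classifier on both feature sets isolates the feature's effect.
    acc = cross_val_score(LinearSVC(max_iter=5000), X, y, cv=3).mean()
    print(f"{name}: {acc:.3f}")
```

If the color-only classifier approaches the motion-based one on a given data set, that data set is likely biased toward low-level cues, which is exactly what the HMDB51 authors argue for the YouTube-sourced UCF collections.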

Eskapp
  • The inclusion of clips from feature films in HMDB51, which do not appear in UCF101, is one difference I noticed in the literature, but not enough to explain as great a disparity as the question author observed. The idea about the facial movements seems the most likely. I have no better hypothesis, so I up-voted. – Douglas Daseeco Jul 29 '18 at 10:36