
I am trying to rank video scenes/frames based on how appealing they are to a viewer: basically, how "interesting" or "attractive" a scene inside a video is. My final goal is to generate a short (say 10-second) summary given a video as input, like the previews you see on YouTube when you hover over a video.

I previously asked a similar question here, but the "aesthetics" model is good for ranking artistic images, not frames of videos, so it was failing. I need a score based on "engagement for a general audience": basically, which scenes/frames of a video will drive the most clicks, likes, and shares when selected as a thumbnail.

Is there an available deep-learning model or prototype that does this? I'm looking for a ready-to-use prototype/model that I can test, as opposed to a paper that I would need to implement myself. A paper is fine as long as its code is open-source; I'm new to this and can't yet write code from a paper.
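To be concrete about the final step: once some model produces a per-frame "interestingness" score, selecting the 10-second clip reduces to a maximum-sum sliding window. A minimal sketch in plain Python (the scores here are hypothetical placeholders; the model that would produce them is exactly what I'm asking for):

```python
def best_window(scores, window=10):
    """Return (start, total) of the length-`window` slice with the largest sum.

    `scores` is one interestingness score per second of video, produced
    by whatever frame-scoring model is used.
    """
    if len(scores) <= window:
        return 0, sum(scores)
    total = sum(scores[:window])                  # sum of the first window
    best_start, best_total = 0, total
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]   # slide the window right by one
        if total > best_total:
            best_start, best_total = i - window + 1, total
    return best_start, best_total

# Hypothetical per-second scores for a 30-second video:
scores = [0.1] * 30
scores[12:22] = [0.9] * 10      # an "interesting" stretch
start, _ = best_window(scores)  # start == 12, i.e. the clip covers seconds 12-21
```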

Tina J
  • In your model, are you looking for an accurate summary, or do you want to maximise interest (whilst still limiting output to an edit from the referenced video)? The two goals are often not compatible, witness any film trailer, YouTube "clickbait" etc. I am asking because I think I have seen references to work on the goal of generating accurate summaries, and might be able to find something. But that doesn't appear to be what you want? – Neil Slater Aug 27 '19 at 21:08
  • Not really accurate summary, but to maximize interest. Yes, it's highly subjective. We don't know the best solution. We just need "a" solution! As long as a model is targeting that concept, it should be fine. – Tina J Aug 27 '19 at 21:14
  • @NeilSlater Something like this: https://www.neonopen.org/ They claim their deep models find thumbnails that will drive more clicks, likes, and shares. But their code is client/server-based and not easy for me to run. – Tina J Aug 27 '19 at 21:16
  • 1
    OK, I don't know that area well enough. So I was going to suggest things like https://github.com/Pratik08/vis-dss (although note GPL licensing, which may not suit a commercial product), which is geared around creating an *accurate* summary. It's an interesting area though, so I hope you find an answer. Sometimes accuracy is an attractive end goal of course, if for instance you were indexing video with the goal of helping an end user finding something that they were looking for – Neil Slater Aug 27 '19 at 21:35
  • Thanks Neil. I will look into that repo. If anything else comes to mind, please let me know. – Tina J Aug 27 '19 at 22:02
  • @NeilSlater Neil, did you try building the repo yourself? `cmake` runs ok, but not the `make`. – Tina J Aug 29 '19 at 22:21
  • No I have not tried building Vis-DSS – Neil Slater Aug 30 '19 at 07:08
  • This strikes me as an extremely difficult problem due to the subjectivity of the response that you are trying to evaluate. At the very least, you would need to include the demographics of the viewer into the model. – DrMcCleod Sep 12 '19 at 07:52

1 Answer


One of the key terms in the literature that you are looking for is video captioning.

You can have a look at some of the relevant papers with code on this subject. In short, it is an active area of research and a difficult problem: videos are still hard to learn from (larger amounts of data, larger models, etc.), and such a model has to work with two modalities of data, text and images.

A paper you might want to start with is Deep Visual-Semantic Alignments for Generating Image Descriptions, which works on single images. In short, you can use an approach similar to the paper's: an object detector (e.g. Faster R-CNN) extracts visual features, which are fed into the state of an RNN (LSTM) that outputs the sequence of words in your summary (see the figure below). [Figure: image-captioning model from the paper]
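To make the data flow concrete, here is a toy numpy sketch of that decoding loop: a visual feature vector conditions the LSTM's initial state, and each step emits logits over a word vocabulary. All sizes and weights below are made-up placeholders for illustration, not the paper's actual trained architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HIDDEN, VOCAB = 2048, 512, 1000  # made-up sizes

# Random placeholder weights (a real model would learn these).
W_init = rng.standard_normal((FEAT, HIDDEN)) * 0.01    # image feature -> initial hidden state
W_x = rng.standard_normal((HIDDEN, 4 * HIDDEN)) * 0.01
W_h = rng.standard_normal((HIDDEN, 4 * HIDDEN)) * 0.01
b = np.zeros(4 * HIDDEN)
W_out = rng.standard_normal((HIDDEN, VOCAB)) * 0.01    # hidden state -> word logits
W_emb = rng.standard_normal((VOCAB, HIDDEN)) * 0.01    # word embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One standard LSTM cell step (input, forget, cell, output gates)."""
    gates = x @ W_x + h @ W_h + b
    i, f, g, o = np.split(gates, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def caption(visual_feature, max_words=5):
    """Greedy decoding: emit the argmax word index at each step."""
    h = np.tanh(visual_feature @ W_init)  # condition the LSTM on the image
    c = np.zeros(HIDDEN)
    word = 0                              # assume index 0 is a <start> token
    words = []
    for _ in range(max_words):
        h, c = lstm_step(W_emb[word], h, c)
        word = int(np.argmax(h @ W_out))
        words.append(word)
    return words

feat = rng.standard_normal(FEAT)          # stand-in for Faster R-CNN features
print(caption(feat))                      # a list of five vocabulary indices
```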

Anuar Y
  • Thanks. But how is video captioning related to video scoring? I'd like to know how interesting a scene is for a viewer. – Tina J Sep 11 '19 at 23:31
  • Right, I focused on "generate say a 10-second short summary given a video as input". Yeah, I see your problem. It sounds like a specific problem for which there wouldn't be a dataset available online (I don't know of one off the top of my head). If you have the opportunity to create your own dataset, then you can perform regression on your score directly. For example, this paper: [YouTube-8M](https://arxiv.org/pdf/1609.08675.pdf) performs video classification. Instead of predicting a class, you would predict a score (also changing the loss to an l1 or l2 loss). – Anuar Y Sep 11 '19 at 23:40
  • This is a useful answer for a different question, which is why I downvoted it. – DrMcCleod Sep 12 '19 at 07:53
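The regression idea from the comments (keep a classification-style pipeline, but predict a scalar score with an l1 or l2 loss) can be sketched in miniature. This toy uses plain numpy and a linear model over made-up frame features, purely to show the loss and the ranking step, not a realistic video architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 200 "frames", each with 64 features, plus a scalar
# engagement score per frame (in reality derived from click/like stats).
X = rng.standard_normal((200, 64))
true_w = rng.standard_normal(64)
y = X @ true_w + 0.1 * rng.standard_normal(200)

# Linear model trained by gradient descent on an l2 (mean squared error) loss.
w = np.zeros(64)
lr = 0.1
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)  # gradient of the l2 loss
    w -= lr * grad

mse = np.mean((X @ w - y) ** 2)

# Frames can then be ranked by predicted score, highest first,
# and the top-ranked frames used as thumbnail/summary candidates.
ranking = np.argsort(-(X @ w))
```

Swapping `grad` for the sign-based gradient of the l1 loss would give the other variant the comment mentions; in a real system `X` would be replaced by learned video features (e.g. from a YouTube-8M-style backbone).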