
Consider a fixed camera that records a given area. Three things can happen in this area:

  • No action
  • People performing action A
  • People performing action B

I want to train a model to detect when action B happens. A human observer could typically recognize action B even from a single frame, but it would be much easier with a short video (a few seconds at a low FPS).

What are the most suitable models for this task? I read this paper, in which different types of fusion are used to feed multiple frames to a CNN. Are there better alternatives?
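
To make the kind of fusion I mean concrete, here is a minimal sketch of "early fusion", where a few frames are stacked along the channel axis before a plain 2D CNN. It assumes PyTorch, and the class name and layer sizes are purely illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class EarlyFusionCNN(nn.Module):
    """Illustrative early-fusion baseline: T frames stacked along channels."""
    def __init__(self, num_frames=5, num_classes=3):  # no action / action A / action B
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3 * num_frames, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):
        # clip: (batch, T, 3, H, W) -> stack frames as channels: (batch, T*3, H, W)
        b, t, c, h, w = clip.shape
        features = self.backbone(clip.reshape(b, t * c, h, w)).flatten(1)
        return self.classifier(features)

# e.g. logits = EarlyFusionCNN()(torch.randn(2, 5, 3, 112, 112))  # -> (2, 3)
```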

firion

1 Answer


It seems you need a spatio-temporal model that learns both human-body detection and action recognition. For interesting papers on the subject, I would recommend looking at Vicky Kalogeiton's website.

Her 2017 PhD thesis, V. Kalogeiton, Localizing spatially and temporally objects and actions in videos, basically covers her three papers on the subject:

Summary of Kalogeiton's PhD introduction:

The thesis introduces an end-to-end multitask objective that jointly learns object-action relationships; the action-object detector leverages the temporal continuity of videos.
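
To make the multitask idea concrete, here is a minimal sketch of a joint object/action loss. It assumes PyTorch and simply sums two cross-entropy terms over the same clip; this is an illustration, not the thesis's actual objective:

```python
import torch.nn as nn

# Illustration only (not the thesis's objective): one clip is scored for both
# the object class and the action class, and the two losses are summed.
object_loss_fn = nn.CrossEntropyLoss()
action_loss_fn = nn.CrossEntropyLoss()

def multitask_loss(object_logits, object_labels, action_logits, action_labels,
                   action_weight=1.0):
    # object_logits: (batch, num_object_classes), action_logits: (batch, num_action_classes)
    return object_loss_fn(object_logits, object_labels) + \
           action_weight * action_loss_fn(action_logits, action_labels)
```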

Intra-class variations are key; they appear as differences in spatial location accuracy, appearance diversity, image quality, aspect distribution, object size and camera framing.
An action class refers to an atomic class such as jump, walk, run, climb, etc.

The detector builds anchor cuboids, called tubelets, defined as sequences of bounding boxes with associated scores. Action detection spans a period of time (from the first to the last detected video frame) and takes place at a specific location in each frame. Action detection within a single frame can be ambiguous; a sequence, on the other hand, carries more information than a single frame (e.g. it helps resolve across-class similarities) for inferring the action.
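
As a rough illustration of what a tubelet carries, here is a hypothetical data structure (not code from the ACT-detector):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class Tubelet:
    """Hypothetical tubelet: per-frame boxes plus scores over action classes."""
    frame_indices: List[int]        # consecutive video frames the tubelet spans
    boxes: List[Box]                # one bounding box per frame
    class_scores: Dict[str, float]  # e.g. {"action_A": 0.10, "action_B": 0.85}

    def temporal_extent(self) -> Tuple[int, int]:
        # first and last detected video frame
        return self.frame_indices[0], self.frame_indices[-1]

    def best_action(self) -> str:
        return max(self.class_scores, key=self.class_scores.get)
```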

Most previous work uses per-frame object detections and then leverages the motion of objects to refine their spatial localization or improve their classification.

Contributions

  • differences between still images and video frames for training and testing an object detector, among which (see Chapter 3 for more details):
    • spatial location accuracy,
    • appearance diversity,
    • image quality,
    • aspect distribution,
    • camera framing
  • jointly detect object-action instances in uncontrolled videos using an end-to-end two-stream network architecture (see Chapter 4 for more details; a generic two-stream sketch follows this list)
  • propose the ACtion Tubelet detector (ACT-detector), which takes as input a sequence of frames and outputs tubelets, i.e. sequences of bounding boxes with associated scores (see Chapter 5 for more details).
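
A generic sketch of the two-stream idea mentioned above, assuming PyTorch: one CNN looks at an RGB frame (appearance), another at a stack of optical-flow fields (motion), and their class scores are fused. This is a generic illustration, not the architecture from Chapter 4:

```python
import torch
import torch.nn as nn

def small_cnn(in_channels, num_classes):
    # tiny stand-in for a real backbone (ResNet, VGG, ...)
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_classes),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=3, flow_stack=10):
        super().__init__()
        self.rgb_stream = small_cnn(3, num_classes)                # appearance stream
        self.flow_stream = small_cnn(2 * flow_stack, num_classes)  # motion stream (x/y flow)

    def forward(self, rgb, flow):
        # late fusion: average the per-class scores of the two streams
        return (self.rgb_stream(rgb) + self.flow_stream(flow)) / 2
```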
pascal sautot
  • this answer could complement mine https://ai.stackexchange.com/questions/1481/how-can-action-recognition-be-achieved?noredirect=1&lq=1 – pascal sautot Dec 12 '19 at 11:56