It sounds like you need a spatio-temporal model that learns human-body detection and action recognition.
For interesting papers on the subject, I would recommend looking at Vicky Kalogeiton's website.
Her PhD thesis
(2017, V. Kalogeiton, Localizing spatially and temporally objects and actions in videos)
basically covers her three papers on the subject:
- 2016, Kalogeiton, V., Schmid, C., and Ferrari, V. Analysing domain shift factors between videos and images for object detection,
- 2017a, Kalogeiton, V., Weinzaepfel, P., Ferrari, V., and Schmid, C. Action tubelet detector for spatio-temporal action localization,
- 2017b, Kalogeiton, V., Weinzaepfel, P., Ferrari, V., and Schmid, C. Joint learning of object and action detectors
Summary of the introduction of Kalogeiton's PhD thesis:
The thesis introduces an end-to-end multitask objective that jointly learns object-action relationships.
The object-action detector leverages the temporal continuity of videos.
Intra-class variations are key; they appear as differences in spatial location accuracy, appearance diversity, image quality, aspect distribution, object size, and camera framing.
An action class refers to an atomic action such as jump, walk, run, or climb.
The detector builds anchor cuboids
named tubelets, defined as sequences of bounding boxes with associated scores. An action detection spans a period of time (from the first to the last detected video frame) and takes place at a specific location in each frame. Intra-frame action detection can be ambiguous because of across-class similarities; a sequence, on the other hand, bears more information than a single frame for inferring the action.
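To make the tubelet idea concrete, here is a minimal sketch in Python. The names (`Box`, `Tubelet`, `best_action`) are my own illustration under the definition above (a sequence of per-frame bounding boxes plus one score per action class), not the paper's API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    # Axis-aligned bounding box in one frame (hypothetical layout).
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class Tubelet:
    boxes: List[Box]           # one box per frame of the input sequence
    class_scores: List[float]  # one score per action class, for the whole tubelet

def best_action(tubelet: Tubelet, class_names: List[str]) -> str:
    """Return the action class with the highest tubelet-level score."""
    idx = max(range(len(tubelet.class_scores)),
              key=lambda i: tubelet.class_scores[i])
    return class_names[idx]

# A 3-frame tubelet scored over three atomic action classes:
# classifying the whole sequence is less ambiguous than any single frame.
t = Tubelet(
    boxes=[Box(10, 10, 50, 90), Box(12, 11, 52, 91), Box(14, 12, 54, 92)],
    class_scores=[0.1, 0.7, 0.2],
)
print(best_action(t, ["jump", "walk", "run"]))  # walk
```

The key design point is that the scores live at the tubelet level, not per frame, which is what lets the sequence disambiguate actions that look alike in isolated frames.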
Most previous work uses per-frame object detections and then leverages the motion of objects to refine their spatial localization or to improve their classification.
Contributions:
- an analysis of the differences between still images and video frames for training and testing an object detector, among which (see Chapter 3 for more details):
- spatial location accuracy,
- appearance diversity,
- image quality,
- aspect distribution,
- camera framing
- an end-to-end two-stream network architecture that jointly detects object-action instances in uncontrolled videos (see Chapter 4 for more details)
- the ACtion Tubelet detector (ACT-detector), which takes as input a sequence of frames and outputs tubelets, i.e. sequences of bounding boxes with associated scores (see Chapter 5 for more details).