I would like an explanation of the following passage (page 3) from the paper "A Novel Approach for Robust Multi Human Action Detection and Recognition based on 3-Dimentional Convolutional Neural Networks":
We introduce a 3D convolution neural network with the following notations: $I(x, y, d)$ as an input video with a size of $x \times y$ and $d$ the temporal depth.
What is "temporal depth"? Is it the number of frames?
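To make my reading concrete, here is a minimal sketch that assumes $d$ is simply the number of frames in the clip, which is my interpretation and not something the paper states in the quoted sentence. The sizes, the nested-list layout, and the helper names `shape` and `temporal_positions` are hypothetical and chosen only for illustration.

```python
# Hypothetical sizes, chosen only for illustration (not taken from the paper).
x, y, d = 4, 3, 5  # frame width, frame height, assumed number of frames

# Build I as d frames, each a y-by-x grid of zero-valued pixels.
video = [[[0.0 for _ in range(x)] for _ in range(y)] for _ in range(d)]


def shape(clip):
    """Return (frames, rows, columns) of a nested-list clip."""
    return (len(clip), len(clip[0]), len(clip[0][0]))


def temporal_positions(depth, kernel_depth, stride=1):
    """Count the positions a kernel spanning kernel_depth frames can take
    along the time axis when no padding is used."""
    return (depth - kernel_depth) // stride + 1


print(shape(video))              # (5, 3, 4)
print(temporal_positions(d, 3))  # 3
```

Under this assumption, a 3D kernel slides over the time axis as well as the two spatial axes, so a kernel covering 3 frames fits at 3 different temporal positions in a 5-frame clip.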