I'm currently doing some research on video recognition. What I'm trying to do is similar to this paper.
The idea is this: to process a given input video clip (shape: [T, C, H, W]), the model needs the features of the clip from the previous timestamp, because we are trying to build a long-term feature memory.
As a result, we have to read consecutive video clips sequentially, and the dataloader outputs data like this (each number is the timestamp of a video clip, and each row is one batch):
[1, 9, 18, 27, 36]
[2, 10, 19, 28, 37]
[3, 11, 20, 29, 38]
[4, 12, 21, 30, 39]
...
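To make the layout above concrete, here is a minimal sketch of how such batches of clip indices could be generated. It is an illustration, not my actual loader: the function name `stream_batches` and the parameters `num_clips` and `num_streams` are hypothetical, the indices are 0-based, and it assumes the clips are split into equal contiguous chunks with any leftover clips dropped.

```python
from typing import Iterator, List


def stream_batches(num_clips: int, num_streams: int) -> Iterator[List[int]]:
    """Yield batches of clip indices; column j walks stream j in temporal order."""
    # Assumption: split [0, num_clips) into num_streams contiguous chunks of
    # equal length. Leftover clips at the end are dropped for simplicity.
    stream_len = num_clips // num_streams
    for step in range(stream_len):
        # One clip per stream, all at the same offset within their chunk.
        yield [stream * stream_len + step for stream in range(num_streams)]


batches = list(stream_batches(num_clips=20, num_streams=5))
for batch in batches[:3]:
    print(batch)
# [0, 4, 8, 12, 16]
# [1, 5, 9, 13, 17]
# [2, 6, 10, 14, 18]
```

Each column always advances by one clip per batch, so the features from the previous timestamp are available for every sample. If the dataset is a PyTorch map-style dataset indexed by clip, an iterable of index lists like this can be passed to `torch.utils.data.DataLoader` through its `batch_sampler` argument.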
The problem is that, because of the sequential reading, I can't shuffle in the dataloader, which unsurprisingly results in pretty bad performance.
For this kind of streaming data loading, are there any tips for recovering the performance lost by not shuffling?
Note: I ran the following experiments to verify that the problem is really caused by the lack of shuffling:
- original dataloader with shuffling: good accuracy
- original dataloader without shuffling: worst accuracy
- the dataloader described in this question: medium accuracy, but still bad