
I am working on a model for an NLP task. The model encodes the text and has a regression output layer.

In this task, from each (positive) instance I create several negative cases using a specific technique and merge them with their corresponding positive ones within a data split (training/val/test). After that, I shuffle the data split.

I was thinking of the following: isn't it better to keep the negative instances with their corresponding positive ones in the same batch, instead of shuffling the data?

Is there an answer to this question? Does it depend on the task?
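
For concreteness, what I have in mind is something like the grouped batch sampler sketched below (PyTorch; the class name and the groups structure are just illustrative). Each group holds the dataset indices of one positive and its generated negatives, and I would shuffle the groups rather than the individual examples:

    # Rough sketch: yield batches in which each positive appears together with
    # its own generated negatives. `groups` is a list of index lists, one per
    # positive, e.g. [[0, 1, 2, 3], [4, 5, 6, 7], ...], where the first index
    # in each group is the positive and the rest are its negatives.
    import random
    from torch.utils.data import Sampler

    class GroupedBatchSampler(Sampler):
        def __init__(self, groups, groups_per_batch):
            self.groups = groups
            self.groups_per_batch = groups_per_batch

        def __iter__(self):
            order = list(range(len(self.groups)))
            random.shuffle(order)  # shuffle whole groups, not individual examples
            for i in range(0, len(order), self.groups_per_batch):
                batch = []
                for g in order[i:i + self.groups_per_batch]:
                    batch.extend(self.groups[g])
                yield batch

        def __len__(self):
            return (len(self.groups) + self.groups_per_batch - 1) // self.groups_per_batch

    # usage: DataLoader(dataset, batch_sampler=GroupedBatchSampler(groups, groups_per_batch=8))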

Minions
  • Can you explain why you wouldn't use the negative cases and the positive cases in the same batch? Why are you separating the cases? Can you provide more details about your specific task? – nbro Mar 17 '22 at 09:43
  • @nbro I shuffle the data because this helps training converge faster. It's hard to give details here, but the task is to train a model to produce a score. I have instances with a score of 1.0, and with the negative sampling I created instances that have scores less than 1.0 (0, 0.33, 0.67, etc.). The negative instances are similar to the original ones and share part of the text. – Minions Mar 17 '22 at 16:29

1 Answer


Depending on how you generate the negative cases, there is a potential for data leakage from your test set. If you have a positive case in your test set and use some rule to "negate" that case and put it in your training set, you now have a training sample that is not truly independent of your test data. With this approach, you may wind up with a biased, over-optimistic performance estimate, because the model was trained on data that is a modification of the test data. The severity of this problem depends on how you generate the synthetic cases from the real ones and is difficult to quantify, so you may be justified in keeping the real positive cases and their synthetic negative cases together in the same batch (and, more importantly, in the same split) to avoid this problem.
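
If leakage is the concern, one way to rule it out is to generate the negatives only after splitting, so every synthetic case stays in the same split as the positive it was derived from. A rough sketch (make_negatives below is just a stand-in for whatever generation technique you use):

    # Generate negatives per split, so no synthetic training example is
    # derived from a validation or test positive.
    def make_negatives(positive):
        # stand-in for the real generation technique: corrupted copies with lower scores
        text, _ = positive
        return [(text + " [corrupted]", score) for score in (0.0, 0.33, 0.67)]

    def build_split(positives):
        examples = []
        for pos in positives:  # pos is a (text, score) pair with score == 1.0
            examples.append(pos)
            examples.extend(make_negatives(pos))
        return examples

    # split the *real* positives first, then expand each split independently
    train_examples = build_split([("a real positive sentence", 1.0)])
    test_examples = build_split([("a held-out positive sentence", 1.0)])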

Nuclear Hoagie
  • Thanks, @Nuclear. The generation of the negative cases is split-based: the negative cases in the training split were created from the positive ones in the training split; I don't move them across the splits. My point is: should I shuffle the training split, or keep the negative cases with their positive ones in the same batch? – Minions Mar 15 '22 at 17:35