How can I split the data into training and validation sets such that entries with a certain value are kept together?

Question

I have a data frame like the following (this is just an example):

A 1 Normal
A 2 Normal
A 3 Stress
B 1 Normal
B 2 Stress
B 3 Stress
C 1 Normal
C 2 Normal
C 3 Normal

I want to do 5-fold cross-validation, and I am currently splitting the data using

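# Imports assumed from the calls below (fastai v1 data block API, scikit-learn)
from fastai.vision import *
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# PATH is the folder containing the images, one sub-folder per class
data = (ImageList.from_folder(PATH)
        .split_by_rand_pct(valid_pct=0.2)   # random 80/20 split, ignores column 1
        .label_from_folder()
        .transform(get_transforms(do_flip=True, flip_vert=True, max_zoom=1.1,
                                  max_rotate=10, max_lighting=0.5), size=224)
        .databunch()
        .normalize())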

It works well and splits the data randomly, as expected. However, I want rows that share the same value in column 1 to stay together, in either the training set or the validation set. So all the A's would end up in one set, all the B's in one set, and so on.
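To make the desired behaviour concrete, here is a minimal sketch in plain Python using the toy rows above. The helper `split_by_group` and its parameters are hypothetical names made up for illustration; it holds out whole groups (column 1) rather than individual rows, so no group ends up on both sides.

```python
import random

# Toy rows from the example: (group, index, label)
rows = [
    ("A", 1, "Normal"), ("A", 2, "Normal"), ("A", 3, "Stress"),
    ("B", 1, "Normal"), ("B", 2, "Stress"), ("B", 3, "Stress"),
    ("C", 1, "Normal"), ("C", 2, "Normal"), ("C", 3, "Normal"),
]

def split_by_group(rows, valid_frac=0.2, seed=1):
    """Hypothetical helper: send whole groups to train or validation."""
    groups = sorted({r[0] for r in rows})   # unique values of column 1
    rng = random.Random(seed)
    rng.shuffle(groups)                     # randomise which groups are held out
    n_valid = max(1, round(valid_frac * len(groups)))  # at least one group
    valid_groups = set(groups[:n_valid])
    train = [r for r in rows if r[0] not in valid_groups]
    valid = [r for r in rows if r[0] in valid_groups]
    return train, valid

train_rows, valid_rows = split_by_group(rows)
```

Note that `valid_frac` here is a fraction of groups, not of rows, so the validation share of rows is only approximate when groups differ in size.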

More info on my data: I have cell assay images labelled with one of three classes. The images are large, so I split each one into 16 small, non-overlapping tiles to bring the size down to 224 (small enough to feed into a CNN). Each tile has the same label as its original image. These tiles are the final input to the CNN. To perform cross-validation, I need to keep all tiles from the same image in the same fold.
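One possible way to get such folds is scikit-learn's `GroupKFold`, which takes a `groups` array and never places the same group in both the training and validation side of a fold. The sketch below is only an illustration under made-up assumptions: 10 source images, 16 tiles each, invented labels, and integer indices standing in for tile file paths.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 10 source images, 16 tiles per image.
n_images, tiles_per_image = 10, 16
image_ids = np.repeat(np.arange(n_images), tiles_per_image)  # group id of each tile
image_labels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])      # made-up class per image
tile_labels = image_labels[image_ids]                         # tiles inherit the image label
tile_index = np.arange(len(image_ids))                        # stand-in for tile file paths

gkf = GroupKFold(n_splits=5)
folds = []
for train_idx, valid_idx in gkf.split(tile_index, tile_labels, groups=image_ids):
    # No source image contributes tiles to both sides of the fold.
    assert set(image_ids[train_idx]).isdisjoint(image_ids[valid_idx])
    folds.append((train_idx, valid_idx))
```

`GroupKFold` does not balance classes across folds. If class balance matters, `StratifiedGroupKFold` (available in scikit-learn 1.0 and later) accepts the same `groups` argument. Each `valid_idx` could then probably be passed to fastai v1's `split_by_idx` in place of `split_by_rand_pct`, assuming the item order of the `ImageList` matches the order used to build `image_ids`.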

I agree with you that in usual cases it would. I have a different kind of data. I edited my question to explain it. — user1631306, Dec 24 '19 at 15:34
@GeorgeWhite: Often it is the other way around - not performing a stratified split will ruin your validation, because you train on data that is correlated with validation data, and may think you have good generalisation but in fact it is bad. E.g. if your goal is to recognise emotions from images of faces in general (on an unseen person) and you have multiple pictures of same person, those pictures should stay together, either in training set or cv set or test set, but not in multiple sets. — Neil Slater, Dec 24 '19 at 17:49
