I have a data frame like the following (these rows are just an example):
A 1 Normal
A 2 Normal
A 3 Stress
B 1 Normal
B 2 Stress
B 3 Stress
C 1 Normal
C 2 Normal
C 3 Normal
I want to do 5-fold cross-validation, and I am currently splitting the data using:
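from sklearn.model_selection import StratifiedKFold
from fastai.vision import *  # fastai v1: ImageList, get_transforms

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
data = (ImageList.from_folder(PATH)
        .split_by_rand_pct(valid_pct=0.2)  # random 80/20 split, ignores column 1
        .label_from_folder()
        .transform(get_transforms(do_flip=True, flip_vert=True, max_zoom=1.1,
                                  max_rotate=10, max_lighting=0.5), size=224)
        .databunch()
        .normalize())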
This works, and it splits the data randomly, as expected. However, I want all data points that share the same value in column 1 to stay together, in either the training set or the validation set. So all the A's would be in either the training or the validation dataset, all the B's would be in either the training or the validation dataset, and so on.
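To make the constraint concrete, here is a minimal sketch of a group-aware split on the toy rows above. It is plain Python and not part of fastai; the helper name `group_k_fold` is hypothetical. It only guarantees that a group never straddles training and validation; it does not stratify by label. The number of folds cannot exceed the number of distinct groups, so the three-group toy data allows at most 3 folds. scikit-learn's `GroupKFold` offers the same kind of guarantee.

```python
import random

# Toy rows from the example: (group, index, label)
rows = [
    ("A", 1, "Normal"), ("A", 2, "Normal"), ("A", 3, "Stress"),
    ("B", 1, "Normal"), ("B", 2, "Stress"), ("B", 3, "Stress"),
    ("C", 1, "Normal"), ("C", 2, "Normal"), ("C", 3, "Normal"),
]

def group_k_fold(groups, n_splits, seed=1):
    """Yield (train_idx, valid_idx) pairs; no group appears on both sides."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    random.Random(seed).shuffle(unique)
    # Deal the shuffled groups round-robin into folds.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        valid = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, valid

groups = [r[0] for r in rows]
folds = list(group_k_fold(groups, n_splits=3))
for train_idx, valid_idx in folds:
    print("valid groups:", sorted({groups[i] for i in valid_idx}))
```

Each fold's validation indices could then be handed to whatever split mechanism the training library exposes.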
More info on my data: I have cell assay images, each labelled with one of three classes. The images are large, so I split each one into 16 small, non-overlapping tiles to bring the size down to 224 (suitable for feeding into a CNN). Each tile has the same label as its original image, and the tiles are the final input to the CNN. To perform cross-validation, I need to keep all tiles of the same image in the same fold and the same set (training or validation).
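The tiling described above can be sketched as follows, so that every tile carries the id of its parent image and that id can serve as the grouping key for cross-validation. This is a sketch under assumptions: the 896x896 image size (a 4x4 grid of 224-pixel tiles), the image ids `img_001` and `img_002`, and the helper names `tile_boxes` and `tile_records` are all hypothetical and not taken from the actual data.

```python
def tile_boxes(width, height, tile=224):
    """Return (left, upper, right, lower) boxes for a non-overlapping grid."""
    return [
        (x, y, x + tile, y + tile)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]

def tile_records(image_id, label, width=896, height=896, tile=224):
    """One record per tile; each tile inherits the parent's id and label."""
    return [
        {"group": image_id, "tile": n, "box": box, "label": label}
        for n, box in enumerate(tile_boxes(width, height, tile))
    ]

# Hypothetical parent images; 16 tiles each under the assumed 896x896 size.
records = tile_records("img_001", "Normal") + tile_records("img_002", "Stress")
groups = [r["group"] for r in records]  # grouping key for a group-aware split
print(len(records), sorted(set(groups)))
```

With the `group` value stored next to each tile, any group-aware splitter can keep the 16 tiles of one image on the same side of every fold.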