Datasets input at model.fit produce unexpected results of training loss vs validation loss

Question

I'm trying to train a neural network (a VAE) with TensorFlow, and I'm getting different results depending on the type of input I pass to model.fit.

When I pass arrays, I get a normal gap between the training loss and the validation loss. When I pass a dataset built from the same data, I get a normal training loss and a very small validation loss.

I haven't changed the model. The only thing that changes is the input format.

Here is the code for the array input. train_slices has shape (2627, 138, 138, 1), and I set the batch size in model.fit.

train_slices = preprocess_data(CropTumor, file_array[train_dataset])
val_slices = preprocess_data(CropTumor, file_array[val_dataset])

# reset model weights before training
VAE.set_weights(initial_weights)

# fit model
fit_results = VAE.fit(train_slices,train_slices,
                      epochs=1000,
                      validation_data=(val_slices,val_slices),
                      callbacks=[early_stopping_kfold, tensorboard_callback],
                      batch_size=batch_sz,
                      verbose=2
                      )

The output

Epoch 1/1000
2022-08-01 11:56:35.683852: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
2022-08-01 11:56:36.371780: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-08-01 11:56:36.461054: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
672/672 - 7s - loss: 537.2896 - val_loss: 213.7070 - 7s/epoch - 11ms/step
Epoch 2/1000
672/672 - 5s - loss: 248.5211 - val_loss: 161.9758 - 5s/epoch - 7ms/step
Epoch 3/1000
672/672 - 4s - loss: 192.8771 - val_loss: 125.9349 - 4s/epoch - 6ms/step
Epoch 4/1000
672/672 - 4s - loss: 153.1647 - val_loss: 99.4395 - 4s/epoch - 6ms/step
Epoch 5/1000
672/672 - 5s - loss: 132.0143 - val_loss: 88.9975 - 5s/epoch - 7ms/step
Epoch 6/1000
672/672 - 4s - loss: 118.5642 - val_loss: 81.1653 - 4s/epoch - 6ms/step
Epoch 7/1000
672/672 - 5s - loss: 108.6678 - val_loss: 76.6315 - 5s/epoch - 7ms/step
Epoch 8/1000
672/672 - 4s - loss: 100.9759 - val_loss: 73.8963 - 4s/epoch - 6ms/step

When I instead pass the same data as a dataset, I get a very small validation loss:

    train_dset = tf.keras.preprocessing.image_dataset_from_directory(
        directory="./Data/09_TrainingSet_VAE1",
        labels=None,
        label_mode=None,
        image_size=(138, 138),
        color_mode="grayscale",
        batch_size=None,
        shuffle=True)

    val_dset = tf.data.Dataset.from_tensor_slices(val_slices)
    train_dset = (train_dset.map(preprocess_dataset).batch(batch_sz).shuffle(1))
    val_dset = (val_dset.map(preprocess_dataset).batch(batch_sz).shuffle(1))
    # reset model weights before training
    VAE.set_weights(initial_weights)

    # fit model
    fit_results = VAE.fit(train_dset,
                          epochs=10,
                          validation_data=val_dset,
                          callbacks=[early_stopping_kfold, tensorboard_callback],
                          verbose=2
                          )

And my output is

Epoch 1/10
2022-08-01 12:04:08.656012: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
2022-08-01 12:04:09.335957: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-08-01 12:04:09.431082: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
613/613 - 7s - loss: 466.5601 - val_loss: 17.3872 - 7s/epoch - 12ms/step
Epoch 2/10
613/613 - 5s - loss: 217.4277 - val_loss: 7.7309 - 5s/epoch - 8ms/step
Epoch 3/10
613/613 - 5s - loss: 167.2855 - val_loss: 6.0742 - 5s/epoch - 9ms/step
Epoch 4/10
613/613 - 6s - loss: 130.8230 - val_loss: 1.9557 - 6s/epoch - 10ms/step
Epoch 5/10
613/613 - 6s - loss: 112.1165 - val_loss: 1.1561 - 6s/epoch - 10ms/step
Epoch 6/10
613/613 - 5s - loss: 101.3152 - val_loss: 0.6442 - 5s/epoch - 8ms/step
Epoch 7/10
613/613 - 5s - loss: 93.3648 - val_loss: 0.4150 - 5s/epoch - 8ms/step
Epoch 8/10
613/613 - 5s - loss: 87.1542 - val_loss: 0.2232 - 5s/epoch - 8ms/step

The loss function for both is

def loss_func(encoder_mu, encoder_log_variance):
    def vae_reconstruction_loss(y_true, y_predict):
        reconstruction_loss = tf.math.reduce_sum(tf.math.square(y_true-y_predict), axis=[1, 2, 3])
        return reconstruction_loss

    def vae_kl_loss(encoder_mu, encoder_log_variance):
        kl_loss = -0.5 * tf.math.reduce_sum(
            1.0 + encoder_log_variance - tf.math.square(encoder_mu) - tf.math.exp(encoder_log_variance),
            axis=1)
        return kl_loss

    def vae_loss(y_true, y_predict):
        reconstruction_loss = vae_reconstruction_loss(y_true, y_predict)
        kl_loss = vae_kl_loss(y_true, y_predict)
        loss = reconstruction_weight*reconstruction_loss + kl_weight*kl_loss
        return loss

    return vae_loss

and the model is compiled with

VAE.compile(optimizer=tfk.optimizers.Adam(learning_rate=learning_rate),
        loss=loss_func(encoder_mu_layer, encoder_log_variance_layer))

score 0 · Answer 1 · answered Aug 01 '22 at 10:01

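Solved it. It was a bug in my code after all. The preprocess_dataset function divides the data by 255, and I applied it to both the training dataset, which comes from a directory (0-255 range), and the validation dataset, which comes from a loaded array that had already been divided by 255. The validation inputs were therefore scaled twice, into roughly the 0 to 1/255 range, which makes the summed squared reconstruction error tiny.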
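
To make the effect concrete, here is a minimal NumPy sketch of the double scaling. It is not the original code: the body of preprocess_dataset is assumed (the post only says it divides by 255), to_pair is a hypothetical helper, the random images stand in for the real slices, and an all-zero prediction stands in for the model output. The sum-of-squares reduction mirrors vae_reconstruction_loss from the question.

```python
import numpy as np


def preprocess_dataset(x):
    # Assumed behaviour: scale 0-255 pixels to 0-1 and return (input, target).
    x = x / 255.0
    return x, x


def to_pair(x):
    # Hypothetical helper for data that is already scaled: no division here.
    return x, x


def reconstruction_loss(y_true, y_pred):
    # Same reduction as vae_reconstruction_loss: per-image sum of squared errors.
    return np.sum(np.square(y_true - y_pred), axis=(1, 2, 3))


rng = np.random.default_rng(0)
# Stand-in for images read from a directory (0-255 range).
raw = rng.integers(0, 256, size=(4, 138, 138, 1)).astype(np.float64)

# Stand-in for val_slices: an array that has already been divided by 255.
val_slices = raw / 255.0

# The bug: running the already-scaled array through preprocess_dataset again.
double_scaled, _ = preprocess_dataset(val_slices)

# The fix: pair inputs with targets without scaling a second time.
fixed_inputs, fixed_targets = to_pair(val_slices)

# Compare the loss of the same (all-zero) prediction against both targets.
pred = np.zeros_like(val_slices)
loss_ok = reconstruction_loss(fixed_targets, pred)
loss_bug = reconstruction_loss(double_scaled, pred)
ratio = loss_ok / loss_bug  # squared error scales with the square of the pixel scale

print("max after double scaling:", double_scaled.max())
print("loss ratio (correct / double-scaled):", ratio)
```

Because squared error scales with the square of the pixel range, the extra division by 255 shrinks the reconstruction term by a factor of 255². That fits the validation loss dropping below 1 while the training loss stays near 100. A quick check of the minimum and maximum of one batch from each dataset before calling model.fit would catch this kind of mismatch.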
