
I have 2 questions:

  1. I am using a plain CNN model for time-series classification. My dataset1 has shape (28, 9, 1), and I trained my CNN on it, so the model's input layer expects (28, 9, 1). Now I want to fine-tune the model on another dataset2, which has a different shape, (20, 5, 1). How do I fine-tune my model on dataset2 when its shape differs from the dataset the model was trained on?
  2. My other doubt: the VGG model is trained on images of shape 224x224x3. If I want to fine-tune it on my custom dataset (images of shape 124x124x3), I can load VGG with input shape (124, 124, 3) and fine-tune on my data, that is fine; but how is the input shape of the architecture changed, given that VGG was trained on 224x224x3 images? The weight matrix between the input layer and the next layer has a shape compatible with a 224x224x3 image. If I load VGG with input shape (124, 124, 3), is that weight matrix removed and replaced by a new weight matrix compatible with a 124x124x3 image? And is this new weight matrix also trained, or is it frozen?
Arjun Reddy
  • Regarding question 1: have you tried to pad dataset2 to get matching shapes? – Luca Anzalone Apr 07 '23 at 14:11
  • Yes, but 2-3 features of dataset1 differ from dataset2 – Arjun Reddy Apr 07 '23 at 14:21
  • Well, then when you fine-tune you have to train all layers (because you want the feature extractor part to adapt too) but with a much lower learning rate, say at least 10x smaller (this is to avoid completely losing what was previously learned.) – Luca Anzalone Apr 07 '23 at 14:38
  • Yes, but how do we use the model when the dataframes have different shapes? – Arjun Reddy Apr 08 '23 at 07:46
  • Let me clarify my initial comment: you start from `dataset1`, which is (28, 9, 1) in shape, and train your model on it; next you take `dataset2`, which is (20, 5, 1), and pad it (e.g. with zeros) to get a shape of (28, 9, 1). Finally, you fine-tune on the padded data. Does that resolve your doubt? – Luca Anzalone Apr 08 '23 at 09:09
  • Yeah, got it, but I have 2 doubts. 1.) Our model is trained on dataset1, and let's say dataset1 has features like gender and income (in classes like rich, middle class, poverty) that are not in dataset2. If we encode female as 0 and male as 1, then when we pad dataset2 with 0s, does that change dataset2's meaning and context, stating that every datapoint in dataset2 is female, assuming gender plays a very important role in prediction? 2.) What if our dataset2 has more features than dataset1, e.g. shape (44, 99, 6)? – Arjun Reddy Apr 08 '23 at 17:41
  • Good point: 1) to avoid the issue with gender, for example, you can pad with a value that is not meaningful (-1 perhaps). Anyway, even when padding with zero the model should find out that the padded features are always zero regardless of the label, so the weights associated with them should be smaller (maybe some l1-regularization would do the trick..) 2) If the shape is very different you should design a standard format for each datapoint across datasets, or try learning an intermediate model that "converts" a sample from one shape to another (but this is a bold idea..) – Luca Anzalone Apr 09 '23 at 17:20
  • Does this idea work? Training a model on dataset1, then removing the input layer and the last output layer and replacing them with layers compatible with the shape of dataset2, keeping just the body. Does this help in any way? – Arjun Reddy Apr 09 '23 at 17:44
  • Well, I'm not sure, but a *fully-convolutional network* (FCN) should work, at least in principle. Such FCNs exploit the fact that convolutions can adapt to different input shapes (though I guess a minimum input size should be enforced), and what you get is an output shape that varies with the input shape. FCNs were pretty popular for semantic segmentation some years ago, since they can be applied to different image sizes. With an FCN you don't have to change the input layer; in general, though, you can't do this (because the weight matrix of the first hidden layer would change in shape.) – Luca Anzalone Apr 09 '23 at 17:58
  • Okay Got it.. Thanks a lot @LucaAnzalone – Arjun Reddy Apr 09 '23 at 19:08

1 Answer


A1. There are several alternatives for fine-tuning on a dataset with a different shape than the one used for training (call the training dataset's shape s1 and the target shape s2):

  • If the target shape is similar (i.e. slightly smaller or larger): you can pad the target samples (if s2 is smaller than s1) or crop them (if s2 is larger than s1) to match s1. Padding is a simple method that works for tabular, sequence, and image data. The usual padding value is zero, but if zero has a precise meaning in the training dataset d1 you may want to pick a meaningless value instead; either way, the network should learn that the padded value is always constant and has no correlation with the targets.
  • If the target shape is quite different:
    • In the strict case of image data (note: restrictions from the learning framework can apply), a fully-convolutional network (FCN) - made of only convolutional layers, replacing even the Dense (or fully-connected) layers with $1\times 1$ convolutions to handle varying image sizes - may do the job, but you still need to handle the varying output size.
    • Perhaps a more practical alternative is to train an intermediate model that converts (i.e. tries to learn a compatible representation of) a sample with target shape s2 to source shape s1, such that you can reuse the model trained on d1. In principle, this should work on any kind of data.
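As a minimal sketch of the padding option, using NumPy and the shapes from the question (the pad value -1 is an illustrative choice, as discussed in the comments):

```python
import numpy as np

# Hypothetical batch from dataset2: 100 samples of shape (20, 5, 1),
# padded up to the (28, 9, 1) shape the trained model expects.
d2 = np.random.rand(100, 20, 5, 1)

# Pad with a value that is meaningless for the data (here -1),
# appending rows/columns at the end of each sample.
padded = np.pad(
    d2,
    pad_width=((0, 0), (0, 8), (0, 4), (0, 0)),  # 28-20 and 9-5 on the sample axes
    mode='constant',
    constant_values=-1.0,
)
print(padded.shape)  # (100, 28, 9, 1)
```

Cropping for the opposite case (s2 larger than s1) would be plain slicing, e.g. `d2[:, :28, :9, :]`.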

More details on the "intermediate model approach"

Say you have a model $f(x)$ trained on dataset d1 with shape s1, and you want to transfer its knowledge to a model $f_2$ fine-tuned on dataset d2 with shape s2. To do so, you learn a dataset-specific model $g:\bar X\to X$ that learns to represent samples from d2 as if they were from d1.

During training (and also at inference) you would use the model $g$ as follows (in pseudo-code):

x_bar, y = sample_from(d2)
x = g(x_bar)        # re-express the d2 sample in d1's input shape
y_hat = f_2(x)

# gradient step: the loss updates both g and f_2
loss = criterion(y_hat, y)
loss.backward()
optimizer.step()    # optimizer built over the parameters of g and f_2  <--

Basically, you learn $g$ by letting the gradient of the error flow into it from $f_2$.
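The pseudo-code above can be made concrete, e.g. in PyTorch; here `g` and `f_2` are tiny placeholder networks for the shapes in the question, not the real trained model:

```python
import torch
import torch.nn as nn

# g: maps a (1, 20, 5) sample from d2 to the (1, 28, 9) shape of d1
# (channels-first, as PyTorch's Conv2d expects).
g = nn.Sequential(nn.Flatten(), nn.Linear(20 * 5, 28 * 9),
                  nn.Unflatten(1, (1, 28, 9)))

# f_2: stand-in for the clone of the pre-trained model (2 classes).
f_2 = nn.Sequential(nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(),
                    nn.Flatten(), nn.Linear(8 * 26 * 7, 2))

criterion = nn.CrossEntropyLoss()
# One optimizer over the parameters of BOTH g and f_2.
optimizer = torch.optim.Adam(list(g.parameters()) + list(f_2.parameters()),
                             lr=3e-4)

x_bar = torch.randn(16, 1, 20, 5)   # a dummy batch from d2
y = torch.randint(0, 2, (16,))

x = g(x_bar)                        # d2 sample re-expressed in d1's shape
y_hat = f_2(x)
loss = criterion(y_hat, y)
loss.backward()                     # gradients flow into g through f_2
optimizer.step()
```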

Such an approach should work well if the features in d1 are somehow correlated (even through some hidden relation) with the ones in d2; in general, if there is no correlation, you can't transfer the learning at all!

If you want to try this approach, I suggest a two-phase training strategy (assuming $f_2$ is a clone of $f$):

  1. Warm-up: freeze (i.e. make fixed, or not learnable) the weights of $f_2$. Pick a small learning rate (the common 3e-4 may work well) and train $g$ on the target dataset until the error (yielded by $f_2$) stabilizes (it does not need to be low). This ensures that when you later fine-tune, the gradient magnitude is sufficiently low.
  2. Fine-tuning: depending on the problem (i.e. how different the features in d2 are) you may want to train only the last output layer, the penultimate layers, or even all layers (if the features are quite different). In any case, it is important to use a much smaller learning rate (especially if re-training all layers): say 10-20x smaller than before, or than the lr used to learn $f$.
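The two phases can be sketched in PyTorch as follows; `g` and `f_2` are again tiny illustrative stand-ins for the adapter and the cloned model:

```python
import torch
import torch.nn as nn

g = nn.Sequential(nn.Flatten(), nn.Linear(20 * 5, 28 * 9))
f_2 = nn.Sequential(nn.Linear(28 * 9, 2))

def set_trainable(model, flag):
    # Freeze/unfreeze all weights of a module.
    for p in model.parameters():
        p.requires_grad = flag

# Phase 1 - warm-up: f_2 frozen, only g is trained with a small lr.
set_trainable(f_2, False)
warmup_opt = torch.optim.Adam(g.parameters(), lr=3e-4)
# ... train g here until the loss yielded by f_2 stabilizes ...

# Phase 2 - fine-tuning: unfreeze f_2 (or only its last layers)
# and use a ~10-20x smaller learning rate for everything.
set_trainable(f_2, True)
finetune_opt = torch.optim.Adam(
    list(g.parameters()) + list(f_2.parameters()), lr=3e-5)
# ... fine-tune on dataset d2 ...
```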

A2. If you change the shape of the input layer, the weight matrix of the next layer (i.e. the first hidden layer) usually depends on that shape, so it is very likely to change too. Convolutional layers - and so most of VGG - should in principle adapt to the input size; what can still be problematic are the final dense layers.
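A toy PyTorch sketch of this point (a stand-in conv net, not the real VGG weights): the same conv kernels apply at any spatial size, while a dense head sized for 224x224 inputs breaks on 124x124 ones.

```python
import torch
import torch.nn as nn

# Conv body: kernel shapes are independent of the input resolution.
conv_body = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1),
                          nn.ReLU(), nn.AdaptiveAvgPool2d(1))

big = torch.randn(1, 3, 224, 224)
small = torch.randn(1, 3, 124, 124)

# Same conv weights work on both sizes: only the spatial extent of
# the feature maps changes, not the weight shapes.
print(conv_body(big).shape, conv_body(small).shape)  # both (1, 8, 1, 1)

# A dense head sized for 224x224 inputs fails on 124x124 inputs:
head = nn.Linear(3 * 224 * 224, 10)
try:
    head(small.flatten(1))
except RuntimeError as e:
    print("shape mismatch:", e)
```

This is also why, in Keras for example, one typically loads VGG with `include_top=False` when changing the input shape: the pre-trained conv weights are kept, while freshly initialized (and therefore trainable) dense layers are stacked on top.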

Luca Anzalone