
I generated a bunch of simulation data from a complex physical simulation that outputs patterns. I am trying to apply unsupervised learning to analyze the patterns and ideally classify them into whatever categories the learning technique identifies. Using PCA or manifold techniques such as t-SNE for this problem is rather straightforward, but applying neural networks (autoencoders, specifically) becomes non-trivial, as I am not sure whether splitting my dataset into training and test data is the right approach here.

Naively, I was thinking of the following approaches:

  1. Train an autoencoder with all the data as training data, and train it for a large number of epochs (overfitting is not a problem in this case per se, I would think)

  2. Keras offers model.predict, which lets me construct just the encoder section of the autoencoder and obtain the bottleneck values (see the sketch after this list)

  3. Carry out some data augmentation, split the data into training and test sets as one normally would, and carry out the workflow as usual (this approach makes me a little uncomfortable, as I am not attempting to make the network generalize, or should I be?)
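
To make options 1 and 2 concrete, here is a minimal sketch of what I have in mind in Keras (the flattened 64x64 input, the layer widths, and the random placeholder data are only stand-ins for my actual simulation output):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder for the simulation patterns: n_samples x (64*64), scaled to [0, 1]
X = np.random.rand(1000, 64 * 64).astype("float32")

input_dim = X.shape[1]
bottleneck_dim = 16

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(256, activation="relu")(inputs)
encoded = layers.Dense(bottleneck_dim, activation="relu")(encoded)  # bottleneck
decoded = layers.Dense(256, activation="relu")(encoded)
decoded = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Option 1: train on all the data, no held-out test set
autoencoder.fit(X, X, epochs=200, batch_size=32, verbose=0)

# Option 2: a separate encoder model reusing the trained layers;
# model.predict then gives the bottleneck values for clustering
encoder = keras.Model(inputs, encoded)
codes = encoder.predict(X)  # shape: (n_samples, bottleneck_dim)
```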

I would appreciate any guidance on how to proceed or if my understanding of the application of autoencoders is flawed in this context.

  • What type of data do you have? What type of layers are you using for the autoencoder? For instance, if you have time-series data and use convolutional autoencoders, the representation given by the hidden layers may not be as good as you expect. Is there any particular reason you want to use autoencoders? – Uskebasi Jan 02 '21 at 08:51
  • The data is the final simulation state, i.e. the final image/pattern. The idea behind using an autoencoder is that the other dimensionality reduction techniques were unable to separate the patterns into distinct clusters. I should have mentioned this is a clustering problem rather than a classification problem. – Pavan Inguva Jan 03 '21 at 18:53

2 Answers


When using an autoencoder, I believe the data you feed in has to be correlated in one way or another. For example, if I want to learn a latent representation of an image of a cat, the training data that I feed into the autoencoder should consist only of cat images.

As with other neural networks, you feed the autoencoder a set of training data and hope that the network learns a set of weights that can reconstruct the input image from the latent representation. To see whether the weights learnt by the autoencoder are able to generalise to other unseen cat images, you would have to use a test set.
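
As a rough, self-contained sketch of that generalisation check (the random data, input size, and layer widths here are placeholders, not your actual setup):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for a dataset of flattened images
X = np.random.rand(1000, 4096).astype("float32")
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

inputs = keras.Input(shape=(4096,))
code = layers.Dense(16, activation="relu")(inputs)
outputs = layers.Dense(4096, activation="sigmoid")(code)
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, verbose=0)

# Reconstruction error on unseen samples indicates how well the learnt
# weights generalise beyond the training images
test_loss = autoencoder.evaluate(X_test, X_test, verbose=0)
print(test_loss)
```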

Here is a paper on sparse autoencoders: https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf

I hope this helps you somewhat in deciding whether you should use an autoencoder.

calveeen

An autoencoder helps you learn an embedding space that can then be passed to PCA or t-SNE to group different categories of images in an unsupervised fashion. Since the model is trained to reconstruct its input, it learns the underlying patterns in your images that are useful for reconstruction, and those patterns translate into the embedding space.
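
For instance, a minimal sketch of clustering that embedding space (the codes array below is only a stand-in for the encoder output, and the number of clusters is an assumption):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

codes = np.random.rand(1000, 16)  # stand-in for encoder.predict(X)

# 2-D projection of the embedding space for visual inspection
embedded = TSNE(n_components=2).fit_transform(codes)

# Cluster directly in the embedding space; the cluster count is a guess
labels = KMeans(n_clusters=4, random_state=0).fit_predict(codes)
```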

For example, if you want an autoencoder to learn patterns in cat vs. dog images and define the embedding space to be a 16-dimensional vector, the model will learn 16 different patterns that help it reconstruct the image.

In this case, it is better to balance your dataset with respect to the classes you want to learn (have an equal number of cat and dog images in your training set) so that you don't induce bias (so that the model doesn't favour learning more about cats than dogs).

I would argue that overfitting is not a good thing for an autoencoder, because your embedding space will then be restricted to your training domain (for example, the whiskers of a cat in your training samples might be smaller than those of the cats in your test samples; you don't want the embedding space to learn whisker size to that precision, but rather to learn that whiskers are important). It is important for any model to generalize across different circumstances. I agree that you need to train the system for longer epochs to learn good representations.

I hope this gives you a better intuition about the factors to consider when learning a good embedding space.