
On the website, the following explanation is provided about the Embedding layer:

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used in a variety of ways, such as:

  • It can be used alone to learn a word embedding that can be saved and used in another model later.
  • It can be used as part of a deep learning model where the embedding is learned along with the model itself.
  • It can be used to load a pre-trained word embedding model, a type of transfer learning.

Aren't embeddings model-specific? I mean, to learn a representation of something, we need the model that the something was represented with! So how can embeddings learned in one model be used in another?

Oculu

2 Answers


An embedding layer is a linear layer that is used to convert a discrete input into a vector of a fixed size, d. Learned embedding layers are often used in natural language processing (NLP). Common pre-trained embeddings are GloVe, GoogleNews, and word2vec. These embeddings have often been trained on huge amounts of data (e.g. 3 billion words for GoogleNews), and as such can be applied over a wide variety of contexts (although you might need specific embeddings for specific tasks). These embeddings were all obtained by training neural networks such that words that occur in similar contexts have a similar embedding (you may look up the individual training algorithms for a better understanding).
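For concreteness, here is a minimal sketch in Keras of an Embedding layer mapping integer token ids to fixed-size vectors (the vocabulary size and dimension d below are just illustrative values, not from any real dataset):

```python
# Minimal sketch: an Embedding layer turning discrete token ids into d-dimensional vectors.
import numpy as np
from tensorflow import keras

vocab_size = 1000   # number of distinct tokens (illustrative)
d = 64              # embedding dimension (illustrative)

embedding = keras.layers.Embedding(input_dim=vocab_size, output_dim=d)

token_ids = np.array([[4, 25, 7]])   # a batch with one sequence of 3 token ids
vectors = embedding(token_ids)       # looks up one d-dimensional vector per id
print(vectors.shape)                 # (1, 3, 64)
```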

Words that are similar to each other are closer to each other in d-dimensional space (where d is the vector size), whereas dissimilar words are far apart. This distance can be measured by cosine similarity. Research has shown that using pre-trained embeddings usually improves model performance in many NLP tasks (https://arxiv.org/pdf/1804.06323.pdf).
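As a small illustration of that distance measure, here is cosine similarity between made-up 4-dimensional vectors (these are not real embeddings, just toy numbers):

```python
# Toy sketch: cosine similarity as a measure of closeness between word vectors.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.8, 0.1, 0.3, 0.5])
dog = np.array([0.7, 0.2, 0.4, 0.4])
car = np.array([-0.5, 0.9, -0.2, 0.1])

print(cosine_similarity(cat, dog))  # close to 1: similar words
print(cosine_similarity(cat, car))  # close to 0 or negative: dissimilar words
```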

If you do use pre-trained word embeddings, make sure that the vocabulary of your training data is present in the embeddings. During neural network training with pre-trained embeddings, it makes sense to freeze the weights of the embedding to prevent superfluous gradient computation (and to avoid ruining your embeddings). However, if some words present in your training data are not present in the embedding, you may consider only partially freezing the weights of your embedding.
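As a rough sketch of that setup (Keras assumed; `word_index` and `glove_index` below are toy stand-ins for a real tokenizer vocabulary and a real pre-trained embedding file such as GloVe), you could build a frozen Embedding layer like this:

```python
# Sketch: initialize an Embedding layer from pre-trained vectors and freeze it.
import numpy as np
from tensorflow import keras

d = 4                                                        # pre-trained vector dimension (toy value)
word_index = {"cat": 1, "dog": 2, "car": 3}                  # word -> integer id from your tokenizer
glove_index = {"cat": np.ones(d), "dog": np.ones(d) * 0.9}   # word -> pre-trained vector (toy values)

num_tokens = len(word_index) + 1                # +1 for the padding index 0
embedding_matrix = np.zeros((num_tokens, d))
for word, i in word_index.items():
    vector = glove_index.get(word)
    if vector is not None:                      # words missing from the embedding keep a zero vector
        embedding_matrix[i] = vector

embedding_layer = keras.layers.Embedding(
    num_tokens,
    d,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,                            # freeze the pre-trained vectors during training
)
```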

  • I think you misunderstood my question. I am asking how embeddings learned in one model can be used in another. Shouldn't embeddings be model-specific? – Oculu Oct 20 '22 at 22:44
  • I'm afraid I don't understand what you mean. An embedding is simply a linear layer; you can just load its weights. See: https://keras.io/examples/nlp/pretrained_word_embeddings/ under the section "Load pre-trained word embeddings" for an example in Keras. – postnubilaphoebus Oct 20 '22 at 22:54
  • I have put in the quotation that, as per the website, it is mentioned `It can be used alone to learn a word embedding that can be saved and used in another model later`. This is my concern: how can it be used in another model? Shouldn't the learned embeddings be model-specific, i.e. usable only in that model? – Oculu Oct 20 '22 at 23:54
  • No, embeddings aren't model-specific. Why would they be? – postnubilaphoebus Oct 21 '22 at 00:03
  • Because different models require different input vector sizes, and given the way vector algebra works, if we change the model then different parameters are needed for it to function. So if your model uses tanh as the activation function and you change it to sigmoid, I am sure the embeddings need to be changed. – Oculu Oct 31 '22 at 13:54
  • I see your issue now. You can just use PCA to adjust that (at least when the embedding dimension of your model is lower than the dimension of the pre-trained embeddings). When it is the other way around, consider adapting the input dimension of your model. And I'm not sure why you think activation functions matter here. Maybe it's a good idea if you explain how you think it works and we can spot any misconceptions. – postnubilaphoebus Oct 31 '22 at 14:57
  • So basically my understanding is that embedding vectors are representations of the input learned by training some specific model. So if the model changes, say only the activation function changes from tanh (-1 to +1) to sigmoid (0 to 1), then the response from using embeddings learned by the tanh-based model will be different when the same embeddings are used in the sigmoid-based model, which will affect the output. I think what the original text mentioned is wrong. – Oculu Oct 31 '22 at 22:58
  • The activation function is not relevant here. An embedding is the first layer of a neural network. So if you train a neural network to have a certain embedding, it doesn't matter what activation functions it has in later layers. The output also doesn't matter here. Maybe reading up on how word2vec (the model) got trained would help, specifically CBOW and skip-gram. – postnubilaphoebus Nov 01 '22 at 01:11
  • Aren't embeddings learned by backpropagation? So the type of activations used affects the outcome? – Oculu Nov 01 '22 at 12:07
  • Backpropagation simply means calculating the gradient of a loss function with respect to a model's variables. Upon obtaining the gradients, gradient descent is used to step in the direction of the negative gradient, i.e. to change the model weights at each layer according to the weight update. This happens regardless of the activation functions used. What matters is that the outcome, i.e. the trained embedding, meaningfully separates words in d-dimensional space (where d is the vector size). – postnubilaphoebus Nov 01 '22 at 16:40
  • I recommend you talk to a professor to understand this better, and explain your understanding of how embeddings are trained as detailed as possible. That way, they can quickly help point out what you are missing, because embeddings are definitely model-independent. – postnubilaphoebus Nov 01 '22 at 16:42
  • I'm on the same page as @Oculu; I don't understand how embeddings can be so model-friendly. I get how backpropagation works; the point is that any model at the end of training will generate different weights based on its training set and specification, so even if the layer size of a different model matches, the weights in that layer don't. So how can an embedding (which, as you say, is the first layer of the model) be compatible with the first layer of another model? Do we have a reference with a specific explanation of this? – Not Important May 04 '23 at 12:41
  • Fun fact: I asked ChatGPT 4. The response was too long to post here but clarified some concepts for me (need to check for hallucinations). Basically there are different approaches with pros and cons; here is the gist, as a pointer for going deeper: embeddings learned by one model can be used in another through transfer learning techniques such as pre-trained embeddings, fine-tuning, embedding layer sharing, and knowledge distillation. I think the doubt @Oculu and I were sharing was on point; it's the implementation of the embeddings and the models that matters. – Not Important May 04 '23 at 12:49
  • Also, a point of doubt could arise about how these embeddings are used, because usually you use them to index external document sources and then search a user query against a vector DB containing these indexes. So it doesn't matter where these embeddings come from, as long as you index all documents using the same embeddings. Embedding results aren't something you give back to the model. – Not Important May 04 '23 at 12:58

To answer your one question: are embeddings model-specific? YES! They are. I am not going to invoke math or other techniques here; my explanation is going to be from an intuitive perspective. I don't know if the current literature and jargon will agree with my usage, but I get your question.

Take a scenario where you trained a CNN to classify smileys or emojis based on their positivity or negativity, so it's a binary classification problem. Say you achieved a very good model. The penultimate layer of this model will give you a higher-dimensional "embedding" vector for any new emoji or smiley picture you feed to this CNN.
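As a sketch of what I mean (Keras assumed; the tiny CNN and the random images below are only illustrative, not a real emoji classifier), the penultimate layer can be read out as an embedding like this:

```python
# Sketch: use the penultimate layer of a binary image classifier as an "embedding".
import numpy as np
from tensorflow import keras

inputs = keras.Input(shape=(32, 32, 3))
x = keras.layers.Conv2D(8, 3, activation="relu")(inputs)
x = keras.layers.GlobalAveragePooling2D()(x)
penultimate = keras.layers.Dense(16, activation="relu", name="penultimate")(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(penultimate)  # positive vs. negative
cnn = keras.Model(inputs, outputs)
# ... assume `cnn` is trained on labelled smiley/emoji images here ...

embedder = keras.Model(inputs, penultimate)         # model that stops at the penultimate layer
new_images = np.random.rand(5, 32, 32, 3).astype("float32")  # stand-in for new emoji pictures
features = embedder.predict(new_images)             # one 16-dimensional "embedding" per image
```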

Can you compare these embeddings with another ANN model's results arising from the same training and testing data? Technically you can't.

But by coincidence they might be very related. How do you verify it? You can use the argument of @postnubilaphoebus and do a PCA on the embedding data so that you can compare the two embeddings (by reducing the dimensions until both embedding results match). Then feed all inputs to both models, apply PCA to the corresponding penultimate-layer outputs, and you get pairs of "embeddings". Do a dot-product analysis to compare them; you will know whether they say the same thing or are very different.
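A rough sketch of that comparison (scikit-learn assumed; the random arrays below stand in for the penultimate-layer outputs of two different models on the same inputs):

```python
# Sketch: reduce two embedding sets to a common dimension with PCA,
# then compare corresponding rows with cosine similarity (normalized dot products).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(200, 64))        # model A: 200 inputs, 64-dim embeddings (toy data)
emb_b = rng.normal(size=(200, 128))       # model B: same 200 inputs, 128-dim embeddings (toy data)

k = 16                                    # common reduced dimension (illustrative)
reduced_a = PCA(n_components=k).fit_transform(emb_a)
reduced_b = PCA(n_components=k).fit_transform(emb_b)

# Values near 1 suggest the two models represent an input similarly; near 0, they do not.
norm_a = reduced_a / np.linalg.norm(reduced_a, axis=1, keepdims=True)
norm_b = reduced_b / np.linalg.norm(reduced_b, axis=1, keepdims=True)
similarities = np.sum(norm_a * norm_b, axis=1)
print(similarities.mean())
```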

I am not an NLP or LLM person, but this philosophy will work for any models. Thank you. I hope you got the idea; if not, I am willing to explain further.

dexterdev
  • AI noob here. What's a PCA? – darKnight Aug 28 '23 at 18:57
  • PCA = Principal Component Analysis. https://en.wikipedia.org/wiki/Principal_component_analysis – dexterdev Sep 01 '23 at 06:19