What are the pros and cons of LSTM vs Bi-LSTM in language modelling? What was the need to introduce Bi-LSTM?
1 Answer
I would say that the motivation behind the introduction was more empirical than technical. The only difference between an LSTM and a Bi-LSTM is that the Bi-LSTM can also leverage future context to learn better representations of individual words. No special training step or extra unit type is added; the idea is simply to read the sentence both forward and backward and combine the two passes to capture more information.
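To make that concrete, here is a minimal sketch (using PyTorch, which is purely my choice of framework, with made-up tensor sizes) showing that the only architectural change is the `bidirectional=True` flag: a second LSTM reads the sequence right-to-left and its hidden states are concatenated with the forward ones for every token.

```python
import torch
import torch.nn as nn

# Toy batch: 2 sentences, 5 tokens each, 16-dimensional word vectors (made-up sizes).
x = torch.randn(2, 5, 16)

# Plain LSTM: each token only sees its left (past) context.
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
out_uni, _ = lstm(x)
print(out_uni.shape)   # torch.Size([2, 5, 32])

# Bi-LSTM: a second LSTM reads the sentence right-to-left; the two hidden
# states are concatenated, so each token's representation now also encodes
# its right (future) context.
bilstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)
out_bi, _ = bilstm(x)
print(out_bi.shape)    # torch.Size([2, 5, 64])  -> 32 forward + 32 backward
```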
As trivial as the idea sounds, it works: in the original paper the authors achieved state-of-the-art scores on three tagging tasks, namely part-of-speech tagging, chunking and named entity recognition.
That said, these scores were not dramatically higher than those of other models, and the complete architecture included a Conditional Random Field (CRF) on top of the Bi-LSTM.
Probably the most important aspect to stress is that the authors performed two interesting comparison tests: one using random embedding initialisation and another using only words (unigrams) as input features. Under both conditions the Bi-LSTM (with the CRF on top) significantly outperformed all other architectures, showing that Bi-LSTM representations are more robust than the representations learned by the other models.
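As a rough illustration of the first of those two conditions (again a hypothetical PyTorch sketch, not the authors' code), the only thing that changes between random initialisation and pretrained vectors is how the embedding layer is built; the Bi-LSTM tagger itself is identical, and the CRF would sit on top of the per-token scores it produces.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bi-LSTM tagger; the per-token scores would normally feed a CRF layer on top."""
    def __init__(self, embedding: nn.Embedding, hidden_size: int, num_tags: int):
        super().__init__()
        self.embedding = embedding
        self.bilstm = nn.LSTM(embedding.embedding_dim, hidden_size,
                              batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embedding(token_ids))
        return self.emissions(h)   # (batch, seq_len, num_tags)

vocab_size, emb_dim, num_tags = 10_000, 100, 17   # made-up sizes

# Condition 1: random embedding initialisation (no pretrained vectors).
random_emb = nn.Embedding(vocab_size, emb_dim)

# Alternative: pretrained word vectors (here just a random stand-in tensor).
pretrained_vectors = torch.randn(vocab_size, emb_dim)
pretrained_emb = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

tagger = BiLSTMTagger(random_emb, hidden_size=64, num_tags=num_tags)
scores = tagger(torch.randint(0, vocab_size, (2, 5)))   # 2 sentences, 5 tokens
print(scores.shape)    # torch.Size([2, 5, 17])
```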
I would also like to make a side note regarding human reading. It seems natural to consider unidirectional sequence models the most reasonable emulation of human reading, because we experience reading as eye movement in a single direction. In reality, though, saccades (very rapid, unconscious eye movements) and other eye movements play an enormous role in reading. This means that we humans also continuously look at past and future words in order to understand the role of a word or sentence we are processing. Of course, in our case these movements are directed by implicit knowledge and habits that let us attend only to the important words or parts (for example, we barely read conjunctions), and it is interesting that current state-of-the-art Transformer-based models try to learn exactly this: where to pay attention, rather than a single probability for each word in a vocabulary.
