
An LSTM model can be trained to generate text sequences by feeding it a first word. After the first word is fed in, the model generates a sequence of words (a sentence): feed the first word to get the second word, feed the first word + the second word to get the third word, and so on.
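For example, the loop I have in mind looks roughly like this (a minimal sketch; `next_word` is a toy stand-in for the trained LSTM's prediction):

```python
def next_word(words):
    # Toy stand-in: a real model would run the LSTM over `words`
    # and return the most likely (or a sampled) next word.
    toy_lm = {"the": "cat", "cat": "sat", "sat": "down"}
    return toy_lm.get(words[-1], "<eos>")

def generate_sentence(first_word, max_len=20):
    words = [first_word]
    while len(words) < max_len:
        word = next_word(words)
        if word == "<eos>":      # stop when the model emits an end token
            break
        words.append(word)
    return " ".join(words)

print(generate_sentence("the"))  # -> "the cat sat down"
```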

However, when it comes to the next sentence, what should its first word be? The goal is to generate a paragraph of multiple sentences.

Dee
  • How do you know when the first sentence finishes? If punctuation is included in the "words", then you can just feed that into the model. Or otherwise the last word of the previous sentence. – Oliver Mason Jan 02 '20 at 11:36
  • 1
    @datdinhquoc similar to what Oliver just mentioned, but really its how it was trained. This may include just using the last token in the last sentence (probable if punctuation is included) or a seperator token or etc. Usually how it was trained will implicate how inference will work – mshlis Jan 02 '20 at 16:50
  • Interesting, so use the end-of-sentence punctuation to start generating the next sentence – Dee Jan 03 '20 at 03:24

2 Answers


Take the sentence that was generated by your LSTM and feed it back into the LSTM as input; the LSTM will then generate the next sentence. So the LSTM is using its previous output as its input, which is what makes it recursive. The initial word is just your base case. You should also consider using GPT-2 by OpenAI for this. It's pretty impressive. https://openai.com/blog/better-language-models/
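If you just want to see that in action without training anything, one possible way (a sketch, assuming the Hugging Face transformers library and its pretrained gpt2 model) is:

```python
# pip install transformers
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Give it the first sentence; it keeps generating across
# sentence boundaries on its own.
result = generator("The weather was perfect for a walk.",
                   max_length=60, num_return_sequences=1)
print(result[0]["generated_text"])
```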

jgleoj23

As you know, an LSTM language model takes in the past words, tries to predict the next one, and continues in a loop. A sentence is divided into tokens, and depending on the method, the text is tokenized differently. Some models are character-based, simply using each character as input and output; in that case you can treat punctuation as one more character and run the model as normal. For word-based models, which are commonly used in many systems, punctuation is treated as its own token, commonly called an end-of-sentence token. There is also often a specific token for the end of the output, which lets the system know when to finish and stop predicting.
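As an illustration (a sketch, not how any particular library does it), a word-level tokenizer can keep sentence-final punctuation as its own token and append an explicit end-of-output marker:

```python
import re

def tokenize(text):
    # Words become tokens; sentence-final punctuation becomes its own
    # token, which acts as an end-of-sentence marker during training.
    return re.findall(r"[a-z']+|[.!?]", text.lower())

tokens = tokenize("The cat sat. The dog barked!")
tokens.append("<end>")   # explicit end-of-output token
print(tokens)
# ['the', 'cat', 'sat', '.', 'the', 'dog', 'barked', '!', '<end>']
```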

Also, just so you know, language models that try to generate original text feed each output back in as the input of the next step, but the output they choose is not necessarily the single most likely word: they sample from the predicted distribution (for example, among words above some probability threshold). This introduces diversity into the language model, so that even when the starting word is the same, the generated sentence/paragraph will be different rather than the same one again and again.
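For example, a minimal sampling step (a sketch using temperature scaling, one common way to implement that idea) might look like:

```python
import numpy as np

def sample_next(probs, temperature=0.8):
    # Reweight the model's predicted distribution and sample from it
    # instead of always taking the argmax. Lower temperature -> safer,
    # higher temperature -> more diverse output.
    logits = np.log(np.asarray(probs) + 1e-9) / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return np.random.choice(len(weights), p=weights)

vocab = ["the", "cat", "sat", "."]
probs = [0.5, 0.3, 0.15, 0.05]       # pretend LSTM output
print(vocab[sample_next(probs)])     # varies from run to run
```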

For state-of-the-art results, you can try GPT-2, as mentioned by @jgleoj23. It is a byte-level model (it applies byte-pair encoding to the raw bytes of the text, so any Unicode input can be represented) that uses attention and transformers. The advantage of working below the word level is that even inputs with spelling errors can be fed to the model, and new words that are not in the dictionary can still be handled.

However, if you want to learn more about how language models work, rather than just striving for the best performance, you should try implementing a simple one yourself. You can follow this article, which uses Keras to build a word-based language model: https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/
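The models in that article are, roughly, an embedding layer feeding an LSTM with a softmax over the vocabulary. A minimal skeleton along those lines (the sizes here are placeholders, not the article's exact settings) would be:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size = 5000   # placeholder: number of distinct tokens
seq_length = 10     # placeholder: words of context per prediction

model = Sequential([
    Input(shape=(seq_length,)),
    Embedding(vocab_size, 50),                 # word ids -> vectors
    LSTM(100),                                 # sequence -> state
    Dense(vocab_size, activation="softmax"),   # distribution over next word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```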

The advantage of building a simple one is that you can actually understand the encoding process, the tokenization process, the model underneath, and so on, instead of relying on other people's code. The article uses the Keras Tokenizer, but you could try writing your own using regex and simple string processing, as sketched below.
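For instance, the word-to-index mapping that the Keras Tokenizer provides (its fit_on_texts/texts_to_sequences steps) can be reproduced in a few lines; this is only a sketch:

```python
import re

def fit_vocab(texts):
    # Assign each distinct word an integer id, reserving 0 for padding.
    words = sorted({w for t in texts
                      for w in re.findall(r"[a-z']+", t.lower())})
    return {w: i + 1 for i, w in enumerate(words)}

def texts_to_sequences(texts, vocab):
    return [[vocab[w] for w in re.findall(r"[a-z']+", t.lower()) if w in vocab]
            for t in texts]

corpus = ["The cat sat.", "The dog sat."]
vocab = fit_vocab(corpus)
print(texts_to_sequences(corpus, vocab))  # -> [[4, 1, 3], [4, 2, 3]]
```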

Hope this helps.

Clement