
I am diving into data-to-text generation for long articles (> 1000 words). After creating a template and filling it with data, I currently work at the paragraph level, adding different paragraphs that are randomly selected and put together. At the word level, I have also added different output formats for dates, times and numbers.
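Roughly simplified, my current approach looks like this (the templates and field names are only illustrative placeholders):

```python
import random
from datetime import date

# Several hand-written variants per paragraph slot; one is picked at random.
INTRO_VARIANTS = [
    "On {date}, {company} reported revenue of {revenue}.",
    "{company} announced revenue of {revenue} on {date}.",
]

# Word-level variation: different date and number formats.
DATE_FORMATS = ["%B %d, %Y", "%d %B %Y", "%Y-%m-%d"]

def format_number(value):
    return random.choice([f"{value:,.0f}", f"{value / 1e6:.1f} million"])

def render_intro(record):
    template = random.choice(INTRO_VARIANTS)
    return template.format(
        company=record["company"],
        date=record["date"].strftime(random.choice(DATE_FORMATS)),
        revenue=format_number(record["revenue"]),
    )

print(render_intro({"company": "Acme Corp", "date": date(2020, 7, 1), "revenue": 12_500_000}))
```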

The challenge I see is that when generating large amounts of such texts, they become boring to read, as their uniqueness for the reader goes down.

Furthermore, I also think it is easy to detect that such texts have been auto-generated, although I still have to validate this hypothesis.

I was wondering whether there is an even better method to bring variability into such texts.

Can you suggest any methods, papers or resources, or share your experience in this field?

I highly appreciate your replies!

Carol.Kar

2 Answers


The state of the art in text generation is the GPT model. GPT-3, which was just released in summer of 2020, has been used to generate many very impressive articles, and is widely considered the best text generation model. This article and this one should give you an example of how powerful it is at text generation.

GPT is a transformer-based architecture, somewhat like BERT. The main difference is that it is a decoder-only model and takes only the left context into account, which is why it is so well suited for text generation.

GPT-3 is still very new and is not available for free. However, GPT-2, the previous release of the model, is available for free. While obviously not as advanced as GPT-3, it is still quite impressive in its own right, and for someone trying to generate text, it is the clear choice.

Here is a link to a tutorial explaining the basics to get you started with GPT-2.
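To get a feeling for it without any training of your own, the Hugging Face `transformers` library exposes GPT-2 through a one-line pipeline (the prompt and sampling settings below are only illustrative):

```python
# pip install transformers torch
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuations reproducible
generator = pipeline("text-generation", model="gpt2")  # downloads the model on first use

outputs = generator(
    "The quarterly figures show that",
    max_length=60,           # total token count, prompt included
    num_return_sequences=3,  # three different continuations of the same prompt
    do_sample=True,          # sample instead of greedy decoding for more variety
)

for out in outputs:
    print(out["generated_text"])
    print("---")
```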

If you are interested in diving into the relevant research papers:

Here are the OpenAI papers on GPT-1 through GPT-3:

  1. Improving Language Understanding by Generative Pre-Training
  2. Language Models are Unsupervised Multitask Learners
  3. Language Models are Few-Shot Learners

Additionally, if you have never seen transformers before, take a look at:

  1. Attention Is All You Need
chessprogrammer

You could handwrite different templates and choose among them probabilistically, according to writing style or pragmatic effects such as irony, but that very much depends on the domain. If you have tabular data from which you want to generate text, you should probably forget about GPT and the like: you have little control over the generation process (despite copy mechanisms), since you essentially predict the most probable next word sequence for a given length. GPTs do not author coherent text across paragraphs, especially not when the text is longer than a few hundred words.
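A rough illustration of what I mean by choosing probabilistically (the style labels, weights and templates are made up):

```python
import random

# Hand-written templates grouped by writing style / pragmatic effect.
TEMPLATES = {
    "neutral":  ["{team} won against {opponent} {score}."],
    "dramatic": ["In a stunning turn, {team} crushed {opponent} {score}!"],
    "ironic":   ["Somehow, {team} managed to beat {opponent} {score}."],
}

# Probability of picking each style for a given article.
STYLE_WEIGHTS = {"neutral": 0.6, "dramatic": 0.3, "ironic": 0.1}

def generate(record):
    style = random.choices(list(STYLE_WEIGHTS), weights=list(STYLE_WEIGHTS.values()))[0]
    template = random.choice(TEMPLATES[style])
    return template.format(**record)

print(generate({"team": "FC Example", "opponent": "Real Placeholder", "score": "3-1"}))
```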

Check the linguistic counterpart (https://arxiv.org/abs/1703.09902) to end-to-end generation systems: break the network up into a pipeline again and use neural models only for the controllable sub-tasks. For example, build a module that selects which attributes to verbalise, create RDF triples from the column headers and values in your database, and then use a text-to-text transfer model (Google's T5) to transform them into surface text. You should also have a look at the WebNLG challenge (https://webnlg-challenge.loria.fr/challenge_2020/). This might help. Much of this is still open research. I am quite active in this area, so feel free to ask.
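A minimal sketch of the T5 surface-realisation step, assuming a Hugging Face `transformers` setup (the checkpoint name, triple linearisation and task prefix are placeholders, not a fixed convention):

```python
# pip install transformers sentencepiece torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Linearise RDF triples built from your table (subject | predicate | object).
triples = [
    ("Acme Corp", "revenue", "12.5 million USD"),
    ("Acme Corp", "headquarters", "Berlin"),
]
source = " && ".join(f"{s} | {p} | {o}" for s, p, o in triples)

# "t5-small" is only a stand-in; you would fine-tune it on WebNLG-style (triples, text) pairs.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The prefix is arbitrary; it only matters that fine-tuning and inference use the same one.
inputs = tokenizer("verbalise: " + source, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The off-the-shelf checkpoint will not produce good verbalisations on its own; the interesting part is the fine-tuning on data like the WebNLG corpus linked above.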