In short: It depends.
Where will you run it?
- On-premises: You may want to run it in your own environment.
- IaaS: GPT models are often too big to run alongside your application, so you might prefer to set up a dedicated server for the model, serving your API.
- PaaS: If it's more experimental, I recommend running it on Google Colab.
- SaaS: Or use an external API, so you don't need to worry about the setup at all and just consume it as a service. (easiest)
Each approach demands a different architecture and different code.
Once you've set up the environment / API, you'll run the model by providing an initial prompt and some parameters:
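For the on-premises / Colab routes, a minimal sketch using the Hugging Face transformers library (one possible setup among many) could look like:
from transformers import pipeline

# Downloads the GPT-2 weights on the first run.
generator = pipeline("text-generation", model="gpt2")

output = generator("My prompt goes here", max_new_tokens=50)
print(output[0]["generated_text"])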
GPT
GPT-2 (or any GPT model) is a general-purpose, open-domain text-generation model that tries to predict the next word for any given context.
So, setting up a "summarize mode" is not just a matter of flagging a parameter. It's a non-deterministic process that requires trial and error.
The GPT setup is experimental:
- Use a sandbox.
- Create an initial prompt.
- Set some parameters (temperature, top-p, top-k, ...; see the sketch after this list).
- Evaluate the results.
- Adjust the prompt.
- Adjust the parameters.
- Repeat until you consistently achieve the desired results.
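As a sketch of one iteration of that loop (parameter names follow the Hugging Face generate API used above; the values are arbitrary starting points, not recommendations):
output = generator(
    prompt,
    max_new_tokens=60,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # lower = more deterministic
    top_p=0.9,        # nucleus (Top-P) sampling
    top_k=50,         # consider only the 50 most likely next tokens
)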
The prompt
1. Simple way (zero-shot)
A very simple way to do it is something like:
prompt = text+"\nTL;DR:"
It should work especially well if your text is short and simple.
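Feeding that prompt into the pipeline from earlier (names carried over from that sketch; the slice strips the echoed prompt, since the pipeline returns prompt plus continuation by default):
prompt = text + "\nTL;DR:"
summary = generator(prompt, max_new_tokens=40)[0]["generated_text"][len(prompt):]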
2. Explicit introduction (zero-shot)
You may get better results if you first prompt with some context, for example:
prompt = ("Here is a text and its respective summary.\n"
          "#Full text: " + text + "\n"
          "#Summary:")
3. Add examples (few-shot learning)
Another approach is to fill your prompt with a few (high-quality) examples before your original input:
prompt = sample_1+"\nTL;DR:"+summary_1
prompt += "\n###\n"
prompt += sample_2+"\nTL;DR:"+summary_2
prompt += "\n###\n"
prompt += your_input+"\nTL;DR:"
The model then infers that the next logical thing to generate is a summary of your input.
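A small helper that assembles such a few-shot prompt from (sample, summary) pairs (the function name and separator are just illustrative):
def build_few_shot_prompt(examples, new_text, sep="\n###\n"):
    # examples: list of (sample_text, summary) pairs
    parts = [sample + "\nTL;DR:" + summary for sample, summary in examples]
    parts.append(new_text + "\nTL;DR:")  # the model should continue with a summary
    return sep.join(parts)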
The results will depend heavily on your text's size and style, so experiment to find the prompt template that best suits your needs.
Execution
Keep in mind that GPT will not learn from previous executions.
It has no memory and does no learning between executions, which means each input requires the whole prompt again.
So your main program should run a loop and prompt GPT once for every text file. Something like:
from glob import glob

for path in glob("./*.txt"):          # one prompt per text file
    with open(path, "r") as f:
        text = f.read()
    prompt = text + "\nTL;DR:"
    result = GPT(prompt, parameters)  # your model call or API request
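If you want to keep the outputs, a small extension inside that loop (the output file name is just one option) writes each summary next to its source:
    with open(path + ".summary.txt", "w") as out:
        out.write(result)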
Other tips to keep in mind:
If you have no examples:
- Results may differ from what you want (too long, too short, too informal, or omitting something you consider important).
- Results will be less stable (nailing it sometimes and ruining it other times).
If you have some examples, the model should replicate your examples' style. But:
- It may wrongly refer back to the examples (instead of to the target text).
- It may get confused if your text does not match the examples' length / style.
- The prompt may get too big.
If the original text is too big:
- It will require more computational power.
- It might exceed the number of tokens the model can digest, so you may need to split the text into chunks (see the sketch below).
- The model might get confused, e.g. summarizing only the last paragraph, so make it clear where the text starts and ends.
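A naive word-based splitter, as a sketch (word count is only a rough proxy for tokens; for exact budgeting, count with the model's own tokenizer):
def split_text(text, max_words=500):
    # Split into chunks of at most max_words words each.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]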