
Background: I'm currently trying to use GPT to give me numerical scores and am looking for tips on prompt design (see my previous StackExchange post). To craft good prompts it seems important to have a good understanding of how the generative model works...

Question: How many tokens ahead does GPT-3.5 look with its beam search feature?

Extra context: I found it hard to find good references for beam search; a decent starting point seemed to be a Hugging Face blog post.
I tried asking BingChat about GPT-3.5 beam search length: BingChat replied that it was 10 tokens but could only give a 'reference' to an OpenAI API page which did not seem to support the claim. I couldn't find any other results online.

Why I care: Suppose I have a long theatre review and want to score how impressed the critic was by the quality of acting, on a scale from -5 (extremely unimpressed) to +5 (extremely impressed). My prompts currently ask the model to finish the reply with a sentence of the form: "Overall the critic was very impressed by the quality of acting - score 4." But perhaps by asking the model to continue generating I can make the prompts more reliable. For example, I could ask the model to subsequently justify the score with a quotation from the text, along the lines of: "Overall the critic was very impressed by the quality of acting - score 4/5 - and indeed that 'Mark Strong's performance stole the show'"

Knowing the beam search length would really help me design prompts like these (which ask for a continuation of the text after the numerical score, to improve reliability).
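To make this concrete, here is a rough, hypothetical sketch of the kind of call I have in mind, using the OpenAI Python client; the model name, prompt wording, and the score_review helper are purely illustrative, not a tested setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_review(review_text: str) -> str:
    # Ask for the fixed-form scoring sentence, followed by a supporting
    # quotation from the review, as described above.
    prompt = (
        "Here is a theatre review:\n\n"
        f"{review_text}\n\n"
        "On a scale from -5 (extremely unimpressed) to +5 (extremely impressed), "
        "how impressed was the critic by the quality of acting? "
        "End your reply with a sentence of the form: "
        "\"Overall the critic was very impressed by the quality of acting - score 4/5 - "
        "and indeed that '<short quotation from the review>'\""
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the scoring as deterministic as possible
    )
    return response.choices[0].message.content
```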

1 Answer


I don't quite get the connection between the "Why I care" section and beam search... it seems you are confused about what beam search is, since you talk about a beam search length.

However, beam search is about the width, not the length... at each timestep, the transformer outputs a distribution over tokens, so for each token (for simplicity, think of them as words) you have a probability $p$ that it is the correct next word.

However, it's very hard for a model to output hard probabilities (0 or 1), since the data it's been trained on is noisy, so almost anything might come after almost anything.

For this reason, if you just sample from the raw distribution, you might occasionally pick a word with $p\approx 0$ (but still greater than 0). A common fix is to restrict sampling to the top-N most likely next tokens and renormalize the distribution over just those (this is top-k sampling). Beam search is a different idea: at each timestep it keeps the N most promising partial sequences (the "beams"), extends each of them with candidate next tokens, and keeps only the N continuations with the highest cumulative log-probability; at the end it returns the best-scoring completed sequence. So N controls how wide the search is, not how many tokens ahead the model looks; generation still happens one token at a time.
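To make the mechanics concrete, here is a minimal toy sketch of beam search; the vocabulary, probabilities, and function names are invented purely for illustration (a real model would be a transformer producing the per-step distribution):

```python
import math

# Toy stand-in for the model: return a next-token distribution given the
# sequence so far. These words and probabilities are made up.
def next_token_probs(sequence):
    return {"the": 0.4, "acting": 0.3, "was": 0.2, "superb": 0.1}

def beam_search(start, num_beams=2, steps=3):
    # Each beam is a (token_sequence, cumulative log-probability) pair.
    beams = [(start, 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        # The "width": keep only the num_beams best partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

for seq, score in beam_search(["overall"], num_beams=2, steps=3):
    print(" ".join(seq), f"(log-prob {score:.2f})")
```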

In other words, beam search does not limit your output length; it's a tradeoff between the greedy option (just take the single most likely word at each step) and exhaustively scoring every possible continuation (which would be intractable), and it is a separate knob from the sampling settings mentioned above.
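If you want to play with the difference between these decoding strategies, here is a rough sketch using Hugging Face transformers with GPT-2 as a stand-in (you can't run the hosted GPT-3.5 model this way, and the parameter values below are arbitrary examples, not recommendations):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Overall the critic was", return_tensors="pt")

# Greedy: always pick the single most likely next token.
greedy = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Beam search: track num_beams candidate sequences in parallel (the width).
beam = model.generate(**inputs, max_new_tokens=20, num_beams=5, do_sample=False)

# Top-k sampling: sample, but only from the top_k most likely tokens per step.
sampled = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)

for name, out in [("greedy", greedy), ("beam", beam), ("sampled", sampled)]:
    print(name, "->", tokenizer.decode(out[0], skip_special_tokens=True))
```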

  • You are right! Thanks! Sorry for being an idiot! Yes I had misunderstood what beam search is. Presume I should keep the question up here? Or do you think I should take it down? – just another mathmo Aug 07 '23 at 10:14
  • This (new/revised/correct) beam search understanding makes a lot more sense and is actually very useful for my project - so thank you for making me realise I had misunderstood! – just another mathmo Aug 07 '23 at 10:16