
Does it have something to do with smoothing out the token frequencies to a desired distribution? If so, what's that distribution? And how is it achieved?

Is there a separate paper about it? Or should I just dig through LLM papers?

oliver.c

1 Answer

Please look at my answer here before reading further.

If you look at TensorFlow's TextVectorization layer, you will find a keyword argument max_tokens. When you adapt your text dataset to the TextVectorization object, it limits the learned vocabulary to at most max_tokens entries. Essentially it is a frequency-based method: the lower the value of max_tokens, the fewer whole words fit into the vocabulary, so more words end up being represented by smaller, more abstract tokens (or, in TextVectorization's case, mapped to an out-of-vocabulary token).
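As a minimal sketch (assuming TensorFlow 2.x; the toy corpus and the max_tokens value below are made up for illustration), this is roughly how the vocabulary cap behaves:

    import tensorflow as tf

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ]

    # Cap the vocabulary at 6 entries; index 0 is reserved for padding ('')
    # and index 1 for the out-of-vocabulary token '[UNK]'.
    vectorizer = tf.keras.layers.TextVectorization(max_tokens=6)
    vectorizer.adapt(corpus)

    print(vectorizer.get_vocabulary())
    # Only the most frequent words survive; anything else maps to '[UNK]'.
    print(vectorizer(tf.constant(["the cat chased the mouse"])))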

Building on that, here is an illustrative example: imagine your dataset strictly contains only English words, with no punctuation marks whatsoever, and you limit your vocabulary size to 27. Guess what would happen?

Your final output vocabulary will look something like this:

    {'a', 'b', 'c', ..., ' '}

Note that one of the elements in the vocabulary is the space character.

In the end, the vocabulary is a set of tokens from which any word, phrase, sentence, or paragraph in the training data can be reconstructed.
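To make that concrete, here is a toy sketch of the 27-symbol idea (purely illustrative; real tokenizers learn subword vocabularies rather than using raw characters):

    vocab = list("abcdefghijklmnopqrstuvwxyz") + [" "]
    stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> id
    itos = {i: ch for ch, i in stoi.items()}       # id -> character

    def tokenize(text):
        return [stoi[ch] for ch in text]           # every character is in the vocabulary

    def detokenize(ids):
        return "".join(itos[i] for i in ids)       # lossless reconstruction

    ids = tokenize("any english sentence")
    assert detokenize(ids) == "any english sentence"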

The example above is strictly an illustration, as different methods have their own standardizations. For instance, a space is almost never represented as a literal ' ' token; instead a marker is used, e.g. a token like a# that means "after a, put a space before the next token", and so on.

Now, there are many variables at play here. Ideally, we want as large a vocabulary as possible, because the more tokens we have, the fewer abstractions our neural network theoretically needs to learn. However, a larger vocabulary also means larger embedding and output layers, so memory and compute costs grow with the vocabulary size.
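As a back-of-the-envelope sketch (the hidden size and vocabulary sizes below are my own made-up examples), the embedding matrix alone grows linearly with the vocabulary:

    d_model = 768                      # hidden size, roughly BERT-base scale
    for vocab_size in (1_000, 30_000, 250_000):
        embedding_params = vocab_size * d_model
        print(f"vocab={vocab_size:>7} -> embedding parameters = {embedding_params:,}")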

So, different methods have designed different schemes to define tokens, and many are openly available for commercial use. Take, for example, BERT's WordPiece, or Gensim (an NLP library). Also keep in mind that WordPiece is a learned tokenizer, i.e. with some predefined variables it was trained iteratively on data (very much like TensorFlow's TextVectorization). On the other hand, there are also non-learned, rule-based tokenizers such as Stanford's word tokenizer.
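To see the vocabulary-size effect with a subword tokenizer, here is a hedged sketch using the Hugging Face tokenizers library (not something the answer above relies on; the corpus and vocabulary sizes are invented for illustration):

    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    corpus = [
        "tokenization turns text into tokens",
        "tokenizers learn their vocabulary from data",
    ]

    for vocab_size in (30, 200):
        tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
        trainer = trainers.WordPieceTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
        tok.train_from_iterator(corpus, trainer)
        # With the smaller budget, "tokenization" should come out split into more '##' pieces.
        print(vocab_size, tok.encode("tokenization").tokens)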

Chinmay