On a very basic level, you are absolutely correct about the encoding of the attached sentence. But in practice, when you have a set of n documents to encode, things play out a little differently.
Let's say we have n=1000 independent sentences.
Now, doing it by hand is a little difficult, but let's say we have to. You would start by making a list of all the unique words and assigning each of them a number; let's call this our vocabulary. It might look like this:
{
'the': 1,
'menace': 2,
'shirt': 3,
...
}
Now that each possible word has its own id, you go back to each sentence and encode it using the numbers you assigned in the last step. Note that with this method you stay completely neutral about misspelled words, assigning them their own numbers as you find them. Of course, you could choose to correct the spellings, but you are already analyzing 1000 lines, so why add one more task? Each line will now look something like what you mentioned in your question.
You can use this encoded information in algorithms.
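To make that manual process concrete, here is a minimal Python sketch of the word-level approach described above (the sentences, words, and ids are made up purely for illustration):

```python
# A minimal sketch of the manual, word-level encoding described above.
# The sentences and the resulting ids are made up for illustration.
sentences = [
    "the menace wore the shirt",
    "the shirt was red",
]

# Step 1: build a vocabulary, one id per unique word, in order of first appearance.
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # start ids at 1, as in the example above

# Step 2: encode every sentence using the ids from the vocabulary.
encoded = [[vocab[word] for word in sentence.split()] for sentence in sentences]

print(vocab)    # {'the': 1, 'menace': 2, 'wore': 3, 'shirt': 4, 'was': 5, 'red': 6}
print(encoded)  # [[1, 2, 3, 1, 4], [1, 4, 5, 6]]
```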
But to get better results, you will, as mentioned, need to correct the misspelled words, right? Now, what if instead of n=1000 it were n=1,000,000, or, like some trained models out there, n=4 billion documents? You can't possibly correct the spellings by hand (manually), right?
So researchers came up with more standardized methods. Essentially, we define a set of tokens. A token might be a whole word like `the` or `hell`, or a shorter sequence of letters like `1`, `a`, `b`, `c`, `blad`, `ght`, etc. These tokens are chosen based on several statistics, the most important of which are length and frequency of occurrence. Basically, if a token is too long it takes up space in the set, and if it is used too rarely it can be broken into smaller tokens that may also be parts of other words. The idea is to end up with tokens that have a similar frequency of occurrence when measured over a large corpus of data, but also to include small tokens like `a`, `b`, etc. that can be used to cover new and previously unseen words. As you very correctly put it, computers understand numbers, and the numbers can be anything as long as they make sense to the computer.
These sets form a vocabulary of tokens that can be used to encode any word. Here are some examples to help consolidate my points:
Revolution (correct spelling)
tokenized - ['rev', 'ol', 'u', 'tion']
Revoluton (incorrect spelling)
tokenized - ['rev', 'ol', 'u', 't', 'on']
Hi
tokenized - ['Hi']
and so on.
NOTE - The aforementioned tokens are only for illustration and might (probably will) differ from tokenizer to tokenizer.
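To get a rough feel for how such a token vocabulary gets applied, here is a toy greedy longest-match tokenizer in Python. The token set and the exact splits are invented for this answer; real tokenizers (BPE, WordPiece, SentencePiece, etc.) learn their vocabularies from a corpus and use more sophisticated matching rules:

```python
# Toy greedy longest-match tokenizer over a hand-picked token vocabulary.
# Both the vocabulary and the splits it produces are purely illustrative.
TOKENS = {"rev", "ol", "u", "tion", "t", "on", "hi",
          "a", "b", "c", "e", "i", "l", "n", "o", "r", "v"}

def tokenize(word):
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        # Take the longest vocabulary entry that matches at the current position.
        for end in range(len(word), start, -1):
            if word[start:end] in TOKENS:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No token matches: fall back to an unknown marker so we always advance.
            pieces.append("[UNK]")
            start += 1
    return pieces

print(tokenize("Revolution"))  # ['rev', 'ol', 'u', 'tion']
print(tokenize("Revoluton"))   # ['rev', 'ol', 'u', 't', 'on']
print(tokenize("Hi"))          # ['hi']
```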
So now, instead of defining your own tokenizer, which would need to be designed with multiple aspects and considerations in mind, you can use a standardized one that is already designed and available online. Some examples are NLTK's tokenizer, spaCy's tokenizer, Gensim's tokenizer, and WordPiece (used by BERT), which would fit your case.
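As a rough, hedged example (assuming `nltk` and `transformers` are installed; exact outputs can differ between versions and models), using off-the-shelf tokenizers looks something like this:

```python
# Word-level tokenization with NLTK.
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize

print(word_tokenize("The menace wore the shirt."))
# something like ['The', 'menace', 'wore', 'the', 'shirt', '.']

# Subword (WordPiece) tokenization with BERT's tokenizer from Hugging Face.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Revoluton")  # misspelled word still maps to known pieces
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))  # the numbers the model actually sees
```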
Additionally, not using such a tokenizer has its own shortcomings, some of which are:
If you add independent words to the vocabulary, it quickly gets out of hand (consider that the English language has around 1 million words, not to mention other languages if you plan on training a translator). Good tokenizers have achieved strong results with a token vocabulary of as few as 30,000 tokens.
As noted, the same words with wrong spellings take up space of their own and make the vocabulary unnecessarily long. With standard tokens, you don't have to worry about misspelled words, as they can be accounted for easily.
A vocabulary based on whole words will also fail on new words, so every time new information comes up you have to add a new element to the vocabulary. As with misspelled words, new words can be handled by standard token-based vocabularies, as shown in the short sketch after this list.
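As a quick contrast (reusing the toy `tokenize` function sketched earlier; the word-level `word_vocab` here is hypothetical):

```python
# A word-level vocabulary has no entry for a previously unseen word...
word_vocab = {"the": 1, "menace": 2, "shirt": 3}
print(word_vocab.get("revolution"))  # None -> you would have to grow the vocabulary

# ...whereas a token-based vocabulary can still cover it with smaller pieces.
print(tokenize("revolution"))        # ['rev', 'ol', 'u', 'tion']
```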
Hope this helps.