Consider the following paragraphs from the introduction of the chapter "Recurrent Neural Networks" in the textbook Dive into Deep Learning:
So far we encountered two types of data: tabular data and image data. For the latter we designed specialized layers to take advantage of the regularity in them. In other words, if we were to permute the pixels in an image, it would be much more difficult to reason about its content of something that would look much like the background of a test pattern in the times of analog TV.
Most importantly, so far we tacitly assumed that our data are all drawn from some distribution, and all the examples are independently and identically distributed (i.i.d.). Unfortunately, this is not true for most data. For instance, the words in this paragraph are written in sequence, and it would be quite difficult to decipher its meaning if they were permuted randomly. Likewise, image frames in a video, the audio signal in a conversation, and the browsing behavior on a website, all follow sequential order. It is thus reasonable to assume that specialized models for such data will do better at describing them.
In neural networks, we generally use the words instance, example, and data point interchangeably to refer to a particular row of a dataset. Typically, for a CNN an instance is an image, and for an RNN an instance is a sequence of text (a sentence, a paragraph, or a whole document).
Every instance contains features: in image data, pixels are generally treated as features, and in text data, either characters or words are generally treated as features.
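To make the terminology I am using concrete, here is a minimal sketch in plain Python. The toy data and the names `images`, `sentences`, and `count_features` are made up for illustration and do not come from the book; the point is only that each image or sentence is one instance, while its pixels or words are that instance's features.

```python
# Toy data, invented purely for illustration (not from the textbook).
# Two "images", each a 2x2 grid of pixel intensities.
images = [
    [[0, 255], [128, 64]],
    [[10, 20], [30, 40]],
]

# Two "sentences", each a sequence of word tokens.
sentences = [
    "the cat sat".split(),
    "dogs bark loudly at night".split(),
]


def count_features(instance):
    """Count the leaf values (pixels or tokens) inside one instance."""
    if isinstance(instance, list):
        return sum(count_features(item) for item in instance)
    return 1


# One entry per instance: how many features that instance holds.
image_feature_counts = [count_features(img) for img in images]
sentence_feature_counts = [count_features(s) for s in sentences]

print(len(images), image_feature_counts)        # 2 [4, 4]
print(len(sentences), sentence_feature_counts)  # 2 [3, 5]
```

In this picture, the dataset has two instances in each case, and the features live inside an instance: four pixels per image, and three or five words per sentence.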
With this context, let me explain my doubt.
I have trouble understanding the quoted paragraphs. My issue is that they seem to compare examples/instances of image data with features of text data. As I see it, pixels correspond to words or characters, and images correspond to sentences, paragraphs, or text documents, depending on the context.
The second paragraph says that the examples (images) are i.i.d. It then contrasts this with the words in a paragraph, but words are features, not examples. Since the paragraph does not claim that pixels are i.i.d., how can the words in a paragraph be used for the comparison?