
The acronym "iid" stands for "independent and identically distributed". It is a property of a sequence of random variables. You can read here for more details. This question is only about how the word "iid" is used in contemporary machine learning. It is not about whether iid can feasibly be checked from either the associated joint distribution or a dataset.

In the formal and strict sense, the word "iid" should be used only as a property of a sequence of random variables, defined through the underlying joint probability distribution function. However, I have noticed another (maybe less strict) usage of the word "iid" that depends on the context.
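For reference, the strict sense can be written out as below. The notation is an assumption introduced here and does not come from the linked answers: X_1, ..., X_n are the random variables and F is their common marginal CDF.

```latex
F_{X_1,\dots,X_n}(x_1,\dots,x_n) \;=\; \prod_{i=1}^{n} F(x_i)
\qquad \text{for all } x_1,\dots,x_n \in \mathbb{R}
```

The factorization into a product expresses independence, and the fact that every factor uses the same F expresses identical distribution.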

Consider the following statements, compiled from different answers to my questions 1, 2:

From this answer

The term i.i.d. is a property of a dataset. A dataset can be created that is i.i.d. with respect to a particular probability distribution. It doesn't matter what that distribution is, it just has to exist, and be relevant to the purpose the ML is being put to.

From this answer

The point is even you know the distribution, sometimes you can't prove that the sampled data is i.i.d. or not!....

From this answer

....A table of results of dice throws is likely iid...... (there are some issues with this answer, but the bolded excerpt is true)

So the usage of the word "iid" in this sense is somewhat different. I think iid is still a property of a sequence of random variables here, but it seems acceptable to use the word "iid" for a dataset (a collection of samples), since the dataset represents some underlying probability distribution.

Thus, the two usages I am aware of so far are:

  1. iid for a sequence of random variables based on joint distribution.

  2. iid for a sequence of random variables based on the collection of samples.
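A minimal sketch of how the two usages relate, assuming NumPy and a fair six-sided die as a purely illustrative example (the function name `roll_dice`, the seed, and the sample size are hypothetical choices, not taken from the linked answers). The sampling call plays the role of usage 1, and the array it returns plays the role of usage 2.

```python
import numpy as np


def roll_dice(n_throws, seed):
    """Return one realization of n_throws modeled-as-iid fair-die throws."""
    rng = np.random.default_rng(seed)
    # Usage 1: the model. Each throw X_i is uniform on {1, ..., 6} and the
    # throws are drawn independently, so the joint distribution factorizes.
    return rng.integers(1, 7, size=n_throws)  # upper bound 7 is exclusive


# Usage 2: the dataset. One concrete table of results produced by the model.
dataset = roll_dice(10_000, seed=0)

# Summaries that are merely consistent with iid sampling. They do not prove
# it: the iid property belongs to the process that generated the table.
freqs = np.bincount(dataset, minlength=7)[1:] / dataset.size
lag1_corr = np.corrcoef(dataset[:-1], dataset[1:])[0, 1]

print("empirical frequencies:", np.round(freqs, 3))
print("lag-1 sample correlation:", round(float(lag1_corr), 4))
```

In this sketch, calling the table "iid" is shorthand for saying that it was produced by a process whose random variables are iid.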

Is my understanding of the two usages of the word "iid" correct? Are there any other usages of the word "iid"?

hanugm
  • @NeilSlater what about the bolded portion? – hanugm Sep 14 '21 at 07:43
  • "[A] table of results of dice throws is likely iid" seems correct to me (with reasonable assumptions). However, the non-bolded "it is because the dice roll itself is iid" is not correct. – Neil Slater Sep 14 '21 at 07:46
  • Why do you get hung up on definitions of words so much? The important thing is the concept, not the association between words and concepts. Different people may use the same word with slightly different definitions, and that's fine. – user253751 Sep 14 '21 at 08:10
  • @user253751 It may be due to self-study. I am trying to understand underlying concepts. Unfortunately, these terminological issues are always popping up for me. – hanugm Sep 14 '21 at 08:25
  • Consider that a sample *is* a random variable, in a way. When you sample random variable X 5 times, it is the same as sampling once from each of 5 independent random variables whose distribution is identical to the distribution of X. – user253751 Sep 14 '21 at 08:30
  • Oh! seems to be intuitive. I will think on it... Then, it may be similar to unbiased sampling. @user253751 – hanugm Sep 14 '21 at 08:32
