Is it okay to think of any dataset in artificial intelligence as a mathematical set?

Question

A dataset is a collection of data points. The data points in a dataset can repeat, and the repetition does matter when building AI models.

So, why does the word dataset contain the word set? Does it have any relation with the mathematical set, where order and repetition do not matter?

[page:#9](https://www.deeplearningbook.org/contents/ml.html) — hanugm, Aug 29 '21 at 05:22

nbro · Accepted Answer · 2020-11-28T20:10:29.840

It's true that your original dataset can contain duplicates, so, strictly speaking, calling it a set is not consistent with the mathematical definition of a set. There are mathematical objects known as multi-sets that can contain duplicates, but where the order of the elements is still not relevant. There are also tuples and sequences, where both the order of the elements and their repetition matter.
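To make the distinction concrete, here is a minimal Python sketch. The toy data and all variable names are assumptions made for illustration only: `set` stands in for a mathematical set, `collections.Counter` for a multi-set, and `tuple` for a sequence.

```python
from collections import Counter

# Hypothetical toy data with one repeated value.
data = ["cat", "dog", "cat", "bird"]
reordered = ["dog", "cat", "bird", "cat"]  # same elements, different order

as_set = set(data)           # duplicates and order are both discarded
as_multiset = Counter(data)  # multiplicities are kept, order is ignored
as_sequence = tuple(data)    # order and duplicates are both kept

# Compare each view of `data` with the same view of `reordered`.
same_as_set = set(reordered) == as_set
same_as_multiset = Counter(reordered) == as_multiset
same_as_sequence = tuple(reordered) == as_sequence

print(len(as_set), as_multiset["cat"], same_as_set, same_as_multiset, same_as_sequence)
# prints: 3 2 True True False
```

Only the sequence view distinguishes the two orderings, and only the set view loses the information that "cat" occurs twice.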

If you want to get rid of the duplicate elements in your dataset, you can remove them in a pre-processing step. Even then, if you train with mini-batches (e.g. with mini-batch stochastic gradient descent), the mini-batches can still contain repeated elements: if you sample with replacement, the same element may be drawn into different batches or even more than once into the same batch. Of course, this depends on how you sample your training dataset to build the batches (or mini-batches). So, if you do not want duplicates even in the mini-batches, you need to sample without replacement.
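The following is a rough sketch of both ideas using only Python's standard library. The toy dataset, the batch size, and the variable names are hypothetical, and real training code would typically rely on a framework's data loader instead.

```python
import random

rng = random.Random(0)  # fixed seed, only so that the sketch is reproducible

dataset = [3, 1, 3, 2, 1, 5]  # hypothetical toy dataset with duplicates
batch_size = 2

# Pre-processing step: drop duplicates while keeping first-seen order.
unique = list(dict.fromkeys(dataset))  # [3, 1, 2, 5]

# Sampling with replacement: the same element may be drawn more than once,
# even within a single batch.
batch_with = rng.choices(unique, k=batch_size)

# Sampling without replacement over one pass through the data:
# shuffle a copy once, then cut it into consecutive batches.
shuffled = unique[:]
rng.shuffle(shuffled)
batches_without = [
    shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)
]
```

With the second scheme, every element of `unique` appears in exactly one batch per pass, whereas `batch_with` carries no such guarantee.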

Moreover, there are datasets that contain elements whose order in the dataset can be relevant for the predictions, such as datasets of time-series data, while, in mathematical sets and multi-sets, the order of the elements does not matter.
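As a small illustration of why order can carry information, the sketch below compares a made-up series with a permutation of itself. The values and the helper `first_differences` are hypothetical, chosen only to show that the two collections are equal as multi-sets but behave differently as sequences.

```python
from collections import Counter

series = [1.0, 2.0, 4.0, 7.0]    # hypothetical measurements in time order
permuted = [4.0, 1.0, 7.0, 2.0]  # same values, different order


def first_differences(xs):
    """Return the change between each pair of consecutive values."""
    return [b - a for a, b in zip(xs, xs[1:])]


same_multiset = Counter(series) == Counter(permuted)  # True
diffs_series = first_differences(series)              # [1.0, 2.0, 3.0]
diffs_permuted = first_differences(permuted)          # [-3.0, 6.0, -5.0]
```

Any quantity that depends on neighbouring points, such as a trend, changes under reordering, which is why such data should be treated as a sequence.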

So, yes, it is often called a dataset (or data set), but it is not necessarily a set in the mathematical sense. In general, it should just be interpreted as a collection of data. In scenarios where the order of the elements or the existence of duplicates in the dataset (or any other property of your collection of data) is relevant, you should probably state it explicitly.
