I came across this answer on Quora, but it was pretty sparse. I'm looking for specific meanings in the context of machine learning, but also mathematical and economic notions of the term in general.
-
I would encourage you to limit the scope of this question to just machine learning, to avoid making it too broad. – nbro Jan 20 '22 at 16:08
2 Answers
When formulating a problem in deep learning, we need to come up with a loss function, which takes the model weights as parameters. Back-propagation starts at an arbitrary point on the error manifold defined by the loss function and, with every iteration, updates the weights so as to move closer to a point that minimises the error. Essentially, for every possible set of weights the model can have there is an associated loss for a given loss function, and our goal is to find the minimum point on this manifold.
Convergence is a term most common mathematically in the study of sequences and series. A model is said to converge when the sequence $s(n) = \text{loss}_{w_n}(\hat y, y)$ (where $w_n$ is the set of weights after the $n$-th iteration of back-propagation and $s(n)$ is the $n$-th term of the sequence) is a convergent sequence. The sequence is of course infinite only if you assume that a loss of exactly $0$ is never actually achieved and that the learning rate keeps getting smaller.
Essentially, a model converges when its loss moves towards a minimum (local or global) with a decreasing trend. It's quite rare to actually come across a strictly converging model, but the term is commonly used in the same loose manner as convexity: the strict property rarely holds in practice, yet we speak of it to convey how close the model is to the ideal scenario, in this case convergence.
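As a loose illustration of a converging loss sequence, here is a minimal sketch of my own, assuming a 1-D quadratic loss and plain gradient descent (the loss, starting point, learning rate and tolerance are purely illustrative):

```python
import numpy as np

# Minimal sketch: gradient descent on the 1-D quadratic loss L(w) = (w - 3)^2.
# The sequence of loss values s(n) decreases towards the minimum (0 here),
# which is the sense in which "the model converges".
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 10.0          # arbitrary starting point on the "error manifold"
lr = 0.1          # learning rate
losses = []
for n in range(1000):
    losses.append(loss(w))
    w -= lr * grad(w)
    # declare convergence when successive losses barely change any more
    if n > 0 and abs(losses[-1] - losses[-2]) < 1e-12:
        break

print(f"stopped after {n} iterations, w = {w:.6f}, loss = {losses[-1]:.2e}")
```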

This is actually a highly technical term, which has been kind of misused and overgeneralized in many places.
What does 'convergence' mean in a literal sense? It simply means that a sequence of terms indexed by $\mathbb{N}$ ($X_1, X_2, X_3, \ldots$) tends to a certain fixed value, say $X$, as the index $n \rightarrow \infty$, but may never actually attain that value. (There are a few technical details associated with this definition, but I won't go into them, as they require some analysis.)
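For example, the sequence $X_n = 1 + \frac{1}{n}$ converges to $1$: its terms get arbitrarily close to $1$ as $n$ grows, yet no term ever equals $1$.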
When it comes to ML, we are looking at probabilistic or stochastic models. When we talk about convergence in ML, we generally mean one of four types of convergence:
Convergence in Probability: This means that, as $N \rightarrow \infty$, the probability that $X_N$ (a sequence of random variables) is far from $X$ goes to zero, i.e. $$\lim_{N \to \infty} P\big(|X_N - X| > \epsilon\big) = 0 \quad \text{for every } \epsilon > 0.$$ This type of convergence is mostly used in Statistical Learning Theory. (A small numerical illustration of this mode follows the list of modes below.)
Almost Sure Convergence: This means that the probability that the sequence $X_N$ (a sequence of random variables) actually tends to $X$ is $1$ (NOTE: here there is no mere "likelihood" of being close to $X$; we straight up say the sequence converges to $X$ with probability $1$), i.e. $$P\Big(\lim_{N \to \infty} X_N = X\Big) = 1.$$ This is a stronger version of the previous type of convergence, and this is the type of convergence I have seen being used in RL.
Convergence in Distribution: This means that the distributions of a sequence of random variables tend to a certain distribution, i.e. $$\lim_{N\to \infty} F_{X_N}(x) = F_X(x)$$ at every point $x$ where $F_X$ is continuous.
Convergence in $r$-th moment: This means that a sequence of random variables converges to a certain value in the sense of the $r$-th moment as the sequence goes to infinity, or simply put: $$\lim_{N\to \infty} \mathbb{E}\big[|X_N - \mu|^r\big] = 0$$ where $\mu$ is the value to which the random variables converge in $r$-th moment.
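To make convergence in probability concrete, here is a minimal sketch of my own (the coin-flip setting, $\epsilon$, and sample sizes are just illustrative): by the weak law of large numbers, the sample mean of $N$ fair coin flips converges in probability to $0.5$, and we can estimate $P(|X_N - 0.5| > \epsilon)$ by Monte Carlo for growing $N$ and watch it shrink.

```python
import numpy as np

# Illustration: the sample mean of N fair coin flips converges in probability
# to 0.5 (weak law of large numbers).  Estimate P(|X_N - 0.5| > eps) by
# Monte Carlo over many independent runs, for increasing N.
rng = np.random.default_rng(0)
eps = 0.05
trials = 5000

for N in [10, 100, 1000, 10000]:
    flips = rng.integers(0, 2, size=(trials, N))    # trials independent runs
    sample_means = flips.mean(axis=1)               # one X_N per run
    p_far = np.mean(np.abs(sample_means - 0.5) > eps)
    print(f"N = {N:6d}   estimated P(|X_N - 0.5| > {eps}) = {p_far:.4f}")
```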
A simple useful reference for all the aforementioned modes of convergence.
As a side note, this is meant as an informal reference; there is a lot of mathematical analysis involved in deriving conditions under which these hold for a sequence of random variables.
In the context of ML, one can think of the loss $L(w_N, y_N, x_N)$ in place of $X_N$, where $w_N, x_N, y_N$ are the weights and data that determine the next step, and one can check whether it satisfies any of the aforementioned modes of convergence using suitable sufficient conditions. Note that when we do convex optimization we are talking about almost sure convergence (if the method used works), while for SGD, due to its stochasticity, one might formulate convergence in the convergence-in-probability setting (see the sketch below).
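As a rough, hypothetical illustration of the SGD case (the loss, noise level, starting point and step-size schedule are all assumptions made only for this sketch): running many independent SGD runs on a simple quadratic with noisy gradients, the fraction of runs whose iterate is still far from the optimum after $N$ steps shrinks as $N$ grows, which is the convergence-in-probability flavour.

```python
import numpy as np

# Hypothetical sketch: SGD on L(w) = 0.5 * (w - 2)^2 with additive Gaussian
# gradient noise and a decaying step size 1/n.  Across many independent runs,
# the fraction of runs whose iterate w_N is still far from the optimum
# w* = 2 shrinks as N grows -- the convergence-in-probability picture.
rng = np.random.default_rng(1)
w_star, eps, runs = 2.0, 0.1, 2000

for N in [10, 100, 1000]:
    w = np.zeros(runs)                        # every run starts at w_0 = 0
    for n in range(1, N + 1):
        noisy_grad = (w - w_star) + rng.normal(0.0, 1.0, size=runs)
        w -= (1.0 / n) * noisy_grad           # step size decays to 0
    p_far = np.mean(np.abs(w - w_star) > eps)
    print(f"N = {N:5d}   estimated P(|w_N - w*| > {eps}) = {p_far:.3f}")
```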
As a concrete example, the PAC learning paradigm uses the convergence-in-probability framework: without going into details, the idea of PAC learning is that, as the size of the dataset increases, your confidence about your classifier increases, which can be interpreted as a kind of convergence in probability with the actual loss as the random variable (check the PAC learning framework here). The convergence of Q-learning (proof as suggested in the comments) is an almost sure convergence under some assumptions (probably proved by Bertsekas and Tsitsiklis), and the CLT is an example of convergence in distribution.
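To give a feel for the PAC-style statement, here is a sketch using the textbook Hoeffding bound for a single fixed hypothesis with a loss bounded in $[0,1]$ (this is my own illustration, not part of a full PAC proof, which would also need a union bound over the hypothesis class):

```python
import math

# Hoeffding's inequality for a loss bounded in [0, 1] and a fixed hypothesis:
#     P(|empirical risk - true risk| > eps) <= 2 * exp(-2 * m * eps**2)
# for a dataset of size m.  The bound shrinking with m is the
# convergence-in-probability flavour behind PAC-style guarantees.
eps = 0.05
for m in [100, 1000, 10000, 100000]:
    bound = 2.0 * math.exp(-2.0 * m * eps ** 2)
    print(f"m = {m:6d}   P(|emp - true| > {eps}) <= {min(bound, 1.0):.4f}")

# Equivalently, to be within eps with confidence 1 - delta you need roughly
# m >= ln(2 / delta) / (2 * eps**2) samples.
delta = 0.01
m_needed = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
print(f"samples needed for eps={eps}, delta={delta}: {m_needed}")
```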
-
I would suggest that you use a consistent notation for all types of convergence. For example, use the "limits notation" when they are applicable. For example, in the last 2 cases, you use limits, but in the first 2 cases, you use "$\rightarrow \cdots $ as something $\rightarrow$", i.e. use the notation used by the reference https://www.probabilitycourse.com/chapter7/7_2_5_convergence_in_probability.php in all cases. I would also suggest that you use double `$$` to emphasize the formulas and their differences. – nbro Feb 04 '21 at 22:22
-
As an example of "convergence being used in RL", see e.g. http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf, which you can add to your answer as an example. I would also suggest that you provide the links to the references that support your other claims, like "PAC learning paradigm uses the convergence in probability framework". This answer can also be improved by distinguishing between r.v.s and realizations of r.v.s, etc. Convergence of a series is not exactly the same thing as the convergence of r.v.s., but you use "series of terms", which may be misleading. – nbro Feb 04 '21 at 22:25