2 Answers


The adjective bad isn't mathematically descriptive. A better term is sub-optimal, which implies that the state of learning might appear optimal based on current information, but that the optimal solution among all possibilities has not yet been located.

Consider a graph of a loss function, one of the names for a measure of the disparity between the current learning state and the optimal one. Some papers use the term error instead. In all learning and adaptive control cases, it quantifies the disparity between the current and optimal states. The plot is 3D, so it can only visualize the loss as a function of two real-valued parameters. There can be thousands of parameters, but two suffice to visualize the meaning of local versus global minima. Some may recall this phenomenon from pre-calculus or analytic geometry.
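
As a hypothetical illustration (the surface formula, constants, and labels below are invented for this sketch, not taken from the graph in this answer), a two-parameter loss surface with one local and one deeper global minimum can be plotted like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# A made-up two-parameter loss surface: a quartic in w1 gives a local minimum
# near w1 = -1.2 and a deeper global minimum near w1 = 3.1; w2 is a simple bowl.
w1, w2 = np.meshgrid(np.linspace(-3, 4, 200), np.linspace(-3, 3, 200))
loss = (0.1 * w1**4 - 0.3 * w1**3 - 0.6 * w1**2 + 0.5 * w1) + w2**2

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(w1, w2, loss, cmap="viridis")
ax.set_xlabel("parameter 1")
ax.set_ylabel("parameter 2")
ax.set_zlabel("loss")
plt.show()
```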

If pure gradient descent is used with an initial value to the far left, it would locate the local minimum in the loss surface first. It would not generally climb the slope to test the loss value at the global minimum further to the right: in most cases the gradient causes the next iteration of the learning algorithm to travel downhill, hence the term gradient descent.
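
As a minimal sketch of this behavior, here is plain gradient descent on the w2 = 0 slice of the hypothetical surface above (all constants are illustrative, not from the original answer). Started on the wrong side of the ridge, it settles into the local minimum:

```python
# Same hypothetical one-dimensional loss as the w2 = 0 slice above:
# a local minimum near x = -1.2 and a deeper global minimum near x = 3.1.
def loss(x):
    return 0.1 * x**4 - 0.3 * x**3 - 0.6 * x**2 + 0.5 * x

def grad(x):
    return 0.4 * x**3 - 0.9 * x**2 - 1.2 * x + 0.5

def gradient_descent(x0, lr=0.05, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # always step downhill along the gradient
    return x

print(gradient_descent(-2.0))  # ~ -1.2: stuck in the local minimum
print(gradient_descent(1.0))   # ~  3.1: happens to reach the global minimum
```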

Even this is a simple case, beyond the limitation of visualizing only two parameters: a real loss surface can contain thousands of local minima.

[Figure: error surface showing how the global optimum can be missed]

There are many strategic approaches to improving the speed and reliability of the search for the global minimum loss. These are just a few:

  • Descent using the gradient or, more exactly, the Jacobian
  • Possibly further terms of the multidimensional Taylor series, the next in line being related to curvature, the Hessian
  • Injection of noise, which at an intuitive (and somewhat oversimplified) level is like shaking the graph so that a ball representing the current learning state might bounce over a ridge or peak and land, essentially by trial and error, in the global minimum; simulated annealing is a materials-science analogy that simulates the injection of Brownian (thermal) motion (see the sketch after this list)
  • Searches from more than one starting position
  • Parallelism, with or without an intelligent control strategy, to try multiple initial learning states and hyper-parameter settings
  • Models of the surface, based on past learning or theoretically known principles, so that the global minimum can be projected as trials are used to tune the parameters of the model
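
As a rough sketch of the noise-injection idea (using the same hypothetical toy loss as before; the proposal width, temperature, and cooling rate are illustrative assumptions), simulated annealing occasionally accepts uphill moves, which lets the search escape the local minimum that traps plain gradient descent:

```python
import math
import random

def loss(x):  # same hypothetical toy loss as in the earlier sketch
    return 0.1 * x**4 - 0.3 * x**3 - 0.6 * x**2 + 0.5 * x

def simulated_annealing(x0, temp=2.0, cooling=0.995, steps=2000, seed=0):
    rng = random.Random(seed)
    x = best = x0
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 0.5)  # noisy proposal: "shaking the graph"
        delta = loss(candidate) - loss(x)
        # Downhill moves are always accepted; uphill moves are accepted with a
        # temperature-dependent probability (the Metropolis criterion).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            x = candidate
        if loss(x) < loss(best):
            best = x
        temp *= cooling  # gradually cool the simulated "thermal" motion
    return best

print(simulated_annealing(-2.0))  # typically ~ 3.1, the global minimum
```

Searching from more than one starting position, the fourth bullet, amounts to running either routine from several initial values and keeping the best result.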

Interestingly, the only ways to prove that the optimal state has been found are to try every possibility, checking each one, which is far from feasible in most cases, or to rely on a model from which the global optimum can be determined. Most theoretical frameworks target a particular accuracy, reliability, speed, and minimal quantity of input information as part of an AI project, so that no such exhaustive search or model perfection is required.
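
To make the scaling problem concrete, here is a hypothetical brute-force search over a small discretized parameter grid (the loss and grid sizes are invented for illustration). It provably finds the optimum of that grid, but the number of evaluations grows exponentially with the number of parameters:

```python
from itertools import product

def toy_loss(params):  # hypothetical discrete loss; illustrative only
    return sum((v - 42) ** 2 for v in params)

values_per_param = 100
n_params = 2  # already 100**2 = 10,000 evaluations; thousands of
              # parameters would make exhaustive search hopeless

grid = product(range(values_per_param), repeat=n_params)
best = min(grid, key=toy_loss)
print(best)  # (42, 42): provably optimal, but only on this tiny grid
```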

In practical terms, for example, an automated vehicle control system is adequate when unit testing, alpha functional testing, and eventually beta functional testing all indicate a lower incidence of both injury and loss of property than when humans drive. It is statistical quality assurance, as in most service and manufacturing businesses.

The graph above was developed for another answer, which has additional information for those interested.

— Douglas Daseeco

As mentioned in the abstract of one of these papers, a bad local minimum is a suboptimal local minimum, that is, a local minimum that is near a global minimum.

— OmG