Adam Optimizer

The Adam Optimizer is a variation of stochastic gradient descent in which the step size is adapted at each iteration, independently for each weight, according (roughly) to a measure of the signal-to-noise ratio of that weight's gradient over "recent" iterations. Intuitively, the higher the signal-to-noise ratio (SNR) of the gradient in recent iterations, the more stable that gradient is, so the step for that weight can be larger. Conversely, if the SNR is low, the gradient is more erratic, and the step is smaller. The SNR is roughly the inverse (one over) of the coefficient of variation, which itself is a scale-normalized version of the standard deviation. In summary: the higher the scale-normalized variability of the gradient, the lower the SNR, and the smaller the step update for that weight. The name Adam is derived from the term "adaptive moment estimation". There are variations of the weight-update formula, such as Adam with decoupled weight decay, known as AdamW.
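
As a rough illustration of the update described above, here is a minimal sketch of a single Adam step in Python. The function name adam_step and the NumPy-based, whole-array formulation are illustrative assumptions, not a reference implementation; the hyperparameter defaults (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8) follow the values suggested in the referenced paper.

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Exponential moving averages of the gradient (first moment)
        # and of the squared gradient (second moment), per weight.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction for the zero-initialized averages (t starts at 1).
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Per-weight step: large when the averaged gradient is big relative to
        # its recent magnitude (high SNR), small when it is erratic (low SNR).
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

The caller keeps m and v (initialized to zeros with the same shape as w) and the step counter t across iterations; the bias-correction terms compensate for the zero initialization during the first steps.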
Related concepts:
Gradient Descent
External reference:
https://arxiv.org/abs/1412.6980