When you program a computer to use gradient descent, you must choose a learning rate. If the learning rate is too small, many iterations may be required before the model converges. If it is too large, the steps may overshoot the minimum and the model may never converge. To illustrate this, we use gradient descent to find the minimum of a very simple function, \(f(x) = x^2\).

The blue line is the graph of the function. Following step 1, we start by guessing that the minimum is at \(x = 8\). We calculate the gradient \(\frac{\partial f(x)}{\partial x} = 2x\) and then adjust our guess by subtracting the product of the gradient and the learning rate. We repeat these adjustments until the gradient becomes very close to zero.
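A minimal sketch of this loop might look like the following (the stopping threshold of 1e-6 and the 10,000-step cap are assumptions added for illustration, not part of the demo):

```python
def gradient_descent(learning_rate, x=8.0, tolerance=1e-6, max_steps=10_000):
    """Minimize f(x) = x**2 by repeated gradient steps, starting from the guess x = 8."""
    history = [x]                          # every value of x visited so far
    for _ in range(max_steps):
        gradient = 2 * x                   # derivative of x**2
        if abs(gradient) < tolerance:      # stop once the gradient is very close to zero
            break
        x = x - learning_rate * gradient   # step in the direction that lowers f
        history.append(x)
    return x, history
```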

Each adjustment produces a new value of \(x\). The black line traces the value of the function at each of these points.

Move the slider to adjust the learning rate and see how long the model takes to converge, if it converges at all.
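You can reproduce the same effect without the slider by running the sketch above with a few learning rates (these particular values are arbitrary choices, not taken from the demo):

```python
for lr in (0.01, 0.1, 0.9, 1.1):
    x_final, history = gradient_descent(lr)
    print(f"learning rate {lr}: {len(history) - 1} steps, final x = {x_final}")
```

With a small rate such as 0.01, hundreds of small steps are needed; moderate rates converge in far fewer; at 1.1 every step overshoots the minimum so badly that \(x\) grows in magnitude and the loop never satisfies the stopping condition.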
