Gradient Descent
For simple linear regression, it is possible to minimize the cost function by setting its partial derivative equations to zero and then solving for the variables \(\theta_0\) and \(\theta_1\).
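As a sketch of what that looks like, assume the common mean-squared-error cost over \(m\) training examples \((x_i, y_i)\) with the line \(\theta_0 + \theta_1 x\). The symbols \(m\), \(x_i\), \(y_i\), \(\bar{x}\), \(\bar{y}\) and the \(\frac{1}{2m}\) scaling are assumptions made here for illustration; the exact form used elsewhere in this text may differ.

\[
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\theta_0 + \theta_1 x_i - y_i\right)^2
\]

\[
\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left(\theta_0 + \theta_1 x_i - y_i\right) = 0,
\qquad
\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left(\theta_0 + \theta_1 x_i - y_i\right) x_i = 0
\]

Solving these two equations together, with \(\bar{x}\) and \(\bar{y}\) denoting the means of the inputs and outputs, gives

\[
\theta_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2},
\qquad
\theta_0 = \bar{y} - \theta_1 \bar{x}
\]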
In modern machine learning, the computer learns vastly more complex, nonlinear models. (You’ll encounter them later.) These models sometimes describe the relationships between millions of inputs and thousands of outputs. To learn such models, we can still define a cost function and find its partial derivative equations. However, setting them to zero and solving for the variables may be impossible or excessively time-consuming.
The alternative method to minimize the cost function is gradient descent.

1. We start with some (possibly random) values for the variables.

2. We calculate the partial derivatives of the cost function at those values. These values are called gradients because each is the rate of change (slope of the tangent) of the cost function with respect to one variable.

3. We adjust each variable value by an amount directly proportional to its corresponding gradient, moving in the direction that decreases the cost: we subtract the gradient multiplied by a constant of proportionality. (Of course, if the gradient is zero, then the variable is already “doing its part” to minimize the cost function, and we need not change its value.) The constant of proportionality is called the learning rate.

4. We repeat steps 2 and 3 until all the gradients are zero (or very close to zero). When this happens, we loosely say that “the model has converged.” This really means we have arrived at a set of values that minimize the cost function for the training set (for complex models, possibly only a local minimum).
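The steps above can be sketched in Python for simple linear regression. This is a minimal illustration, not a definitive implementation: it assumes a mean-squared-error cost scaled by \(\frac{1}{2m}\), and the function names, the learning rate of 0.1, the stopping tolerance, and the toy data set are all hypothetical choices made for this example.

```python
def gradients(theta0, theta1, xs, ys):
    """Partial derivatives of the (assumed) mean-squared-error cost
    J = 1/(2m) * sum((theta0 + theta1*x - y)^2) at the current values."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    return grad0, grad1


def gradient_descent(xs, ys, learning_rate=0.1, tolerance=1e-9,
                     max_steps=100_000):
    # Step 1: start with some values for the variables (zeros here).
    theta0, theta1 = 0.0, 0.0
    for _ in range(max_steps):
        # Step 2: calculate the gradients at the current values.
        grad0, grad1 = gradients(theta0, theta1, xs, ys)
        # Step 4: stop once every gradient is very close to zero.
        if abs(grad0) < tolerance and abs(grad1) < tolerance:
            break
        # Step 3: move each variable against its gradient,
        # scaled by the learning rate.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Toy data generated from the line y = 1 + 2x (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)
```

Because the toy data lie exactly on the line \(y = 1 + 2x\), the result should approach \(\theta_0 \approx 1\) and \(\theta_1 \approx 2\). A learning rate that is too large for the data can make the updates overshoot and diverge, so the `max_steps` cap guards against an endless loop.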
If you tried to minimize the linear regression cost function interactively, you already have a good sense for gradient descent. You may have moved the sliders rapidly until the line came close to the data set. During this time, the cost function decreased rapidly. Then, you could make fine adjustments, paying close attention to the value of \(J(\theta_0, \theta_1)\).