In one of the simplest forms of supervised machine learning, the computer learns a linear relationship between a single input and its corresponding continuous-valued output. The training data consists of $m$ input examples $\{ x^{(1)}, x^{(2)}, \dots, x^{(m)} \}$ and their corresponding output values $\{ y^{(1)}, y^{(2)}, \dots, y^{(m)} \}$. Simple linear regression means finding the line that “best fits” the training examples, that is, the line for which the sum of the distances from each example to the line is minimized. This line is a model that can then be used to generate outputs for new inputs.

The equation of any line can be written as $y(x) = \theta_0 + \theta_1x$, where $\theta_1$ is the slope of the line and $\theta_0$ is its y-intercept. The actual distance from a point $(x^{(i)}, y^{(i)})$ to a line is fairly complex. Fortunately, we can work with a simpler measure instead: half the square of the distance's vertical component, $\frac{1}{2}(y(x^{(i)}) - y^{(i)})^2$. (You’ll see why in a moment.) For a line specified by $\theta_0$ and $\theta_1$, we want to minimize:

$$\frac{1}{2}\sum_{i=1}^m{(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2}$$

Such a function is called the cost or loss function. Its value is minimized for the best-fit model. How do we find $\theta_0$ and $\theta_1$ that minimize this function? By finding the zeroes of its partial derivatives with respect to those variables.
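
To make the cost function concrete, here is a minimal Python sketch. The function names `predict` and `cost` and the four training points are illustrative assumptions, not part of the original text; the points were chosen to lie exactly on the line $y = 1 + 2x$ so the result is easy to check by hand.

```python
def predict(theta0, theta1, x):
    """Return the line's output y(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Half the sum of squared vertical distances from the points to the line."""
    return 0.5 * sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))


# Made-up training data lying exactly on y = 1 + 2x (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect_fit = cost(1.0, 2.0, xs, ys)  # the line passes through every point
flat_line = cost(0.0, 0.0, xs, ys)    # the line y = 0 misses every point

print(perfect_fit)  # 0.0
print(flat_line)    # 42.0
```

The line that passes through every point has a cost of zero, while a poorly chosen line has a much larger cost. Finding the best fit means searching for the pair $\theta_0$, $\theta_1$ that makes this number as small as possible.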

(We used half the square of the vertical distance precisely to simplify the partial derivative equations: differentiating the square produces a factor of 2, which the $\frac{1}{2}$ cancels. The term “partial derivative” may sound daunting, but it just measures how rapidly the function changes as one variable changes while the other variable is held constant. At the point where both partial derivatives are zero, a small change in either variable leaves the function’s value essentially unchanged. Because this cost function is bowl-shaped (convex), that point is its minimum.)

The partial derivative with respect to $\theta_1$ is simply

$$\sum_{i=1}^m{(\theta_0 + \theta_1x^{(i)} - y^{(i)})x^{(i)}}$$

The partial derivative with respect to $\theta_0$ is simply

$$\sum_{i=1}^m{(\theta_0 + \theta_1x^{(i)} - y^{(i)})}$$
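
As a rough check on the two partial derivatives, the sketch below computes each one analytically. The $\theta_0$ derivative is the sum of the residuals $\theta_0 + \theta_1x^{(i)} - y^{(i)}$, and the $\theta_1$ derivative is the sum of the residuals each multiplied by $x^{(i)}$. The code then compares both against a central finite-difference estimate. All function names, the sample data, and the step size `h` are assumptions made for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Half the sum of squared vertical distances from the points to the line."""
    return 0.5 * sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))


def partial_theta0(theta0, theta1, xs, ys):
    """Analytic partial derivative of the cost with respect to theta0."""
    return sum(theta0 + theta1 * x - y for x, y in zip(xs, ys))


def partial_theta1(theta0, theta1, xs, ys):
    """Analytic partial derivative of the cost with respect to theta1."""
    return sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys))


def numeric_partial(f, thetas, index, xs, ys, h=1e-6):
    """Central finite difference: nudge one parameter, hold the other constant."""
    up = list(thetas)
    down = list(thetas)
    up[index] += h
    down[index] -= h
    return (f(*up, xs, ys) - f(*down, xs, ys)) / (2 * h)


# Made-up training data and an arbitrary (non-optimal) line, for illustration.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta = (0.5, 1.0)

analytic = (partial_theta0(*theta, xs, ys), partial_theta1(*theta, xs, ys))
numeric = (numeric_partial(cost, theta, 0, xs, ys),
           numeric_partial(cost, theta, 1, xs, ys))

print(analytic)  # (-8.0, -17.0)
print(numeric)   # approximately the same two values
```

Both derivatives are negative at this particular line, which says that increasing either $\theta_0$ or $\theta_1$ slightly would lower the cost.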

For simple linear regression, it is possible to set these equations to zero and solve for $\theta_0$ and $\theta_1$. To learn more complex models, we must take a different approach called gradient descent.
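
For reference, setting both partial derivatives to zero and solving gives the standard closed-form result $\theta_1 = \frac{\sum_{i=1}^m (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^m (x^{(i)} - \bar{x})^2}$ and $\theta_0 = \bar{y} - \theta_1\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the means of the inputs and outputs. The sketch below is one possible implementation, not the only one; the function name `fit_line` and the data are assumptions chosen for illustration. It also confirms that both partial derivatives vanish at the fitted line.

```python
def fit_line(xs, ys):
    """Return (theta0, theta1) that minimize half the sum of squared errors."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    theta1 = sxy / sxx
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


# Made-up, slightly noisy training data (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

theta0, theta1 = fit_line(xs, ys)
print(theta0, theta1)  # approximately 1.06 and 1.97

# At the best-fit line, both partial derivatives should be (numerically) zero.
grad0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys))
grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys))
```

Because the answer comes from a formula rather than a search, this works only while the equations can be solved directly, which is the limitation that motivates gradient descent for more complex models.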

# Summary

Each of the following is a different way of expressing the same thing:

1. the line that “best fits” the training examples
2. the line such that the sum of the distances from each example to the line is minimized
3. the y-intercept $$\theta_0$$ and slope $$\theta_1$$ such that the sum of the distances from each example to $$y(x) = \theta_0 + \theta_1x$$ is minimized
4. the y-intercept $$\theta_0$$ and slope $$\theta_1$$ such that the sum of the distances from $$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$$ to $$y(x) = \theta_0 + \theta_1x$$ is minimized
5. $$\theta_0$$ and $$\theta_1$$ such that the sum of the vertical distances from $$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$$ to $$y(x) = \theta_0 + \theta_1x$$ is minimized
6. $$\theta_0$$ and $$\theta_1$$ such that $$\sum_{i=1}^m{\left|y(x^{(i)}) - y^{(i)}\right|}$$ is minimized
7. $$\theta_0$$ and $$\theta_1$$ such that $$\sum_{i=1}^m{\frac{1}{2}(y(x^{(i)}) - y^{(i)})^2}$$ is minimized
8. $$\theta_0$$ and $$\theta_1$$ such that $$\frac{1}{2}\sum_{i=1}^m{(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2}$$ is minimized
9. $$\theta_0$$ and $$\theta_1$$ such that $$\frac{\partial}{\partial \theta_0}\frac{1}{2}\sum_{i=1}^m{(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2} = 0$$ and $$\frac{\partial}{\partial \theta_1}\frac{1}{2}\sum_{i=1}^m{(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2} = 0$$
10. $$\theta_0$$ and $$\theta_1$$ such that $$\sum_{i=1}^m{(\theta_0 + \theta_1x^{(i)} - y^{(i)})} = 0$$ and $$\sum_{i=1}^m{(\theta_0 + \theta_1x^{(i)} - y^{(i)})x^{(i)}} = 0$$