# Simple Linear Regression

In one of the simplest forms of
supervised machine learning, the
computer learns a linear relationship between a single input and its
corresponding continuous-valued output. The training data consists of \(m\)
input examples \(\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}\) and their corresponding output values
\(\{y^{(1)}, y^{(2)}, \ldots, y^{(m)}\}\). **Simple linear
regression** means
finding the line that “best fits” the training examples—such that the *sum of
the distances from each example to the line is minimized.* This line is a
**model** that can then be used to generate outputs for new inputs.

The equation of any line can
be written as \(y(x) = \theta_0 + \theta_1x\) where \(\theta_1\) is the slope
of the line and \(\theta_0\) is its y-intercept. The actual distance from a
point \((x^{(i)}, y^{(i)})\) to a line is fairly complex.
Fortunately, we can minimize this distance by minimizing *half the square of
its vertical component* \(\frac{1}{2}(y(x^{(i)}) - y^{(i)})^2\). (You’ll see why in a
moment.) For a line specified by \(\theta_0\) and \(\theta_1\), we want to
minimize:

\[J(\theta_0, \theta_1) = \frac{1}{2}\sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2\]

Such a function is called the **cost** or
**loss** function. Its value is minimized for the best-fit model.
How do we find \(\theta_0\) and \(\theta_1\) that minimize this function? By finding the zeroes of its partial
derivatives
with respect to those variables.
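As a concrete sketch of the cost function itself (in Python with NumPy; the data values here are made up purely for illustration), we can compute \(J(\theta_0, \theta_1)\) directly from its definition and see that a line close to the data costs less than one far from it:

```python
import numpy as np

# Toy training data (illustrative values only, not from the text)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def cost(theta0, theta1, x, y):
    """Half the sum of squared vertical distances from the line to the data."""
    residuals = theta0 + theta1 * x - y
    return 0.5 * np.sum(residuals ** 2)

print(cost(0.0, 2.0, x, y))  # a near-fit line gives a small cost
print(cost(0.0, 0.0, x, y))  # a poor fit gives a much larger cost
```

Trying a few \((\theta_0, \theta_1)\) pairs by hand like this is exactly the search that minimization automates.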

(We used half the square of the vertical distance precisely to simplify the partial derivative equations. The term “partial derivative” may sound daunting, but it just measures how rapidly the function changes as one variable varies while the other is held constant. At a point where both partial derivatives are zero, neither variable is affecting the function’s value; because this cost function is convex, that point is its minimum.)
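We can watch this definition in action numerically (a small sketch with made-up data, not part of the original text): hold \(\theta_1\) constant, nudge \(\theta_0\) slightly in both directions, and the resulting rate of change of the cost approximates the partial derivative with respect to \(\theta_0\):

```python
import numpy as np

# Toy training data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def cost(theta0, theta1):
    residuals = theta0 + theta1 * x - y
    return 0.5 * np.sum(residuals ** 2)

# Hold theta1 constant and nudge theta0: the rate of change of the cost
# per unit change in theta0 approximates dJ/dtheta0 at this point.
eps = 1e-6
theta0, theta1 = 0.5, 1.5
rate = (cost(theta0 + eps, theta1) - cost(theta0 - eps, theta1)) / (2 * eps)
print(rate)  # negative here: increasing theta0 would decrease the cost
```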

The partial derivative with respect to \(\theta_1\) is simply

\[\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} = \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})x^{(i)}\]

The partial derivative with respect to \(\theta_0\) is simply

\[\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} = \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})\]

For simple linear regression, it is possible to set these equations to zero and
solve for \(\theta_0\) and \(\theta_1\). To learn more complex models, we must
take a different approach called *gradient descent*.
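Setting the two partial derivatives to zero yields two linear equations in the two unknowns \(\theta_0\) and \(\theta_1\), which can be solved directly. A sketch of that solve in Python with NumPy, using made-up data:

```python
import numpy as np

# Toy training data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
m = len(x)

# Setting the partial derivatives to zero gives a 2x2 linear system:
#   m*theta0      + sum(x)*theta1    = sum(y)
#   sum(x)*theta0 + sum(x**2)*theta1 = sum(x*y)
A = np.array([[m, x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (x * y).sum()])
theta0, theta1 = np.linalg.solve(A, b)
print(theta0, theta1)  # intercept and slope of the best-fit line

# Sanity check: at the solution, both partial derivatives are (numerically) zero.
residuals = theta0 + theta1 * x - y
print(residuals.sum(), (residuals * x).sum())  # both very close to 0
```

The same recipe written in matrix form generalizes to multiple inputs; gradient descent becomes necessary when a direct solve like this is impractical.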
