In the last two posts, we looked at implementations of polynomial logistic regression using gradient descent for 30 examples, each with two inputs, $$x_1$$ and $$x_2$$, and each in one of three classes.

The five terms in the feature vector correspond to the constant and the polynomial coefficients for $$x_1$$, $$x_2$$, $$x_1^2$$, and $$x_2^2$$ respectively.

If you looked closely at the Python or Julia code that implemented multi-class logistic regression, you may have noticed some math in the feature vector for each example. The feature vector took each input, subtracted 50, and then divided by 50:


    self.feature_vector = numpy.array([[1], [(x1 - 50) / 50.0], [(x2 - 50) / 50.0], [((x1 - 50) / 50.0) ** 2], [((x2 - 50) / 50.0) ** 2]])

and in the Julia version:

    [[1.0], [(x1 - 50.0) / 50.0], [(x2 - 50.0) / 50.0], [((x1 - 50.0) / 50.0)^2], [((x2 - 50.0) / 50.0)^2]],

The reason for this is that gradient descent converges _faster when all the
features are within a similar range._

The inputs $$x_1$$ and $$x_2$$ were in the range 0--100. Therefore, the
features $$x_1^2$$ and $$x_2^2$$ would have been in the range 0--10,000. If we
had used these without modification, then the cost function would have looked
like a long narrow channel. To properly move across the width of the channel
would have required a very small learning rate, and to find the minimum point
along the length of the channel would have taken many iterations.

By subtracting 50 and dividing by 50, we ensured that all features were in the
range -1 to 1. This is called **feature scaling** and it makes the cost
function look more like a shallow bowl. We can use a much larger learning rate
and find the minimum of the cost function in a much smaller number of
iterations.
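
As a quick check of those ranges, here is a minimal sketch, assuming inputs between 0 and 100 as above. The helper names `raw_features` and `scaled_features` are made up for this illustration and do not come from the earlier posts.

```python
import numpy

def raw_features(x1, x2):
    # Unscaled polynomial features: constant, x1, x2, x1^2, x2^2
    return numpy.array([1.0, x1, x2, x1 ** 2, x2 ** 2])

def scaled_features(x1, x2):
    # Same features after subtracting 50 and dividing by 50
    s1 = (x1 - 50) / 50.0
    s2 = (x2 - 50) / 50.0
    return numpy.array([1.0, s1, s2, s1 ** 2, s2 ** 2])

# Evaluate both versions on a grid of inputs covering 0..100
grid = [(a, b) for a in range(0, 101, 10) for b in range(0, 101, 10)]
raw = numpy.array([raw_features(a, b) for a, b in grid])
scaled = numpy.array([scaled_features(a, b) for a, b in grid])

print("raw min/max per feature:   ", raw.min(axis=0), raw.max(axis=0))
print("scaled min/max per feature:", scaled.min(axis=0), scaled.max(axis=0))
```

The raw squared terms reach 10,000 while every scaled term stays between -1 and 1, which is the difference in range that stretches the cost function into a narrow channel.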

You might be concerned that the $$\theta$$ values you end up with were trained
on scaled examples rather than the raw inputs you started with. However, it is a simple
matter to scale _any_ input in the same way you scaled the examples, and use
these $$\theta$$ values to get the correct classification results.
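
For instance, classifying a new raw input might look like the sketch below. The `theta` values are placeholders made up for illustration, not trained values, and `classify` is a hypothetical helper. It assumes one row of $$\theta$$ per class and picks the class with the largest score, which gives the same answer whether the scores are then passed through a sigmoid or a softmax.

```python
import numpy

def scaled_features(x1, x2):
    # Apply the same scaling that was used on the training examples
    s1 = (x1 - 50) / 50.0
    s2 = (x2 - 50) / 50.0
    return numpy.array([[1.0], [s1], [s2], [s1 ** 2], [s2 ** 2]])

def classify(theta, x1, x2):
    # theta has one row per class and one column per feature (assumed layout)
    scores = theta.dot(scaled_features(x1, x2))
    return int(numpy.argmax(scores))

# Placeholder parameters for three classes (NOT trained values)
theta = numpy.array([
    [0.0, -2.0, -2.0, 0.0, 0.0],
    [0.0,  2.0, -2.0, 0.0, 0.0],
    [0.0,  0.0,  2.0, 0.0, 0.0],
])

# Raw inputs go in; the scaling happens inside classify
print(classify(theta, 10, 10))
print(classify(theta, 90, 10))
print(classify(theta, 50, 90))
```

The caller never sees scaled values: the scaling lives in one function, so training and prediction cannot drift apart.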

Try replacing one of the features with the unmodified input, and then run the
program again. You will see that the cost explodes---the model diverges. You
will have to reduce the learning rate, possibly by several orders of magnitude.
By the time you get the model to converge, it will require many more iterations
to do so.
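
If you do not have the earlier program at hand, the sketch below is a small stand-in for the experiment, not the code from those posts. To stay short it assumes a two-class model, made-up grid data, and a learning rate of 1.0 chosen purely for illustration; `features` and `train` are hypothetical names.

```python
import numpy

def features(x1, x2, scale_x1=True):
    # Five polynomial features; optionally leave the x1 term unscaled
    s1 = (x1 - 50) / 50.0
    s2 = (x2 - 50) / 50.0
    first = s1 if scale_x1 else x1 * 1.0
    return numpy.column_stack([numpy.ones_like(s1), first, s2, s1 ** 2, s2 ** 2])

def train(X, y, learning_rate, iterations):
    # Batch gradient descent on the mean logistic cost; returns the cost history
    theta = numpy.zeros(X.shape[1])
    costs = []
    for _ in range(iterations):
        z = X.dot(theta)
        # log(1 + e^z) - y*z, computed in an overflow-safe way
        costs.append(numpy.mean(numpy.logaddexp(0, z) - y * z))
        p = 0.5 * (1 + numpy.tanh(z / 2))  # numerically stable sigmoid
        theta -= learning_rate * X.T.dot(p - y) / len(y)
    return costs

# Made-up data: a grid of inputs in 0..100, class 1 when x1 > 50
x1, x2 = numpy.meshgrid(numpy.arange(0, 101, 10), numpy.arange(0, 101, 10))
x1, x2 = x1.ravel(), x2.ravel()
y = (x1 > 50).astype(float)

scaled_costs = train(features(x1, x2, scale_x1=True), y, 1.0, 200)
raw_costs = train(features(x1, x2, scale_x1=False), y, 1.0, 200)

print("scaled: first cost %.3f, last cost %.3f" % (scaled_costs[0], scaled_costs[-1]))
print("raw x1: first cost %.3f, largest cost %.3f" % (raw_costs[0], max(raw_costs)))
```

With all features scaled, the cost falls from its starting value. With the raw $$x_1$$ term and the same learning rate, the cost jumps far above its starting value on the first update.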