In the previous posts, we used a simple linear logistic model to perform a computer vision task with 91% accuracy or better. It was linear because it used the raw inputs, not their products or powers. It was logistic, because it composed a linear model with the logistic function to classify the input vector.

One way to improve the accuracy of the program is to assume a more complex model, for example a polynomial logistic model. However, in complex problems like computer vision or speech recognition, there may be hundreds or even millions of input variables. (For example, smartphone cameras routinely capture 2.8 million pixels, each of which is a combination of red, green, and blue values.) For such problems, generating a complete polynomial feature vector (including all possible products of input variables and their powers) is intractable. And this is before even starting to calculate coefficients $$\vec{\theta}$$ for each input and feature.

Multi-layer perceptrons approach the multivariate logistic regression problem by learning the appropriate features at the same time as their coefficients $$\vec{\theta}$$. By “appropriate features” we mean functions of the raw inputs—whether their powers, their products, or for that matter, anything else.

The logistic regression model function used a vector $$\vec{\theta}$$ which contained the coefficients of the input variables. We can illustrate it as follows:

Here is how the multilayer perceptron works: the raw input is “classified” by numerous logistic regression models. Each determines whether the input exhibits a particular feature or not. (Because we expect the input to exhibit multiple features, we use the logistic function, not softmax, for each model.) The parameters of each model determine which feature that model detects.

Those features are then classified by a multiclass logistic regression model. This model determines which distinct class the combination of features is in. (Because we expect the input to be in a single class, we use the softmax function for this classifier. This gives us a probability distribution across classes.) This is illustrated as follows:

Note that the multilayer perceptron is composed of logistic and multiclass logistic models. Therefore the cost function remains differentiable by all parameters. This allows us to use gradient descent to find the parameters that minimize it.