Many useful problems involve classifying an input into one of several classes, for example:

  • Is an e-mail spam, work-related, or personal?

  • Which numeral between 0 and 9 is shown in an image?

  • Which of several thousand words was uttered in a clip of audio?

You might be tempted to handle such problems using a model whose output is an integer identifying the class. However, a more straightforward approach is to have a logistic model for each class whose output indicates the probability that the input is in that class. For a given input, the model that produces the highest probability corresponds to the predicted class.

It is a simple matter to divide each model's output by the sum of all the outputs, which ensures that the probabilities sum to 1.0. Together, the normalized outputs form a predicted probability distribution across the classes.
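As a rough sketch (in Python with NumPy, with made-up parameter values and class labels), the per-class outputs, the normalization, and the predicted class might look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: one logistic model per class, one row of thetas each.
thetas = np.array([[0.2, -1.3,  0.7],    # class 0: spam
                   [1.1,  0.4, -0.2],    # class 1: work-related
                   [-0.5, 0.9,  0.3]])   # class 2: personal

x = np.array([1.0, 0.5, -1.2])           # one input (its features)

outputs = sigmoid(thetas @ x)            # one output per class
probabilities = outputs / outputs.sum()  # divide by the sum so they add up to 1.0
predicted_class = int(np.argmax(probabilities))  # class with the highest probability
```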

In this case, each training example's target is a probability distribution: a vector with a 1 in the position of the expected class and zeroes elsewhere, conventionally called a one-hot vector.
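For instance, assuming three classes and an example whose expected class is the second one, the one-hot target could be built like this (a sketch, not the book's code):

```python
import numpy as np

num_classes = 3
expected_class = 1              # e.g., "work-related" in the e-mail example

target = np.zeros(num_classes)  # start with all zeroes
target[expected_class] = 1.0    # a single 1 marks the expected class
# target is now array([0., 1., 0.]) -- a one-hot vector
```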

We can use matrices to keep track of the multiple models. Let

\[
\vec{\theta} =
\begin{bmatrix}
\theta_0^{(1)} & \theta_0^{(2)} & \cdots \\
\theta_1^{(1)} & \theta_1^{(2)} & \cdots
\end{bmatrix}
\]

with a column for each class. Then the model equation can be compactly expressed as before:

\[
\hat{\vec{y}} = \frac{1}{1 + e^{-\vec{\theta}^{\mathsf{T}} \vec{x}}}
\]

The matrices and operations will become clearer in the computer code (a brief sketch appears after this list), but

  • \( \vec{\theta}^{\mathsf{T}} \) is a matrix with a row for each class and a column for each input or feature.

  • \( \vec{\theta}^{\mathsf{T}} \vec{x} \) is a matrix multiplication operation, resulting in a row for each class.

  • \( e^{-\vec{\theta}^{\mathsf{T}} \vec{x}} \) is an elementwise operation, resulting in a row for each class.

  • \( 1 \) is a column vector with a 1 for each class, and \( 1 + e^{-\vec{\theta}^{\mathsf{T}} \vec{x}} \) is a matrix addition operation, resulting in a row for each class.

  • The division \( 1 / \bigl(1 + e^{-\vec{\theta}^{\mathsf{T}} \vec{x}}\bigr) \) is also an elementwise operation, resulting in a row for each class.
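Here is a minimal NumPy sketch of the matrix form, assuming a hypothetical three-feature, three-class \( \vec{\theta} \) and a single input column vector; the layout and the numbers are purely illustrative:

```python
import numpy as np

# Hypothetical theta matrix: a row for each input/feature and a column for
# each class, as laid out above.
theta = np.array([[0.2,  1.1, -0.5],
                  [-1.3, 0.4,  0.9],
                  [0.7, -0.2,  0.3]])

x = np.array([[1.0],    # column vector of inputs (one row per feature)
              [0.5],
              [-1.2]])

z = theta.T @ x                   # matrix multiplication: a row for each class
y_hat = 1.0 / (1.0 + np.exp(-z))  # elementwise exp, add, divide: a row per class
```

Each intermediate value (`z`, `np.exp(-z)`, `1.0 + np.exp(-z)`, and `y_hat`) is a column with one row per class, matching the bullet points above.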