The logistic (or sigmoid) function maps any real-valued input to a value between 0 and 1. This property makes it a suitable activation function for the output layer of a deep neural network performing a classification task: its value can be interpreted as the probability that the input belongs to a given class.

One problem with the logistic function is that the slope of its tangent approaches zero for inputs of large magnitude, whether positive or negative. If, during initialization or training, a neuron’s parameters $\vec{\theta}$ happen to generate large values as input to its logistic activation function, then, by the chain rule, the partial derivatives of the loss with respect to those parameters will be near zero. During stochastic gradient descent, the changes to the parameters are directly proportional to the partial derivatives. If the partial derivatives are near zero, then the parameters get “frozen” and the neuron stops learning. This is known as the vanishing gradient problem.
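To make the effect concrete, here is a small sketch (plain Python, standard library only; the function names are illustrative) that evaluates the derivative of the logistic function at increasingly large inputs and shows how quickly it shrinks toward zero:

```python
import math

def logistic(z):
    # sigma(z) = 1 / (1 + e^(-z)), always between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def logistic_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)); maximal (0.25) at z = 0
    s = logistic(z)
    return s * (1.0 - s)

# The derivative shrinks rapidly as |z| grows, which is the source
# of the vanishing gradient problem described above.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigma'(z) = {logistic_derivative(z):.6f}")
```

At $z = 10$ the slope is already on the order of $10^{-5}$, so any gradient flowing through such a neuron is multiplied by a factor near zero.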

The rectified linear function is an alternative activation function that is suitable for the hidden layers of a deep neural network. Its equation is

$$f(z) = \max(0, z).$$

It should be clear that the output of this function is exactly

• the same as its input when the input is greater than zero, and
• 0 otherwise.

The derivative of this function (the slope of its tangent) is exactly

• 1 when its input is greater than zero and
• 0 otherwise.
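The function and its derivative can both be written in a line or two of code. The sketch below (plain Python; the function names are illustrative) mirrors the two case distinctions above:

```python
def relu(z):
    # max(0, z): passes positive inputs through unchanged, clamps negatives to 0
    return z if z > 0 else 0.0

def relu_derivative(z):
    # slope of the tangent: 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0
```

Note that the derivative is either exactly 1 or exactly 0; unlike the logistic function, it never takes small intermediate values that would shrink the gradient.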

A hidden-layer neuron with this activation function avoids the vanishing gradient problem as long as its parameters generate positive pre-activation values. If the parameters come to generate only negative values, however, the neuron outputs 0 for every input and its gradient is 0, so its parameters stop changing and the neuron effectively ignores its inputs. Such inactive neurons contribute to the sparseness of the network: the model is simplified because the contributions of certain features are eliminated, which can help reduce overfitting.
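A toy example may help illustrate this sparseness. The sketch below (plain Python; the layer structure, weights, and function names are all invented for illustration) computes one dense hidden layer with rectified linear activations. Two of the three units receive negative pre-activations, so their outputs are exactly 0 and the layer's activation vector is sparse:

```python
def relu(z):
    # max(0, z)
    return z if z > 0 else 0.0

def hidden_layer(inputs, weights, biases):
    # One dense layer with ReLU activation.
    # weights[j] is the weight vector of hidden unit j (illustrative layout).
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b  # pre-activation
        outputs.append(relu(z))
    return outputs

# Units 2 and 3 get pre-activations of -3 and -2, so they output 0:
acts = hidden_layer([1.0, 2.0],
                    [[0.5, 0.5], [-1.0, -1.0], [2.0, -2.0]],
                    [0.0, 0.0, 0.0])
# acts is [1.5, 0.0, 0.0] -- only one of three units is active
```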

A neuron having this activation function is commonly called a rectified linear unit, or “ReLU.”