A Multilayer Perceptron in Julia
The Julia program below uses a multi-layer perceptron to classify images from the MNIST dataset. The program should run in several minutes and achieve an accuracy just under 97%. It has a few notable differences from the previous program that only used logistic regression.
This program uses the FluxML library. Run Pkg.add("Flux") before running it. The FluxML API is easier to use than TensorFlow's because it does not separate graph construction from execution.
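The snippets in this section assume the Tracker-based API of early Flux (version 0.x), in which param, back!, grad, and update! are available; this is an assumption about the program's environment rather than something stated in the excerpt. A setup along these lines would be needed before the code below:
using Flux                                # Tracker-based Flux 0.x assumed
using Flux.Tracker: back!, grad, update!  # manual gradient access and updates
import NNlib                              # provides NNlib.softmax used below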
This program has tracked parameters featureThetas and featureTheta0 for the feature detection models, as well as outputThetas and outputTheta0 for the classification model.
featureThetas = param([scale(r, 784, 500) for r in rand((784, 500))]) # param(zeros((784, 500)))
featureTheta0 = param([scale(r, 1, 500) for r in rand((1, 500))]) # param(zeros((1, 500)))
outputThetas = param([scale(r, 500, 10) for r in rand((500, 10))]) # param(zeros((500, 10)))
outputTheta0 = param([scale(r, 1, 10) for r in rand((1, 10))]) # param(zeros((1, 10)))
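The scale helper called above is not shown in this excerpt. A minimal sketch, assuming it maps a uniform draw r in [0, 1) into the symmetric range discussed at the end of this section, might look like:
# Hypothetical helper: spread r uniformly over ±sqrt(6 / (nIn + nOut)),
# the range recommended by Glorot and Bengio for a layer with nIn inputs and nOut outputs.
scale(r, nIn, nOut) = (2.0 * r - 1.0) * sqrt(6.0 / (nIn + nOut))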
In FluxML, tracked parameters behave like any other numerical or array type, with one notable difference: when you perform mathematical operations on a tracked parameter, Flux records those operations so that it can later compute gradients of the results with respect to the parameter.
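For example (a small sketch using the Tracker API assumed above), a tracked array records enough information for back!() to fill in its gradient:
x = param([1.0, 2.0, 3.0])   # a tracked array
y = sum(x .^ 2)              # operations on x are recorded as they run
back!(y)                     # backpropagate from y through the recorded operations
grad(x)                      # returns [2.0, 4.0, 6.0], the gradient of y with respect to x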
Feature detection happens with logistic regression models operating on raw inputs:
features(example) = 1.0 ./ (1.0 .+ exp.(-(example * featureThetas .+ featureTheta0))) # elementwise logistic function
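As a quick check of the shapes involved (using a random stand-in for one flattened 28-by-28 MNIST image, which is an assumption about how the program arranges its inputs):
example = rand(1, 784)     # stand-in for one flattened MNIST image
size(features(example))    # (1, 500): one activation per detected feature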
Classification happens with a softmax model operating on features:
model(example) = NNlib.softmax(features(example) * outputThetas .+ outputTheta0)
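The cost minimized during training, referred to below as cross_entropy, is computed elsewhere in the program and is not shown in this excerpt. A minimal sketch, assuming examples holds one flattened image per row and labels holds the matching one-hot rows (both names are assumptions), could be:
predictions = model(examples)                                          # predicted class probabilities
cross_entropy = -sum(labels .* log.(predictions)) / size(examples, 1)  # average cross-entropy over the batch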
During training, the back!() function is called on the computed cost; as a side effect, it accumulates gradients on the tracked parameters. The gradients are then retrieved with calls to grad().
back!(cross_entropy)
update!(featureThetas, -learningRate .* grad(featureThetas))
update!(featureTheta0, -learningRate .* grad(featureTheta0))
update!(outputThetas, -learningRate .* grad(outputThetas))
update!(outputTheta0, -learningRate .* grad(outputTheta0))
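Putting these pieces together, one pass over the training set might look like the following sketch; the names trainingExamples and trainingLabels and the learningRate value are assumptions, and the full program may batch and organize this differently.
learningRate = 0.01                                  # assumed value; tune experimentally
for epoch in 1:10
    predictions = model(trainingExamples)            # forward pass through both layers
    cross_entropy = -sum(trainingLabels .* log.(predictions)) / size(trainingExamples, 1)
    back!(cross_entropy)                             # accumulate gradients on the tracked parameters
    update!(featureThetas, -learningRate .* grad(featureThetas))
    update!(featureTheta0, -learningRate .* grad(featureTheta0))
    update!(outputThetas, -learningRate .* grad(outputThetas))
    update!(outputTheta0, -learningRate .* grad(outputTheta0))
end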
Number of Features
The model implemented by this program detects 500 features of the input. The number 500 is called a hyperparameter of the model (in contrast to the parameters learned during training). The choice of hyperparameter values is a matter of experience and experimentation. Generally, if the number of features is smaller than the number of inputs, then the features are a “compressed” representation of the input.
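One way to make this hyperparameter easy to experiment with is to give it a name and derive the array sizes from it. This refactoring is a suggestion rather than part of the original program; it mirrors the initialization shown earlier:
nInputs, nFeatures, nClasses = 784, 500, 10   # nFeatures is the hyperparameter to vary
featureThetas = param([scale(r, nInputs, nFeatures) for r in rand((nInputs, nFeatures))])
featureTheta0 = param([scale(r, 1, nFeatures) for r in rand((1, nFeatures))])
outputThetas = param([scale(r, nFeatures, nClasses) for r in rand((nFeatures, nClasses))])
outputTheta0 = param([scale(r, 1, nClasses) for r in rand((1, nClasses))])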
Initialization of Thetas
The previous program initialized thetas
and theta0
values to zero. For
multi-layer perceptrons, this is not a good idea. When their parameter values
are exactly the same, the feature detection models are all detecting the same
feature! Their gradients will be the same and they will be updated by the same
amounts, rendering them largely redundant. Researchers have searched for good
initial parameter values, and this program uses random values in a range
recommended by Xavier Glorot and Yoshua Bengio in their
AISTATS 2010 paper, Understanding the difficulty of training deep feedforward
neural networks. To
convince yourself of the value of careful initialization, replace the
random initializations with the calls to zeros
in the comments.
You will notice a decline in accuracy.
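To see why identical initial values are a problem, consider this small sketch (using the Tracker API assumed earlier): with all-zero weights, two feature detectors compute the same output and receive the same gradient, so gradient descent can never pull them apart.
W = param(zeros(784, 2))                    # two feature detectors, identically initialized
b = param(zeros(1, 2))
x = rand(1, 784)                            # stand-in for one flattened input image
h = 1.0 ./ (1.0 .+ exp.(-(x * W .+ b)))     # both features compute the same value
back!(sum(h))
grad(W)[:, 1] == grad(W)[:, 2]              # true: both detectors get the same update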