2.6. Back Propagation
Backpropagation is the second step of learning. It consists of injecting the error made during the forward propagation phase (the error in the predictions, since the parameters are not fully trained yet) back into the network in the reverse direction (from the output layer to the input layer) and updating its parameters so that it performs better on the next iteration.
Today, the backpropagation algorithm is the workhorse of learning in neural networks. At the heart of backpropagation is an expression for the partial derivative of the cost function \(J(W, b)\) with respect to weight \(W\) (or bias \(b\)) in the network. The expression tells us how quickly the cost changes when we change the weights and biases.
Hence, the cost function \(J(W, b)\) needs to be optimized, and this optimization is usually performed with a descent method.
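For reference, a typical gradient descent update uses these partial derivatives as follows (the learning rate \(\alpha\) is introduced here only for illustration; the optimizer itself is discussed elsewhere):

\[
W_l \leftarrow W_l - \alpha \frac{\partial J(W, b)}{\partial W_l}, \qquad
b_l \leftarrow b_l - \alpha \frac{\partial J(W, b)}{\partial b_l}
\]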
Derivation using the chain rule
Note: \(\odot\) denotes element-wise multiplication (also called the Hadamard product).
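For example,

\[
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \odot \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 5 & 12 \\ 21 & 32 \end{pmatrix}
\]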
Since we need to calculate the partial derivative of the cost function \(J(W, b)\) (link to previous chapter) with respect to a weight \(W\) (or bias \(b\)), we can do so using the chain rule.
Backpropagation can be summarized by four equations (in vectorized form), using the notation presented in terminologies part-1 (link to previous chapter).
Note: For any layer \(l\) (\(l = 1, 2, ..., L\)), we call the backpropagation error \(\delta_l\) of that layer the partial derivative of the cost function \(J(W, b)\) with respect to the weighted sum of that layer, \(z_l\). That is:

\[
\delta_l = \frac{\partial J(W, b)}{\partial z_l}
\]
Output layer Backpropagation Error \((\delta_L)\)
For the output layer (\(l=L\)), we know from forward propagation (link to previous chapter) that

\[
a_L = f(z_L)
\]

where \(f\) is the activation function and \(f'\) its derivative, and the cost function \(J(W,b)\) is written using \(a_L\) and \(y\). So, using the chain rule,

\[
\delta_L = \frac{\partial J(W, b)}{\partial z_L}
        = \frac{\partial J(W, b)}{\partial a_L} \odot \frac{\partial a_L}{\partial z_L}
        = \frac{\partial J(W, b)}{\partial a_L} \odot f'(z_L)
\]
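As a minimal sketch, assuming a mean squared error cost \(J = \frac{1}{2m}\sum \|a_L - y\|^2\) and a sigmoid output activation (both are assumptions for illustration; the actual cost and activation come from the previous chapters), the output-layer error could be computed as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# z_L, a_L, y all have shape (m, h_L): one row per sample.
# For J = (1/2m) * sum((a_L - y)**2), the derivative dJ/da_L is (a_L - y) / m.
def output_delta(z_L, a_L, y):
    m = y.shape[0]
    dJ_da = (a_L - y) / m              # partial derivative of the cost w.r.t. a_L
    return dJ_da * sigmoid_prime(z_L)  # element-wise (Hadamard) product
```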
Hidden layer Backpropagation Error \((\delta_l)\)
Now, for the hidden layers (\(l=L-1, L-2, ..., 1\)), we know from forward propagation (link to previous chapter) that

\[
z_{l+1} = a_l W_{l+1} + b_{l+1}
\]

and

\[
a_l = f(z_l)
\]

Therefore, applying the chain rule through \(z_{l+1}\),

\[
\delta_l = \frac{\partial J(W, b)}{\partial z_l}
         = \frac{\partial J(W, b)}{\partial z_{l+1}} \, \frac{\partial z_{l+1}}{\partial a_l} \, \frac{\partial a_l}{\partial z_l}
\]

We have

\[
\frac{\partial z_{l+1}}{\partial a_l} = W_{l+1} \qquad \text{and} \qquad \frac{\partial a_l}{\partial z_l} = f'(z_l)
\]

So, in vectorized form,

\[
\delta_l = \left( \delta_{l+1} W_{l+1}^T \right) \odot f'(z_l)
\]
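A minimal sketch of this recursion, assuming NumPy arrays with one row per sample (the function and argument names are illustrative, not from the chapters):

```python
# delta_next: error delta_{l+1} of layer l+1, shape (m, h_{l+1})
# W_next:     weights W_{l+1} of layer l+1, shape (h_l, h_{l+1})
# fprime_z:   f'(z_l), activation derivative evaluated at layer l, shape (m, h_l)
def hidden_delta(delta_next, W_next, fprime_z):
    # delta_l = (delta_{l+1} W_{l+1}^T) ⊙ f'(z_l)
    return (delta_next @ W_next.T) * fprime_z
```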
Rate of change of the cost with respect to weights \(W_l\)
Now, since \(z_l = a_{l-1} W_l + b_l\), the partial derivative of the cost function \(J(W, b)\) with respect to the weights of that layer, \(W_l\), will be:

\[
\frac{\partial J(W, b)}{\partial W_l}
  = \frac{\partial J(W, b)}{\partial z_l} \, \frac{\partial z_l}{\partial W_l}
  = a_{l-1}^T \, \delta_l
\]
Rate of change of the cost with respect to bias \(b_l\)
Finally, the partial derivative of the cost function \(J(W, b)\) with respect to the bias of that layer, \(b_l\), will be:

\[
\frac{\partial J(W, b)}{\partial b_l} = \sum \delta_l
\]
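Expressed with NumPy (again one row per sample; a sketch with illustrative names, not the chapter's exact code), both gradients are one line each:

```python
# a_prev:  activations a_{l-1} of the previous layer, shape (m, h_{l-1})
# delta_l: error delta_l of layer l, shape (m, h_l)
def layer_gradients(a_prev, delta_l):
    dW = a_prev.T @ delta_l                    # dJ/dW_l, shape (h_{l-1}, h_l), same as W_l
    db = delta_l.sum(axis=0, keepdims=True).T  # dJ/db_l = sum of delta_l over samples, shape (h_l, 1), same as b_l
    return dW, db
```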
Equations summary
Output layer Backpropagation error (\(\delta_L\))

\[
\delta_L = \frac{\partial J(W, b)}{\partial a_L} \odot f'(z_L)
\]

Hidden layer Backpropagation error (\(\delta_l\))

\[
\delta_l = \left( \delta_{l+1} W_{l+1}^T \right) \odot f'(z_l)
\]

Rate of change of the cost with respect to weights \(W_l\)

\[
\frac{\partial J(W, b)}{\partial W_l} = a_{l-1}^T \, \delta_l
\]

Rate of change of the cost with respect to bias \(b_l\)

\[
\frac{\partial J(W, b)}{\partial b_l} = \sum \delta_l
\]
where \(\sum \delta_l\) denotes the summation of \(\delta_l\) over all the samples. The size of \(\delta_l\) is \((m, h_l)\), so we sum along each column (i.e. the sum of column 1, then the sum of column 2, and so on) to get a vector of size \((h_l, 1)\), which is the same size as \(b_l\).
Also, \(Q^T\) denotes the transpose of any matrix \(Q\). That’s it! We now use these gradients to update the parameters: the weights \(W\) and the biases \(b\).
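Putting the four equations together, here is a minimal sketch of one full backward pass for a network with sigmoid activations and a mean squared error cost (the layer sizes, activation, cost, and function names are assumptions for illustration; the actual choices come from the previous chapters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(X, weights, biases):
    """Forward pass: store z_l and a_l for every layer (a_0 = X)."""
    a, zs, activations = X, [], [X]
    for W, b in zip(weights, biases):
        z = a @ W + b.T          # z_l = a_{l-1} W_l + b_l, shape (m, h_l)
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def backward(zs, activations, y, weights):
    """Backward pass: return dJ/dW_l and dJ/db_l for every layer."""
    m = y.shape[0]
    grads_W, grads_b = [], []

    # Output layer: delta_L = dJ/da_L ⊙ f'(z_L), with J = (1/2m) * sum((a_L - y)**2)
    delta = ((activations[-1] - y) / m) * sigmoid_prime(zs[-1])

    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, activations[l].T @ delta)            # a_{l-1}^T delta_l
        grads_b.insert(0, delta.sum(axis=0, keepdims=True).T)  # sum of delta_l over samples
        if l > 0:
            # Hidden layer: delta_l = (delta_{l+1} W_{l+1}^T) ⊙ f'(z_l)
            delta = (delta @ weights[l].T) * sigmoid_prime(zs[l - 1])
    return grads_W, grads_b

# Tiny usage example with made-up shapes: a 2 -> 3 -> 1 network, 5 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=(5, 1))
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros((3, 1)), np.zeros((1, 1))]

zs, activations = forward(X, weights, biases)
grads_W, grads_b = backward(zs, activations, y, weights)
print([g.shape for g in grads_W])  # [(2, 3), (3, 1)] — same shapes as the weights
print([g.shape for g in grads_b])  # [(3, 1), (1, 1)] — same shapes as the biases
```

Each gradient has the same shape as the parameter it corresponds to, which is exactly what the update rule shown at the beginning of this section requires.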