This time I will deal with the learning problem. Feed-forward neural networks operate in two distinct modes. The first is the feed-forward computation, in which the network is presented some input data in the form of a vector and this input is passed through the network to yield an output. I covered this in the first part of this blog series, which can be found here:

Functional Feed-forward Neural Networks, Part 1: Setting it up.

In the second operational mode the network is presented some input along with a desired output – these are called **target** or **training** values – and the goal is to change the parameters of the network to bring the computed output values as close as possible to the target values. In this sense changing the parameters means **learning** the weights (and bias values) of the network.

Perhaps the most fundamental learning algorithm for feed-forward NNs is the so-called **Backpropagation technique**, and I will use it here to solve a simple nonlinear regression learning problem.

### How does Backpropagation work?

The general idea behind Backpropagation is pretty intuitive:

- The network is presented a training input vector (with known target output).
- The input is propagated through the network.
- The actual output is compared to the desired output (target value). The difference between them is the error.
- The error is propagated all the way back to the input layer, and along the way the weights are updated according to their influence on the error.

The error for a particular input vector $n$ is:
$$E_n = \frac{1}{2}\sum_k(y_k(\mathbf{x}_n, \mathbf{w}) - t_{nk})^2$$
The overall error (over all input patterns):
$$E(\mathbf{w}) = \sum_{n=1}^N E_n(\mathbf{w})$$
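To make the two error definitions concrete, here is a minimal sketch in plain Python (chosen only for illustration). The function names `pattern_error` and `total_error` and the sample numbers are my own assumptions, not part of the network code from this series.

```python
def pattern_error(y, t):
    """E_n = 1/2 * sum_k (y_k - t_k)^2 for a single input pattern."""
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))


def total_error(outputs, targets):
    """E = sum_n E_n over all input patterns."""
    return sum(pattern_error(y, t) for y, t in zip(outputs, targets))


# Two made-up patterns with two output units each.
outputs = [[0.5, 1.0], [0.0, 2.0]]
targets = [[1.0, 1.0], [0.0, 0.0]]

print(pattern_error(outputs[0], targets[0]))  # 0.125
print(total_error(outputs, targets))          # 2.125
```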
We are now interested in the partial derivative of the per-pattern error $E_n$ with respect to a weight $w_{ji}$. Applying the chain rule through the weighted input $a_j = \sum_i w_{ji} z_i$ of unit $j$ gives the following equation, where $z_i$ denotes the output (activation) of unit $i$ and $\delta_j = -\partial E_n / \partial a_j$ is the error signal of unit $j$.
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = -\delta_j z_i$$
That leads to the following delta weight rule, $\eta$ denoting the learning rate:
$$\Delta w_{ji} = -\eta \frac{\partial E_n}{\partial w_{ji}} = \eta \delta_j z_i$$
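A quick way to convince yourself of the gradient formula is to compare it with a finite-difference estimate. The sketch below does this for a single output unit, under a few assumptions of mine: the activation $h$ is `tanh`, the error signal is taken as $\delta_j = -\partial E_n / \partial a_j = h'(a_j)(t_j - y_j)$ (which is what the chain-rule equation above implies), and all names and numbers are made up for illustration.

```python
import math


def forward(w, z):
    """Return weighted input a_j and output y_j = tanh(a_j) of one unit."""
    a = sum(wi * zi for wi, zi in zip(w, z))
    return a, math.tanh(a)


def error(w, z, t):
    """E_n = 1/2 * (y - t)^2 for a single output unit."""
    _, y = forward(w, z)
    return 0.5 * (y - t) ** 2


w = [0.3, -0.2]   # weights w_ji into unit j (arbitrary values)
z = [1.0, 0.5]    # activations z_i feeding unit j
t = 0.8           # target value

_, y = forward(w, z)
delta = (1.0 - y ** 2) * (t - y)          # h'(a) * (t - y), tanh' = 1 - tanh^2
analytic = [-delta * zi for zi in z]      # dE_n/dw_ji = -delta_j * z_i

# Central finite differences as an independent estimate of the gradient.
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps
    w_minus = list(w)
    w_minus[i] -= eps
    numeric.append((error(w_plus, z, t) - error(w_minus, z, t)) / (2 * eps))

for g_a, g_n in zip(analytic, numeric):
    print(g_a, g_n)  # the two columns should agree to several digits
```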
with
$$\delta_j = \begin{cases}
h'(a_j)(t_j-y_j) & \text{ if } j \text{ is output neuron} \\
h'(a_j)\sum_{k}w_{kj}\delta_k & \text{ if } j \text{ is hidden neuron}
\end{cases}$$
The following figure (again taken from Bishop's PRML book) illustrates the calculation of the error signal $\delta_j$ for the hidden unit $j$: the $\delta$'s from the units $k$ are propagated back according to the weights $w_{kj}$.
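In code, the two cases for the error signal might look roughly like this. It is only a sketch: the helper names `output_deltas` and `hidden_deltas`, the nested-list weight layout (`w_next[k][j]` standing for $w_{kj}$) and the example values are assumptions of mine, and the output case uses the sign convention $\delta_j = -\partial E_n / \partial a_j$, i.e. $h'(a_j)(t_j - y_j)$.

```python
import math


def tanh_prime(a):
    """Derivative of tanh, used here as h' for the hidden units."""
    return 1.0 - math.tanh(a) ** 2


def identity_prime(a):
    """Derivative of a linear output activation h(a) = a."""
    return 1.0


def output_deltas(a_out, y, t, h_prime):
    """delta_k = h'(a_k) * (t_k - y_k) for every output unit k."""
    return [h_prime(a) * (tk - yk) for a, yk, tk in zip(a_out, y, t)]


def hidden_deltas(a_hidden, w_next, delta_next, h_prime):
    """delta_j = h'(a_j) * sum_k w_kj * delta_k for every hidden unit j.

    w_next[k][j] holds the weight w_kj from hidden unit j to unit k
    in the following layer.
    """
    return [
        h_prime(a_j) * sum(w_next[k][j] * delta_next[k]
                           for k in range(len(delta_next)))
        for j, a_j in enumerate(a_hidden)
    ]


# Two hidden units feeding one linear output unit (made-up numbers).
a_hidden = [0.0, 0.5]
w_next = [[1.0, -2.0]]
d_out = output_deltas([0.2], [0.2], [1.0], identity_prime)
d_hid = hidden_deltas(a_hidden, w_next, d_out, tanh_prime)
print(d_out, d_hid)
```

Note that the hidden deltas only need the deltas of the layer after them, which is why the computation runs from the output layer backwards.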

The weight update rule then looks as follows:
$$w_{ji}^{(t)} = w_{ji}^{(t-1)} + \Delta w_{ji}$$
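Putting the pieces together, the four steps from the list at the top can be sketched as a small online training loop. Everything concrete here is an assumption made for illustration and not the setup used in this series: one input, three `tanh` hidden units, one linear output unit, the target function $\sin(2x)$, the learning rate, and the function name `train`.

```python
import math
import random


def train(samples, n_hidden=3, eta=0.05, epochs=500, seed=0):
    """Online backpropagation for a 1-input, 1-output network.

    Returns the summed error E of every epoch so progress can be inspected.
    """
    rng = random.Random(seed)
    # w1[j] = [weight from the input, bias] of hidden unit j.
    w1 = [[rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
          for _ in range(n_hidden)]
    # w2 = weights from the hidden units to the output, plus a bias weight.
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    history = []
    for _ in range(epochs):
        total = 0.0
        for x, t in samples:
            # Steps 1 and 2: present the input and propagate it forward.
            a = [w[0] * x + w[1] for w in w1]
            z = [math.tanh(a_j) for a_j in a] + [1.0]  # last entry: bias unit
            y = sum(w_k * z_k for w_k, z_k in zip(w2, z))

            # Step 3: compare with the target.
            total += 0.5 * (y - t) ** 2

            # Step 4: propagate the error signals back (before any update).
            delta_out = t - y  # linear output unit, so h'(a) = 1
            delta_hid = [(1.0 - z[j] ** 2) * w2[j] * delta_out
                         for j in range(n_hidden)]

            # Weight update: w <- w + eta * delta_j * z_i
            for j in range(n_hidden + 1):
                w2[j] += eta * delta_out * z[j]
            for j in range(n_hidden):
                w1[j][0] += eta * delta_hid[j] * x
                w1[j][1] += eta * delta_hid[j]  # bias input is 1
        history.append(total)
    return history


samples = [(i / 10.0, math.sin(2 * i / 10.0)) for i in range(-10, 11)]
history = train(samples)
print(history[0], history[-1])  # error in the first and in the last epoch
```

The hidden deltas are computed before the output weights are changed, so both layers are updated with respect to the same forward pass.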