## Thursday, January 24, 2013

### Functional Feed-forward Neural Networks, Part 2: Backpropagation Learning

This time I will deal with the learning problem. Feed-forward neural networks operate in two distinguishable ways, the first being the feed-forward computation, in which the network is presented some input data in the form of a vector and this input is passed through the network to yield an output. This is what I talked about in the first part of this blog series and can be found here: Functional Feed-forward Neural Networks, Part 1: Setting it up.
In the second operational mode the network is presented some input along with a desired output – these are called target or training values – and the goal is to change the parameters of the network to bring the computed output values as close as possible to the target values. In this sense changing the parameters means learning the weights (and bias values) of the network.
Maybe the most fundamental learning algorithm for feed-forward NNs is the so called Backpropagation technique, and I will use that technique here to solve a simple nonlinear regression learning problem.

### How does Backpropagation work?

The general idea behind Backpropagation is pretty intuitive:
1. The network is presented a test input vector (with known output).
2. The input is propagated through the network.
3. The actual output is compared to the desired output (target value). The difference between them is the error.
4. While propagating the error the whole way back to the input layer the weights are updated according to their influence on the error.
The error for a particular input vector $n$ is: $$E_n = \frac{1}{2}\sum_k(y_k(\mathbf{x}_n, \mathbf{w}) - t_{nk})^2$$ The overall error (over all input patterns): $$E(\mathbf{w}) = \sum_{n=1}^N E_n(\mathbf{w})$$ We are now interested in the partial derivatives of this error function $E_n$ with respect to weight $w_{ji}$. The second part of the following equation shows the chain rule for partial derivatives, $z_i$ denoting the output or activation of unit $i$ and $\delta_j$ the error signal of unit $j$. $$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = -\delta_j z_i$$ That leads to the following delta weight rule, $\eta$ denoting the learning rate: $$\Delta w_{ji} = -\eta \frac{\partial E_n}{\partial w_{ji}} = \eta \delta_j z_i$$ with $$\delta_j = \begin{cases} h'(a_j)(y_j-t_j) & \text{ if } j \text{ is output neuron} \\ h'(a_j)\sum_{k}w_{kj}\delta_k & \text{ if } j \text{ is hidden neuron} \end{cases}$$ Following figure (again taken from Bishop's PRML book) illustrates the calculation of the error signal $\delta_j$ for the hidden unit $j$: The $\delta$'s from units $k$ are propagated back according to the weights $w_{kj}$.

The weight update rule then looks as follows: $$w_{ji}^{(t)} = w_{ji}^{(t-1)} + \Delta w_{ji}$$

### Improved convergence properties

However, in the code below I will use a slightly different delta weight rule that adds a so called momentum term. The idea behind this is to flatten out possible oscillations for successive values of $\Delta w_{ji}$ that occur particularly during stochastic or on-line learning (as opposed to batch learning), which is what I will do here. It's therefore supposed to improve overall convergence properties of the method. The adapted delta weight then looks as follows: $$\Delta w_{ji}^{(t)} = \eta \delta_j z_i + \alpha \Delta w_{ji}^{(t-1)}$$ The negative side of this is a slight increase of complexity since for instance we have to store the old $\Delta w_{ji}$ values and we have to provide a value not only for $\eta$ but also for $\alpha$.

### A BackProp Implementation in F#

This section shows a possible implementation of the backpropagation technique with the momentum term improvement. I will first discuss the forward propagation, then the backward propagation in order to compute the gradients and finally the stochastic learning protocol to actually train an arbitrary network.

### The way forward

As stated above the first thing we have to do is to present the network an input pattern and forward propagate it through the network to obtain the output. Following code allows me to do that and get the output along with all the necessary intermediate results.
/// combine weights and activation functions in a type
type NnetProperties = {
Weights : Matrix<float> list
Activations : (float -> float) list
}

/// compute the derivative of a function, midpoint rule
let derivative eps f =
fun x -> ((f (x + eps/2.0) - f (x - eps/2.0)) / eps)

/// precision for calculating the derivatives
let prc = 1e-6

/// returns list of (out, out') vectors per layer
let feedforward (netProps : NnetProperties) input =
List.fold
(fun (os : (Vector<_> * Vector<_>) list) (W, f) ->
let prevLayerOutput =
match os.IsEmpty with
| true -> input
let prevOut = prependForBias prevLayerOutput
let layerInput = W * prevOut
(layerInput |> Vector.map f,
layerInput |> Vector.map (derivative prc f)) :: os)
[] (List.zip netProps.Weights netProps.Activations)

For convenience the NnetProperties type combines the network properties into one place. Since later on we will need not only the network output but all outputs of all the units including the values of their derivatives, the function feedforward constructs a list of tuples of two vectors containing these values. It's almost the same as the layer function from the previous post, but with the difference that we store all the intermediate results results from all the hidden and output layers.

### And then the way back

In order to compute the delta weight matrices I first have to compute the error signals $\delta_j$:
/// matlab like pointwise multiply
let (.*) (a : Vector<float>) (b : Vector<float>) =
a.PointwiseMultiply(b)

/// computes the error signals per layer
/// starting at output layer towards first hidden layer
let errorSignals (Ws : Matrix<_> list) layeroutputs target =
let trp = fun (W : Matrix<_>) -> Some(W.Transpose())

// need weights and layer outputs in reverse order,
// e.g starting from output layer
let weightsAndOutputs =
let transposed = Ws |> List.tail |> List.map trp |> List.rev
List.zip (None :: transposed) layeroutputs

List.fold (fun prevDs ((W : Matrix<_> option), (o, o')) ->
match W with
| None    -> (o' .* (target - o)) :: prevDs
| Some(W) -> let ds = prevDs.Head
(o' .* ((W * ds)).[1..]) :: prevDs)
[] weightsAndOutputs


The errorSignals function returns a list of vectors containing all the error signals for every neuron, sorted by layer. Because the order of the computation is from output towards input layer and the structure is a list, the resulting $\delta_j$'s are ordered the other way around, i.e. from first hidden layer towards output layer.
The next thing we want to compute is all the gradients, i.e. $\partial E_n / \partial w_{ji}$. As we have seen from the formulas above this is simply the outer product of the error signals ($\delta_j$) and the net outputs ($z_i$):
/// computes a list of gradients matrices
let gradients (Ws : Matrix<_> list) layeroutputs input target =
let actualOuts =
layeroutputs |> List.unzip |> fst |> List.tail |> List.rev
let signals = errorSignals Ws layeroutputs target
(input :: actualOuts, signals)
||> List.zip
|> List.map (fun (zs, ds) ->
ds.OuterProduct(prependForBias zs))

Quite nice is the fact that the gradients function returns a list of matrices with the same dimensions as the weights matrices.

### Updating the weights

After computing the gradients we have everything in place so that we can now calculate the $\Delta w_{ji}$ and update the weights $w_{ji}$. As you can see in the following snippet, the updateWeights function takes three lists as input: The weights, the gradients and the delta weights from the previous time step. All it does is a List.map over the zipped lists calculating the delta weights and the new weights according to the update rule above.
let eta = 0.8
let alpha = 0.25

/// updates the weights matrices with the given deltas
/// of timesteps (t) and (t-1)
/// returns the new weights matrices
let updateWeights Ws (Gs : Matrix<_> list) (prevDs : Matrix<_> list) =
(List.zip3 Ws Gs prevDs)
|> List.map (fun (W, G, prevD) ->
let dW = eta * G + (alpha * prevD)
W + dW, dW)

In addition to the new weights the function updateWeights returns the delta weights $\Delta w_{ji}^{(t)}$ because we need them in the next calculation (momentum, see above). That's pretty much all there is and I can now go ahead and implement the training protocol.

### Training in epoches

As mentioned above I will use the so called stochastic learning protocol. According to this protocol the input patterns from the training set are fed into the network in random order. Usually this is done by choosing from the samples with replacement. The weights are updated in each step, which means about the following:
1. Randomly choose an input pattern from the training set
2. Forward propagate the pattern through the network
3. Compute the delta weights
4. Update the weights
Translated into code that could look as follows:
/// for each weight matrix builds another matrix with same dimension
/// initialized with 0.0
let initDeltaWeights (Ws : Matrix<_> list) =
Ws |> List.map (fun W ->
DenseMatrix(W.RowCount, W.ColumnCount, 0.0) :> Matrix<float>)

let step netProps prevDs input target =
let layeroutputs = feedforward netProps input
let Gs = gradients netProps.Weights layeroutputs input target
(updateWeights netProps.Weights Gs prevDs)

let train netProps samples epoches =
let count = samples |> Array.length
let rnd = Random(seed)
let Ws, fs = netProps.Weights, netProps.Activations
let rec loop Ws Ds i =
match i < (epoches * count) with
| true ->
let inp, trg = samples.[rnd.Next(count)]
let netProps = { Weights = Ws; Activations = fs }
let ws, ds = List.unzip (step netProps Ds inp trg)
loop ws ds (i + 1)
| _    -> Ws
let Ws' = loop Ws (initDeltaWeights Ws) 0
{ netProps with Weights = Ws' }

The initDeltaWeights function is just a helper function to initialize the delta weights matrices with all zeros at the beginning of the training. train takes three arguments: a NnetProperties record with all the $w_{ji}$ initialized to random floats, a training set samples with input patterns and the corresponding target values and a number of epoches as an indication for how long we are willing to train the network.
The actual work is done in the inner recursive loop function. An arbitrary input pattern and corresponding target value is picked from the sample, a NnetProperties record is constructed with the given weights and activation functions and then step is called to update and return the new weights and delta weights. The function returns a new NnetProperties record with the learned weights. To keep the following test runs reproducible I instantiate the Random with a seed.

### A Simple Regression Example

To show that everything is working properly I will demonstrate it using a simple regression problem. Let's take the nonlinear function $sin(6x)$ in the range from -0.5 to 0.5 as the underlying process. From that I'll produce a random set containing 25 training samples with some noise added. Then I will construct a neural network with one hidden layer that contains 15 neurons (1-15-1 configuration). This network will be trained a number of times, each time with a different number of epoches.
The code that does this:
/// returns the output vector from a given list of layer outputs
let netoutput (layeroutputs : ('a * 'a) list) =

/// computes the output error from a
/// given target and an actual output vector
let error (target : Vector<_>) (output : Vector<_>) =
((target - output)
|> Vector.map (fun x -> x * x)
|> Vector.toArray
|> Array.sum) / 2.0

let initWeights rows cols f =
DenseMatrix(Array2D.init rows cols (fun _ _ -> f()))
:> Matrix<float>

let targetFun = fun x -> sin (6.0 * x)

let computeResults netProps trainingset epoches =
let netProps' = train netProps trainingset epoches
let setSize = trainingset.Length

let error =
trainingset
|> Array.fold (fun E (x, t) ->
let outs = feedforward netProps' x
let En = error t (netoutput outs)
E + En) 0.0

let outputs =
[-0.5 .. 0.01 .. 0.5]
|> List.fold (fun outs x ->
let layeroutputs =
feedforward netProps' (vector [x])
let o = (netoutput layeroutputs).At 0
(x,o) :: outs) []
|> List.rev

(error / (float setSize), outputs)

let experimentSetting() =
let rnd = Random(seed)
let randZ() = rnd.NextDouble() - 0.5

let samples =
[| for i in 1 .. 25 -> randZ() |]
|> Array.map(fun x ->
x, targetFun(x) + 0.15 * randZ())

let trainingSet =
samples
|> Array.map (fun (x,y) -> vector [x], vector [y])

let Wih = initWeights 15 2 randZ
let Who = initWeights 1 16 randZ
let netProps = { Weights = [Wih; Who];
Activations = [sigmoid; tanh]}
(samples, trainingSet, netProps)

let testRun experiment listOfEpoches =
let samples, ts, netProps = experiment()
let data =
listOfEpoches
|> List.fold (fun acc N ->
let error, outs = computeResults netProps ts N
printfn "mean error after %d epoches: %f" N error
outs :: acc) []
|> List.rev

(samples, data)

Let's briefly go through the code: netoutput returns the actual output vector when given the layer outputs from the feedforward function (which happens to be the first part of the first tuple in the sequence). error computes an error given an output and a target vector.
Usually training starts with an untrained network, e.g. the actual weights and biases are initialized to random values. That's what initWeights and randZ do. targetFun simply is our target function, in our case $sin(6x)$.
computeResults trains a given network with a given training set a given number of epoches and computes a mean error and produces the outputs from the network in the same range as our target function, namely in -0.5 to 0.5. This data is used for the plot below.
experimentSetting defines an experiment – it builds up a training set as mentioned above and constructs a 1-15-1 neural network with randomly initialized weights. Finally the testRun function takes an experiment setting and a list of epoches to train the network and collects the output data.
In F# Interactive the test run can now be started.
> let samples, data = testRun experimentSetting [75; 500; 2000];;

mean error after 75 epoches: 0.043278
mean error after 500 epoches: 0.017267
mean error after 2000 epoches: 0.001506

val samples : (float * float) [] =
[|(-0.02707827395, -0.1465612293); (-0.4550698469, -0.4159703715);
(-0.04310941535, -0.2552857918); (0.4060334279, 0.6611061557);
...
val data : (float * float) list list =
[[(-0.5, -0.9956174648); (-0.49, -0.9952092317); (-0.48, -0.9947566762);
(-0.47, -0.9942545452); (-0.46, -0.993696933); (-0.45, -0.9930771989);
...


The following plot shows the result (click to enlarge) It can be seen that the network is learning the "nature" of the unterlying hidden data generating process. And that's what we wanted to achieve.

### Conclusion

Cool. It kind of works. In this post I presented a working implementation of an improved Backpropagation technique, interpreted in a functional way.
The Standard Backpropagation technique as the most fundamental learning algorithm for artificial neural networks sort of stands at the beginning of the success story of ANNs. I think it was also the first technique that was able to train multilayer neural networks at all. However, Backpropagation has also some severe drawbacks, the first being the fact that one has to define the learning rate $\eta$ (and in case of the improved version with the additional momentum term used here $\alpha$ as well!) – it turned out that this is never an easy task. If at all possible.

### What's next?

Today there are many other learning techniques around. One of them is called Resilient Backpropagation and is known as one of the best performing first-order learning algorithms in terms of learning speed and generalization properties. And it even relieves the user from deciding on a learning rate. That sounds promising, doesn't it? I think I will take a look at iRprop+ next time.

#### 1 comment:

1. Great post. I look forward to your next post on iRprop+