In the second operational mode the network is presented some input along with a desired output – these are called

**target**or

**training**values – and the goal is to change the parameters of the network to bring the computed output values as close as possible to the target values. In this sense changing the parameters means

**learning**the weights (and bias values) of the network.

Maybe the most fundamental learning algorithm for feed-forward NNs is the so called

**Backpropagation technique**, and I will use that technique here to solve a simple nonlinear regression learning problem.

### How does Backpropagation work?

The general idea behind Backpropagation is pretty intuitive:- The network is presented a test input vector (with known output).
- The input is propagated through the network.
- The actual output is compared to the desired output (target value). The difference between them is the error.
- While propagating the error the whole way back to the input layer the weights are updated according to their influence on the error.

The weight update rule then looks as follows: $$w_{ji}^{(t)} = w_{ji}^{(t-1)} + \Delta w_{ji}$$

### Improved convergence properties

However, in the code below I will use a slightly different delta weight rule that adds a so called**momentum**term. The idea behind this is to flatten out possible oscillations for successive values of $\Delta w_{ji}$ that occur particularly during

**stochastic**or

**on-line**learning (as opposed to batch learning), which is what I will do here. It's therefore supposed to improve overall convergence properties of the method. The adapted delta weight then looks as follows: $$\Delta w_{ji}^{(t)} = \eta \delta_j z_i + \alpha \Delta w_{ji}^{(t-1)}$$ The negative side of this is a slight increase of complexity since for instance we have to store the old $\Delta w_{ji}$ values and we have to provide a value not only for $\eta$ but also for $\alpha$.

### A BackProp Implementation in F#

This section shows a possible implementation of the backpropagation technique with the momentum term improvement. I will first discuss the forward propagation, then the backward propagation in order to compute the gradients and finally the stochastic learning protocol to actually train an arbitrary network.### The way forward

As stated above the first thing we have to do is to present the network an input pattern and forward propagate it through the network to obtain the output. Following code allows me to do that and get the output along with all the necessary intermediate results.For convenience the/// combine weights and activation functions in a typetype NnetProperties = { Weights : Matrix<float> list Activations : (float -> float) list }/// compute the derivative of a function, midpoint rulelet derivative eps f = fun x -> ((f (x + eps/2.0) - f (x - eps/2.0)) / eps)/// precision for calculating the derivativeslet prc = 1e-6/// returns list of (out, out') vectors per layerlet feedforward (netProps : NnetProperties) input = List.fold (fun (os : (Vector<_> * Vector<_>) list) (W, f) -> let prevLayerOutput = match os.IsEmpty with | true -> input | _ -> fst (os.Head) let prevOut = prependForBias prevLayerOutput let layerInput = W * prevOut (layerInput |> Vector.map f, layerInput |> Vector.map (derivative prc f)) :: os) [] (List.zip netProps.Weights netProps.Activations)

`NnetProperties`

type combines the network properties into one place. Since later on we will need not only the network output but all outputs of all the units including the values of their derivatives, the function `feedforward`

constructs a list of tuples of two vectors containing these values. It's almost the same as the `layer`

function from the previous post, but with the difference that we store all the intermediate results results from all the hidden and output layers.### And then the way back

In order to compute the delta weight matrices I first have to compute the error signals $\delta_j$:The/// matlab like pointwise multiplylet (.*) (a : Vector<float>) (b : Vector<float>) = a.PointwiseMultiply(b)/// computes the error signals per layer /// starting at output layer towards first hidden layerlet errorSignals (Ws : Matrix<_> list) layeroutputs target = let trp = fun (W : Matrix<_>) -> Some(W.Transpose())// need weights and layer outputs in reverse order, // e.g starting from output layerlet weightsAndOutputs = let transposed = Ws |> List.tail |> List.map trp |> List.rev List.zip (None :: transposed) layeroutputs List.fold (fun prevDs ((W : Matrix<_> option), (o, o')) -> match W with | None -> (o' .* (target - o)) :: prevDs | Some(W) -> let ds = prevDs.Head (o' .* ((W * ds)).[1..]) :: prevDs) [] weightsAndOutputs

`errorSignals`

function returns a list of vectors containing all the error signals for every neuron, sorted by layer. Because the order of the computation is from output towards input layer and the structure is a list, the resulting $\delta_j$'s are ordered the other way around, i.e. from first hidden layer towards output layer. The next thing we want to compute is all the gradients, i.e. $\partial E_n / \partial w_{ji}$. As we have seen from the formulas above this is simply the outer product of the error signals ($\delta_j$) and the net outputs ($z_i$):

Quite nice is the fact that the/// computes a list of gradients matriceslet gradients (Ws : Matrix<_> list) layeroutputs input target = let actualOuts = layeroutputs |> List.unzip |> fst |> List.tail |> List.rev let signals = errorSignals Ws layeroutputs target (input :: actualOuts, signals) ||> List.zip |> List.map (fun (zs, ds) -> ds.OuterProduct(prependForBias zs))

`gradients`

function returns a list of matrices with the same dimensions as the weights matrices. ### Updating the weights

After computing the gradients we have everything in place so that we can now calculate the $\Delta w_{ji}$ and update the weights $w_{ji}$. As you can see in the following snippet, the`updateWeights`

function takes three lists as input: The weights, the gradients and the delta weights from the previous time step. All it does is a `List.map`

over the zipped lists calculating the delta weights and the new weights according to the update rule above.
let eta = 0.8 let alpha = 0.25In addition to the new weights the function/// updates the weights matrices with the given deltas /// of timesteps (t) and (t-1) /// returns the new weights matriceslet updateWeights Ws (Gs : Matrix<_> list) (prevDs : Matrix<_> list) = (List.zip3 Ws Gs prevDs) |> List.map (fun (W, G, prevD) -> let dW = eta * G + (alpha * prevD) W + dW, dW)

`updateWeights`

returns the delta weights $\Delta w_{ji}^{(t)}$ because we need them in the next calculation (momentum, see above). That's pretty much all there is and I can now go ahead and implement the training protocol.
### Training in epoches

As mentioned above I will use the so called**stochastic learning**protocol. According to this protocol the input patterns from the training set are fed into the network in random order. Usually this is done by choosing from the samples with replacement. The weights are updated in each step, which means about the following:

- Randomly choose an input pattern from the training set
- Forward propagate the pattern through the network
- Compute the delta weights
- Update the weights

The/// for each weight matrix builds another matrix with same dimension /// initialized with 0.0let initDeltaWeights (Ws : Matrix<_> list) = Ws |> List.map (fun W -> DenseMatrix(W.RowCount, W.ColumnCount, 0.0) :> Matrix<float>) let step netProps prevDs input target = let layeroutputs = feedforward netProps input let Gs = gradients netProps.Weights layeroutputs input target (updateWeights netProps.Weights Gs prevDs) let train netProps samples epoches = let count = samples |> Array.length let rnd = Random(seed) let Ws, fs = netProps.Weights, netProps.Activations let rec loop Ws Ds i = match i < (epoches * count) with | true -> let inp, trg = samples.[rnd.Next(count)] let netProps = { Weights = Ws; Activations = fs } let ws, ds = List.unzip (step netProps Ds inp trg) loop ws ds (i + 1) | _ -> Ws let Ws' = loop Ws (initDeltaWeights Ws) 0 { netProps with Weights = Ws' }

`initDeltaWeights`

function is just a helper function to initialize the delta weights matrices with all zeros at the beginning of the training. `train`

takes three arguments: a `NnetProperties`

record with all the $w_{ji}$ initialized to random floats, a training set `samples`

with input patterns and the corresponding target values and a number of epoches as an indication for how long we are willing to train the network.The actual work is done in the inner recursive loop function. An arbitrary input pattern and corresponding target value is picked from the sample, a

`NnetProperties`

record is constructed with the given weights and activation functions and then `step`

is called to update and return the new weights and delta weights. The function returns a new `NnetProperties`

record with the learned weights. To keep the following test runs reproducible I instantiate the `Random`

with a seed.### A Simple Regression Example

To show that everything is working properly I will demonstrate it using a simple regression problem. Let's take the nonlinear function $sin(6x)$ in the range from -0.5 to 0.5 as the underlying process. From that I'll produce a random set containing 25 training samples with some noise added. Then I will construct a neural network with one hidden layer that contains 15 neurons (1-15-1 configuration). This network will be trained a number of times, each time with a different number of epoches.The code that does this:

Let's briefly go through the code:/// returns the output vector from a given list of layer outputslet netoutput (layeroutputs : ('a * 'a) list) = fst (layeroutputs.Head)/// computes the output error from a /// given target and an actual output vectorlet error (target : Vector<_>) (output : Vector<_>) = ((target - output) |> Vector.map (fun x -> x * x) |> Vector.toArray |> Array.sum) / 2.0 let initWeights rows cols f = DenseMatrix(Array2D.init rows cols (fun _ _ -> f())) :> Matrix<float> let targetFun = fun x -> sin (6.0 * x) let computeResults netProps trainingset epoches = let netProps' = train netProps trainingset epoches let setSize = trainingset.Length let error = trainingset |> Array.fold (fun E (x, t) -> let outs = feedforward netProps' x let En = error t (netoutput outs) E + En) 0.0 let outputs = [-0.5 .. 0.01 .. 0.5] |> List.fold (fun outs x -> let layeroutputs = feedforward netProps' (vector [x]) let o = (netoutput layeroutputs).At 0 (x,o) :: outs) [] |> List.rev (error / (float setSize), outputs) let experimentSetting() = let rnd = Random(seed) let randZ() = rnd.NextDouble() - 0.5 let samples = [| for i in 1 .. 25 -> randZ() |] |> Array.map(fun x -> x, targetFun(x) + 0.15 * randZ()) let trainingSet = samples |> Array.map (fun (x,y) -> vector [x], vector [y]) let Wih = initWeights 15 2 randZ let Who = initWeights 1 16 randZ let netProps = { Weights = [Wih; Who]; Activations = [sigmoid; tanh]} (samples, trainingSet, netProps) let testRun experiment listOfEpoches = let samples, ts, netProps = experiment() let data = listOfEpoches |> List.fold (fun acc N -> let error, outs = computeResults netProps ts N printfn "mean error after %d epoches: %f" N error outs :: acc) [] |> List.rev (samples, data)

`netoutput`

returns the actual output vector when given the layer outputs from the `feedforward`

function (which happens to be the first part of the first tuple in the sequence). `error`

computes an error given an output and a target vector. Usually training starts with an untrained network, e.g. the actual weights and biases are initialized to random values. That's what

`initWeights`

and `randZ`

do. `targetFun`

simply is our target function, in our case $sin(6x)$.`computeResults`

trains a given network with a given training set a given number of epoches and computes a mean error and produces the outputs from the network in the same range as our target function, namely in -0.5 to 0.5. This data is used for the plot below. `experimentSetting`

defines an experiment – it builds up a training set as mentioned above and constructs a 1-15-1 neural network with randomly initialized weights. Finally the `testRun`

function takes an experiment setting and a list of epoches to train the network and collects the output data.In F# Interactive the test run can now be started.

> let samples, data = testRun experimentSetting [75; 500; 2000];; mean error after 75 epoches: 0.043278 mean error after 500 epoches: 0.017267 mean error after 2000 epoches: 0.001506 val samples : (float * float) [] = [|(-0.02707827395, -0.1465612293); (-0.4550698469, -0.4159703715); (-0.04310941535, -0.2552857918); (0.4060334279, 0.6611061557); ... val data : (float * float) list list = [[(-0.5, -0.9956174648); (-0.49, -0.9952092317); (-0.48, -0.9947566762); (-0.47, -0.9942545452); (-0.46, -0.993696933); (-0.45, -0.9930771989); ...The following plot shows the result (click to enlarge) It can be seen that the network is learning the "nature" of the unterlying hidden data generating process. And that's what we wanted to achieve.

### Conclusion

Cool. It kind of works. In this post I presented a working implementation of an improved Backpropagation technique, interpreted in a functional way.The Standard Backpropagation technique as the most fundamental learning algorithm for artificial neural networks sort of stands at the beginning of the success story of ANNs. I think it was also the first technique that was able to train multilayer neural networks at all. However, Backpropagation has also some severe drawbacks, the first being the fact that one has to define the learning rate $\eta$ (and in case of the improved version with the additional momentum term used here $\alpha$ as well!) – it turned out that this is never an easy task. If at all possible.

### What's next?

Today there are many other learning techniques around. One of them is called**Resilient Backpropagation**and is known as one of the best performing first-order learning algorithms in terms of learning speed and generalization properties. And it even relieves the user from deciding on a learning rate. That sounds promising, doesn't it? I think I will take a look at

**iRprop+**next time.

Great post. I look forward to your next post on iRprop+

ReplyDelete