In the second operational mode the network is presented with some input along with a desired output – the so-called target or training values – and the goal is to change the parameters of the network to bring the computed output values as close as possible to the target values. In this sense, changing the parameters means learning the weights (and bias values) of the network.
Probably the most fundamental learning algorithm for feed-forward NNs is the so-called Backpropagation technique, and I will use it here to solve a simple nonlinear regression learning problem.
How does Backpropagation work?
The general idea behind Backpropagation is pretty intuitive:

- The network is presented with a training input vector (with known target output).
- The input is propagated through the network.
- The actual output is compared to the desired output (target value); the difference between them is the error.
- The error is propagated all the way back to the input layer, and along the way the weights are updated according to their influence on the error.
The weight update rule then looks as follows: $$w_{ji}^{(t)} = w_{ji}^{(t-1)} + \Delta w_{ji}$$
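For reference – and consistent with the momentum variant below – the plain delta weight rule of standard backpropagation is $$\Delta w_{ji} = \eta\, \delta_j z_i$$ where $\eta$ is the learning rate, $\delta_j$ is the error signal of the receiving unit $j$ and $z_i$ is the output of the sending unit $i$.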
Improved convergence properties
However, in the code below I will use a slightly different delta weight rule that adds a so-called momentum term. The idea behind this is to flatten out oscillations in successive values of $\Delta w_{ji}$ that occur particularly during stochastic or online learning (as opposed to batch learning), which is what I will do here. It is therefore supposed to improve the overall convergence properties of the method. The adapted delta weight rule looks as follows: $$\Delta w_{ji}^{(t)} = \eta \delta_j z_i + \alpha \Delta w_{ji}^{(t-1)}$$ The downside is a slight increase in complexity: for instance, we have to store the old $\Delta w_{ji}$ values, and we have to provide a value not only for $\eta$ but also for $\alpha$.
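To see the damping effect in isolation, here is a tiny scalar sketch (not part of the network code below; `momentumSteps` is a hypothetical helper) that applies the rule to a sequence of raw products $\delta_j z_i$ of alternating sign:

```fsharp
// applies dW(t) = eta * raw(t) + alpha * dW(t-1) along a list of raw
// deltas; alternating signs get damped instead of oscillating fully
let momentumSteps eta alpha rawDeltas =
    rawDeltas
    |> List.scan (fun prevDw raw -> eta * raw + alpha * prevDw) 0.0
    |> List.tail

// momentumSteps 0.8 0.25 [0.5; -0.5; 0.5; -0.5]
// => [0.4; -0.3; 0.325; -0.31875]
```

A BackProp Implementation in F#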
This section shows a possible implementation of the backpropagation technique with the momentum term improvement. I will first discuss the forward propagation, then the backward propagation in order to compute the gradients, and finally the stochastic learning protocol to actually train an arbitrary network.

The way forward
As stated above, the first thing we have to do is to present the network with an input pattern and forward propagate it through the network to obtain the output. The following code allows me to do that and get the output along with all the necessary intermediate results:

```fsharp
/// combine weights and activation functions in a type
type NnetProperties = {
    Weights : Matrix<float> list
    Activations : (float -> float) list
}

/// compute the derivative of a function, midpoint rule
let derivative eps f = fun x -> ((f (x + eps/2.0) - f (x - eps/2.0)) / eps)

/// precision for calculating the derivatives
let prc = 1e-6

/// returns list of (out, out') vectors per layer
let feedforward (netProps : NnetProperties) input =
    List.fold (fun (os : (Vector<_> * Vector<_>) list) (W, f) ->
        let prevLayerOutput =
            match os.IsEmpty with
            | true -> input
            | _    -> fst (os.Head)
        let prevOut = prependForBias prevLayerOutput
        let layerInput = W * prevOut
        (layerInput |> Vector.map f,
         layerInput |> Vector.map (derivative prc f)) :: os)
        [] (List.zip netProps.Weights netProps.Activations)
```
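Note that `feedforward` relies on `prependForBias` from the previous post, which is not repeated here. A minimal stand-in, assuming Math.NET Numerics with its F# extensions, could look like this:

```fsharp
/// prepends a constant 1.0 entry to a vector, so that the first
/// column of every weight matrix acts as the bias weights
let prependForBias (v : Vector<float>) =
    vector (1.0 :: (v |> Vector.toList))
```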
For convenience, the `NnetProperties` type combines the network properties in one place. Since later on we will need not only the network output but the outputs of all the units, including the values of their derivatives, the `feedforward` function constructs a list of tuples of two vectors containing these values. It's almost the same as the `layer` function from the previous post, but with the difference that we store all the intermediate results from all the hidden and output layers.

And then the way back
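For reference, the error signals computed by the code below are the standard backpropagation ones for a sum-of-squares error: $$\delta_j = f'(a_j)\,(t_j - o_j) \quad \text{for output units}, \qquad \delta_j = f'(a_j)\sum_k w_{kj}\,\delta_k \quad \text{for hidden units},$$ where $a_j$ is the weighted input of unit $j$.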
In order to compute the delta weight matrices I first have to compute the error signals $\delta_j$:

```fsharp
/// matlab like pointwise multiply
let (.*) (a : Vector<float>) (b : Vector<float>) = a.PointwiseMultiply(b)

/// computes the error signals per layer
/// starting at output layer towards first hidden layer
let errorSignals (Ws : Matrix<_> list) layeroutputs target =
    let trp = fun (W : Matrix<_>) -> Some(W.Transpose())

    // need weights and layer outputs in reverse order,
    // e.g. starting from output layer
    let weightsAndOutputs =
        let transposed = Ws |> List.tail |> List.map trp |> List.rev
        List.zip (None :: transposed) layeroutputs

    List.fold (fun prevDs ((W : Matrix<_> option), (o, o')) ->
        match W with
        | None    -> (o' .* (target - o)) :: prevDs
        | Some(W) -> let ds = prevDs.Head
                     (o' .* (W * ds).[1..]) :: prevDs)
        [] weightsAndOutputs
```
The `errorSignals` function returns a list of vectors containing all the error signals for every neuron, sorted by layer. Because the order of the computation is from the output layer towards the input layer and the structure is a list, the resulting $\delta_j$'s are ordered the other way around, i.e. from the first hidden layer towards the output layer. The next thing we want to compute is all the gradients, i.e. $\partial E_n / \partial w_{ji}$. As we can see from the formulas above, this is simply the outer product of the error signals ($\delta_j$) and the net outputs ($z_i$):
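Written per layer (a restatement consistent with the code below, not a formula from the original text): collecting the error signals of a layer in $\boldsymbol{\delta}$ and the bias-prepended inputs to that layer in $\mathbf{z}$, the whole delta matrix is the outer product $$G = \boldsymbol{\delta}\,\mathbf{z}^{\mathsf{T}}, \qquad G_{ji} = \delta_j\, z_i.$$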
```fsharp
/// computes a list of gradients matrices
let gradients (Ws : Matrix<_> list) layeroutputs input target =
    let actualOuts =
        layeroutputs |> List.unzip |> fst |> List.tail |> List.rev
    let signals = errorSignals Ws layeroutputs target
    (input :: actualOuts, signals)
        ||> List.zip
        |> List.map (fun (zs, ds) -> ds.OuterProduct(prependForBias zs))
```
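Since analytic backprop gradients are easy to get wrong, a quick finite-difference comparison is a useful sanity check. The following sketch is not from the original code; `checkGradient` is a hypothetical helper that perturbs a single weight and estimates the resulting error change numerically:

```fsharp
/// hypothetical helper: finite-difference estimate for the entry (j, i)
/// of the delta matrix of layer l, to compare against `gradients`
let checkGradient netProps (input : Vector<float>) (target : Vector<float>) l j i =
    let eps = 1e-5
    // sum-of-squares error of the network on one sample
    let errorOf (props : NnetProperties) =
        let o = fst (feedforward props input).Head
        ((target - o) |> Vector.map (fun x -> x * x)
                      |> Vector.toArray |> Array.sum) / 2.0
    // copy of the network with weight (j, i) in layer l shifted by d
    let perturb d =
        let Ws =
            netProps.Weights
            |> List.mapi (fun k (W : Matrix<float>) ->
                if k = l then
                    let W' = W.Clone()
                    W'.[j, i] <- W.[j, i] + d
                    W'
                else W)
        { netProps with Weights = Ws }
    // negated central difference: with delta_j built from (t - o), the
    // matrices from `gradients` equal -dE/dw, so this matches their sign
    -((errorOf (perturb eps) - errorOf (perturb (-eps))) / (2.0 * eps))
```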
Quite nice is the fact that the `gradients` function returns a list of matrices with the same dimensions as the weights matrices.

Updating the weights
After computing the gradients we have everything in place to calculate the $\Delta w_{ji}$ and update the weights $w_{ji}$. As you can see in the following snippet, the `updateWeights` function takes three lists as input: the weights, the gradients and the delta weights from the previous time step. All it does is a `List.map` over the zipped lists, calculating the delta weights and the new weights according to the update rule above.
```fsharp
let eta = 0.8
let alpha = 0.25

/// updates the weights matrices with the given deltas
/// of timesteps (t) and (t-1)
/// returns the new weights matrices
let updateWeights Ws (Gs : Matrix<_> list) (prevDs : Matrix<_> list) =
    (List.zip3 Ws Gs prevDs)
    |> List.map (fun (W, G, prevD) ->
        let dW = eta * G + (alpha * prevD)
        W + dW, dW)
```

In addition to the new weights, the `updateWeights` function returns the delta weights $\Delta w_{ji}^{(t)}$ because we need them in the next calculation (momentum, see above). That's pretty much all there is, and I can now go ahead and implement the training protocol.
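For illustration, a single update step might be wired up like this (the names `Ws`, `Gs` and `prevDWs` are hypothetical placeholders; the actual wiring appears in `step` below):

```fsharp
// one update step: unzip the (new weight, delta) pairs and keep the
// deltas around for the momentum term of the next step
let newWs, newDWs = updateWeights Ws Gs prevDWs |> List.unzip
```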
Training in epochs
As mentioned above, I will use the so-called stochastic learning protocol. According to this protocol the input patterns from the training set are fed into the network in random order; usually this is done by choosing from the samples with replacement. The weights are updated in each step, which means roughly the following:

- Randomly choose an input pattern from the training set
- Forward propagate the pattern through the network
- Compute the delta weights
- Update the weights
```fsharp
/// for each weight matrix builds another matrix with same dimension
/// initialized with 0.0
let initDeltaWeights (Ws : Matrix<_> list) =
    Ws |> List.map (fun W ->
        DenseMatrix(W.RowCount, W.ColumnCount, 0.0) :> Matrix<float>)

let step netProps prevDs input target =
    let layeroutputs = feedforward netProps input
    let Gs = gradients netProps.Weights layeroutputs input target
    updateWeights netProps.Weights Gs prevDs

let train netProps samples epochs =
    let count = samples |> Array.length
    let rnd = Random(seed)
    let Ws, fs = netProps.Weights, netProps.Activations
    let rec loop Ws Ds i =
        match i < (epochs * count) with
        | true ->
            let inp, trg = samples.[rnd.Next(count)]
            let netProps = { Weights = Ws; Activations = fs }
            let ws, ds = List.unzip (step netProps Ds inp trg)
            loop ws ds (i + 1)
        | _    -> Ws
    let Ws' = loop Ws (initDeltaWeights Ws) 0
    { netProps with Weights = Ws' }
```
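Two helpers used here and in the example below come from the previous post and are not shown again: the random `seed` and the `sigmoid` activation. Minimal stand-ins might be (the seed value is an arbitrary assumption):

```fsharp
// any fixed integer keeps the training runs reproducible
let seed = 261

// standard logistic activation function
let sigmoid x = 1.0 / (1.0 + exp (-x))
```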
The `initDeltaWeights` function is just a helper to initialize the delta weights matrices with all zeros at the beginning of the training. `train` takes three arguments: a `NnetProperties` record with all the $w_{ji}$ initialized to random floats, a training set `samples` with input patterns and the corresponding target values, and a number of epochs as an indication of how long we are willing to train the network.

The actual work is done in the inner recursive `loop` function. An arbitrary input pattern and corresponding target value is picked from the samples, a `NnetProperties` record is constructed with the given weights and activation functions, and then `step` is called to compute and return the new weights and delta weights. The function returns a new `NnetProperties` record with the learned weights. To keep the following test runs reproducible I instantiate the `Random` with a seed.

A Simple Regression Example
To show that everything is working properly, I will demonstrate it using a simple regression problem. Let's take the nonlinear function $\sin(6x)$ in the range from -0.5 to 0.5 as the underlying process. From that I'll produce a random set containing 25 training samples with some noise added. Then I will construct a neural network with one hidden layer that contains 15 neurons (a 1-15-1 configuration). This network will be trained a number of times, each time with a different number of epochs. The code that does this:
```fsharp
/// returns the output vector from a given list of layer outputs
let netoutput (layeroutputs : ('a * 'a) list) =
    fst (layeroutputs.Head)

/// computes the output error from a
/// given target and an actual output vector
let error (target : Vector<_>) (output : Vector<_>) =
    ((target - output)
        |> Vector.map (fun x -> x * x)
        |> Vector.toArray
        |> Array.sum) / 2.0

let initWeights rows cols f =
    DenseMatrix(Array2D.init rows cols (fun _ _ -> f())) :> Matrix<float>

let targetFun = fun x -> sin (6.0 * x)

let computeResults netProps trainingset epochs =
    let netProps' = train netProps trainingset epochs
    let setSize = trainingset.Length
    let error =
        trainingset
        |> Array.fold (fun E (x, t) ->
            let outs = feedforward netProps' x
            let En = error t (netoutput outs)
            E + En) 0.0
    let outputs =
        [-0.5 .. 0.01 .. 0.5]
        |> List.fold (fun outs x ->
            let layeroutputs = feedforward netProps' (vector [x])
            let o = (netoutput layeroutputs).At 0
            (x, o) :: outs) []
        |> List.rev
    (error / (float setSize), outputs)

let experimentSetting() =
    let rnd = Random(seed)
    let randZ() = rnd.NextDouble() - 0.5
    let samples =
        [| for i in 1 .. 25 -> randZ() |]
        |> Array.map (fun x -> x, targetFun(x) + 0.15 * randZ())
    let trainingSet =
        samples |> Array.map (fun (x, y) -> vector [x], vector [y])
    let Wih = initWeights 15 2 randZ
    let Who = initWeights 1 16 randZ
    let netProps = { Weights = [Wih; Who]; Activations = [sigmoid; tanh] }
    (samples, trainingSet, netProps)

let testRun experiment listOfEpochs =
    let samples, ts, netProps = experiment()
    let data =
        listOfEpochs
        |> List.fold (fun acc N ->
            let error, outs = computeResults netProps ts N
            printfn "mean error after %d epochs: %f" N error
            outs :: acc) []
        |> List.rev
    (samples, data)
```

Let's briefly go through the code:
`netoutput` returns the actual output vector when given the layer outputs from the `feedforward` function (it happens to be the first part of the first tuple in the list). `error` computes an error value given a target and an actual output vector. Usually training starts with an untrained network, i.e. the weights and biases are initialized to random values; that's what `initWeights` and `randZ` do. `targetFun` simply is our target function, in our case $\sin(6x)$.

`computeResults` trains a given network with a given training set for a given number of epochs, computes a mean error and produces the outputs of the network over the same range as our target function, namely -0.5 to 0.5. This data is used for the plot below. `experimentSetting` defines an experiment – it builds up a training set as described above and constructs a 1-15-1 neural network with randomly initialized weights. Finally, the `testRun` function takes an experiment setting and a list of epoch counts to train the network, and collects the output data.

In F# Interactive the test run can now be started:
```
> let samples, data = testRun experimentSetting [75; 500; 2000];;
mean error after 75 epochs: 0.043278
mean error after 500 epochs: 0.017267
mean error after 2000 epochs: 0.001506

val samples : (float * float) [] =
  [|(-0.02707827395, -0.1465612293); (-0.4550698469, -0.4159703715);
    (-0.04310941535, -0.2552857918); (0.4060334279, 0.6611061557); ...

val data : (float * float) list list =
  [[(-0.5, -0.9956174648); (-0.49, -0.9952092317); (-0.48, -0.9947566762);
    (-0.47, -0.9942545452); (-0.46, -0.993696933); (-0.45, -0.9930771989); ...
```

The following plot shows the result. It can be seen that the network is learning the "nature" of the underlying hidden data-generating process, and that's what we wanted to achieve.
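To recreate such a plot, the `(x, output)` pairs in `data` can be exported and fed to any charting tool; a small hypothetical helper (not part of the original code) might look like this:

```fsharp
open System.IO

// writes (x, y) pairs as CSV lines, one file per training run
let dumpCsv (path : string) (points : (float * float) list) =
    points
    |> List.map (fun (x, y) -> sprintf "%f,%f" x y)
    |> List.toArray
    |> fun lines -> File.WriteAllLines(path, lines)
```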
Conclusion
Cool. It kind of works. In this post I presented a working implementation of an improved Backpropagation technique, interpreted in a functional way. Standard Backpropagation, as the most fundamental learning algorithm for artificial neural networks, stands at the beginning of the success story of ANNs; I think it was also the first technique that was able to train multilayer neural networks at all. However, Backpropagation also has some severe drawbacks, the first being the fact that one has to choose the learning rate $\eta$ (and, in the improved version with the additional momentum term used here, $\alpha$ as well!) – it turns out that this is never an easy task, if it is possible at all.