Thursday, January 24, 2013

Functional Feed-forward Neural Networks, Part 2: Backpropagation Learning

This time I will deal with the learning problem. Feed-forward neural networks operate in two distinct modes. The first is the feed-forward computation, in which the network is presented with some input data in the form of a vector and this input is passed through the network to yield an output. This is what I talked about in the first part of this blog series, which can be found here: Functional Feed-forward Neural Networks, Part 1: Setting it up.
In the second mode the network is presented with some input along with a desired output – the so-called target or training values – and the goal is to change the parameters of the network so that the computed output values come as close as possible to the target values. In this sense, changing the parameters means learning the weights (and bias values) of the network.
Probably the most fundamental learning algorithm for feed-forward NNs is the so-called backpropagation technique, and I will use it here to solve a simple nonlinear regression problem.

How does Backpropagation work?

The general idea behind Backpropagation is pretty intuitive:
  1. The network is presented with a training input vector (with known target output).
  2. The input is propagated through the network to compute the actual output.
  3. The actual output is compared to the desired output (the target value); the difference between them is the error.
  4. While propagating the error all the way back to the input layer, the weights are updated according to their influence on the error.
The error for a particular input vector $\mathbf{x}_n$ is:
$$E_n = \frac{1}{2}\sum_k\left(y_k(\mathbf{x}_n, \mathbf{w}) - t_{nk}\right)^2$$
The overall error, summed over all $N$ input patterns, is:
$$E(\mathbf{w}) = \sum_{n=1}^N E_n(\mathbf{w})$$
We are now interested in the partial derivative of this error function $E_n$ with respect to a weight $w_{ji}$. The second step of the following equation applies the chain rule for partial derivatives, with $a_j$ denoting the weighted input sum of unit $j$, $z_i$ the output (activation) of unit $i$, and $\delta_j = \partial E_n / \partial a_j$ the error signal of unit $j$:
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\,\frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i$$
Since $a_j = \sum_i w_{ji}\,z_i$, the second factor is simply $z_i$. That leads to the following weight change rule for gradient descent, with $\eta$ denoting the learning rate:
$$\Delta w_{ji} = -\eta\,\frac{\partial E_n}{\partial w_{ji}} = -\eta\,\delta_j z_i$$
with
$$\delta_j = \begin{cases} h'(a_j)\,(y_j - t_j) & \text{ if } j \text{ is an output neuron} \\ h'(a_j)\sum_{k}w_{kj}\,\delta_k & \text{ if } j \text{ is a hidden neuron} \end{cases}$$
The following figure (again taken from Bishop's PRML book) illustrates the calculation of the error signal $\delta_j$ for the hidden unit $j$: the $\delta$'s from units $k$ are propagated back according to the weights $w_{kj}$.
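To make the two cases concrete, here is a minimal F# sketch of the delta computation for a network with a single hidden layer. The function names and the representation of activations and weights as plain float lists are my own illustration and not taken from this series' implementation:

```fsharp
// Sketch of the delta computation for one training example, assuming a single
// hidden layer and plain float lists; names here are illustrative only.

// Error signal of the output units: delta_j = h'(a_j) * (y_j - t_j)
let outputDeltas (h' : float -> float) (aOut : float list) (y : float list) (t : float list) =
    List.map3 (fun a yj tj -> h' a * (yj - tj)) aOut y t

// Error signal of the hidden units: delta_j = h'(a_j) * sum_k w_kj * delta_k
// wOut.[k].[j] is the weight from hidden unit j to output unit k.
let hiddenDeltas (h' : float -> float) (aHidden : float list) (wOut : float list list) (deltaOut : float list) =
    aHidden
    |> List.mapi (fun j a ->
        let backSum =
            List.map2 (fun deltaK wRowK -> deltaK * wRowK.[j]) deltaOut wOut
            |> List.sum
        h' a * backSum)
```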


The weight update rule then looks as follows: $$w_{ji}^{(t)} = w_{ji}^{(t-1)} + \Delta w_{ji}$$
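A single gradient-descent step for one weight layer could then look as follows. Again this is only a sketch with names of my own choosing: weights plays the role of the $w_{ji}$, deltas of the $\delta_j$ and z of the activations $z_i$ feeding into that layer.

```fsharp
// One gradient-descent step for a single weight layer:
// w_ji <- w_ji + dw_ji  with  dw_ji = -eta * delta_j * z_i
let updateWeights (eta : float) (weights : float list list) (deltas : float list) (z : float list) =
    List.map2
        (fun wRow deltaJ -> List.map2 (fun wji zi -> wji - eta * deltaJ * zi) wRow z)
        weights deltas
```

Keeping the update as a pure function that returns new weight lists instead of mutating them in place fits the functional setting of this series.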

Tuesday, December 18, 2012

Functional Feed-forward Neural Networks Part I: Setting it up

This is the first of several posts in which I will go into some subjects concerning artificial neural networks (NN) and their functional implementation. Today, I will set up a classical feed-forward neural network. In future posts I will show how to train and use such a network.
I won't go much into the theoretical details of neural networks, as they are covered exhaustively elsewhere. There are plenty of resources you can check out: books, videos, online material... whatever you like. Here is a list of books I can recommend that cover (not solely) NNs:
Although there is a lot of discussion going on about NNs and their widespread use in the fields of Machine Learning, Data Mining, Computational Statistics, Data Analysis and so on, I have seldom seen them in conjunction with functional approaches. That's why I was wondering how the two would fit together.

The Setting

The picture below shows a schematic diagram of a NN, taken from Bishop's PRML book, as I will mostly stick to his nomenclature (the image can be found here):


On the left side is the input of dimension D, in the middle is a so-called hidden layer of dimension M, and on the right side is the output of dimension K. In the picture there is only one hidden layer, but there can be any number of them in a network, and they can all be (and usually are) of different dimensions.
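To give a rough idea of where this is heading, here is a minimal F# sketch of the feed-forward pass through such a network with one hidden layer. The representation (one float list list per weight layer, biases absorbed by prepending a constant 1.0 to each layer's input) and all of the names are assumptions of mine, not necessarily what the implementation in this series will look like:

```fsharp
// Sketch of the feed-forward pass for a D -> M -> K network.
// Each weight layer is a list of rows; row j holds the bias weight of unit j
// followed by its weights for the incoming values.

let sigmoid x = 1.0 / (1.0 + exp (-x))

// Weighted sum of one unit: dot product of its weight row with the bias-extended input.
let dot (ws : float list) (xs : float list) =
    List.map2 (fun w x -> w * x) ws xs |> List.sum

// Evaluate one layer: 1.0 is prepended to the input to absorb the bias weight.
let layer (h : float -> float) (weights : float list list) (input : float list) =
    let x = 1.0 :: input
    weights |> List.map (fun row -> h (dot row x))

// wHidden has M rows of length D+1, wOut has K rows of length M+1.
// Sigmoid hidden units, identity output units (common for regression).
let feedForward (wHidden : float list list) (wOut : float list list) (input : float list) =
    input |> layer sigmoid wHidden |> layer id wOut
```

A call like feedForward wHidden wOut [x1; x2] would then map a two-dimensional input to the K output values.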

Friday, January 27, 2012

A Solution For Project Euler Problem 67 in F#

I'm a fan of the projecteuler site. This site lists some 300+ problems everyone is invited to solve. Some of them are relatively easy, some very tricky. As Wikipedia states:
«... Project Euler (named after Leonhard Euler) is a website dedicated to a series of computational problems intended to be solved with computer programs. The project attracts adults and students interested in mathematics and computer programming. As of 24 January 2012, it includes 368 problems of varying difficulty, each solvable in less than a minute using an efficient algorithm on a modestly powered computer. ...»
From time to time I stop by, pick one or more of the problems and try to implement them in different languages. This is always a very interesting experience because it teaches me a lot about the differences between programming languages, their constructs and, of course, the pros and cons of different programming styles. This, by the way, can also be a very good and fun way to get your hands dirty while learning a new language.

The Problem

Recently I stumbled upon an interesting problem: problem number 67. Given a triangle of numbers, you are to find the maximum total from top to bottom, starting at the top and moving to adjacent numbers one level below while adding up the respective values. Here is the example from the projecteuler.net site:
   3
  7 4
 2 4 6
8 5 9 3

The maximum total in this example is 3 + 7 + 4 + 9 = 23.
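As a quick illustration (and not yet a solution for the full problem), the example triangle can be held as an F# list of lists; a naive recursive search over both possible moves reproduces the total of 23, but its running time grows exponentially with the number of rows. The names are placeholders of mine:

```fsharp
// The example triangle as a list of rows.
let triangle = [ [3]; [7; 4]; [2; 4; 6]; [8; 5; 9; 3] ]

// Naive search: from position i in one row you may move to position i
// or i + 1 in the row below. Fine for four rows, hopeless for large triangles.
let rec maxTotal (rows : int list list) (i : int) =
    match rows with
    | [] -> 0
    | row :: rest -> row.[i] + max (maxTotal rest i) (maxTotal rest (i + 1))

// maxTotal triangle 0 evaluates to 23
```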

On the way to solve that bastard

In fact, problem 67 is the same as problem 18, with the small but, as you will see in a moment, very important difference of a much bigger problem size.
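One common way to cope with the larger instance (a sketch of mine, not necessarily the approach taken in the rest of this post) is to collapse the triangle from the bottom row upwards: every entry of a row is replaced by its value plus the better of the two entries below it, which reduces the work to O(n²) additions for n rows instead of exponentially many paths:

```fsharp
// Combine a row with the already-reduced row below it: every entry keeps
// the better of its two possible continuations.
let combine (row : int list) (below : int list) =
    row |> List.mapi (fun i v -> v + max below.[i] below.[i + 1])

// Fold the triangle bottom-up; the single remaining value is the maximum total.
let maxTotalFast (rows : int list list) =
    match List.rev rows with
    | [] -> 0
    | last :: above -> above |> List.fold (fun acc row -> combine row acc) last |> List.head
```

Applied to the small example from above, maxTotalFast triangle again yields 23, and the same approach handles the one hundred rows of problem 67 without any trouble.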