Back Prop Problem

3 comments, last by Degski 18 years ago
I am writing a neural network that takes 64 inputs (all normalised to the range 0-1) and tries to reproduce them at the output. It has 16 hidden neurons. I am using simple back-prop to train it, but it only seems to cycle between two or three different weight vectors. I initialised the network with small random weights ((double)rand()/RAND_MAX). What could be the problem? My textbook uses conjugate gradient for training, but I don't want to use it, as I don't understand the maths behind it. I am calculating the gradient vector with respect to the weight vector and then using w += grad w * learning rate as the modification step. Am I trying to train too much with a rather simplistic algorithm? Or is the problem with my algorithm?

double Layernet:: find_grad(double *input)
{
int n = n_hid * (n_in + 1) + n_out * (n_hid + 1); //total number of weights, biases included

	   
	for (int j = 0; j <n; j++)
		grad[j] = 0;

	double *hidgrad = grad;
	//double *outgrad = grad + n_hid * (n_in + 1);

	double diff; double error = 0.0;
	
	
	for (int k=0;k<n_out;k++)
	{
	diff = input[k] - out[k];
        error += diff * diff;
        outdelta[k] = diff * act_deriv (out[k]);
	}
	int l;

	//calc o/p gradient, of the weights connecting Hidden to Output
	double delta;
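	for (int k=0;k<n_out;k++)
	{
		delta = outdelta[k];
		//offset of output unit k's weights within grad
		int base = n_hid * (n_in + 1) + k * (n_hid + 1);
		for ( l=0; l< n_hid;l++)
			grad[base + l] = delta * hid[l];
		//bias gradient goes in the last slot of the row, matching out_coeffs
		grad[base + n_hid] = delta;
	}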

	int i,jj,k;

        //the hidden grads now
	for (i=0;i<n_hid;i++)
	{
		delta = 0;

		for (jj=0;jj<n_out;jj++)
		 delta += outdelta[jj] * out_coeffs[jj * (n_hid + 1) + i];
		delta *= act_deriv (hid[i]); //derivative at hidden unit i, not the array pointer
         
        for (k=0; k<n_in; k++)
			*hidgrad++ = delta * input [k];
		*hidgrad++ = delta;
	}
return error / (double (n_out));
}

void Layernet::modify_weights ()
{
	int i,j;

	for (i=0; i<n_hid;i++)
	{
	for (j=0; j<=n_in;j++)
	hid_coeffs[j + i * (n_in + 1)] += .4* grad [j + i * (n_in + 1)] ;
	}

	double* newgrad = grad + n_hid * (n_in + 1);
	
	for (i=0; i<n_out; i++)
	{
	for (j=0; j<=n_hid;j++)
	out_coeffs[j + i*(n_hid + 1)] += .4 * newgrad [j + i*(n_hid + 1)] ;
	}
	
}
Thanks.
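One thing that may be worth checking, offered as an assumption rather than a diagnosis: (double)rand()/RAND_MAX yields values in [0, 1], so every initial weight is non-negative, and with 64 inputs the weighted sums can become large enough to saturate sigmoid units. A minimal sketch of a symmetric initialisation follows; init_weights, range and seed are hypothetical names, not part of Layernet.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of Layernet): fill a weight array with small
// values drawn uniformly from [-range, range), so the weights start out with
// mixed signs and small magnitudes.
std::vector<double> init_weights(std::size_t count, double range, unsigned seed)
{
    assert(range > 0.0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(-range, range);
    std::vector<double> w(count);
    for (double &x : w)
        x = dist(gen);
    return w;
}
```

For the network in the post, count would be n_hid * (n_in + 1) for the hidden layer and n_out * (n_hid + 1) for the output layer; a range such as 0.1 is only a starting guess.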
Some debugging has shown that the total error of the network, the sum of (target[k] - observed[k])^2, actually starts increasing after about 100 iterations. How can that happen? Any clues?
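A rising error can come from a wrong gradient as well as from too large a step, so one way to separate the two is a finite-difference gradient check. The sketch below is generic and not tied to Layernet; numeric_grad, max_grad_mismatch and error_fn are hypothetical names. Note that grad as computed in find_grad is the negative gradient and omits the factor 2 from differentiating diff * diff, so it would be compared against -0.5 times the numeric derivative of the unnormalised summed squared error.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Central-difference estimate of dE/dw for every weight in w.
std::vector<double> numeric_grad(
    const std::function<double(const std::vector<double> &)> &error_fn,
    std::vector<double> w, double eps = 1e-5)
{
    std::vector<double> g(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        const double saved = w[i];
        w[i] = saved + eps;
        const double e_plus = error_fn(w);
        w[i] = saved - eps;
        const double e_minus = error_fn(w);
        w[i] = saved;  // restore before moving on to the next weight
        g[i] = (e_plus - e_minus) / (2.0 * eps);
    }
    return g;
}

// Largest absolute difference between an analytic and a numeric gradient.
double max_grad_mismatch(const std::vector<double> &analytic,
                         const std::vector<double> &numeric)
{
    assert(analytic.size() == numeric.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < analytic.size(); ++i)
        worst = std::fmax(worst, std::fabs(analytic[i] - numeric[i]));
    return worst;
}
```

If the mismatch is tiny for a handful of random weight vectors, the gradient code is probably fine and the learning rate is the next suspect.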
The code's a bit fuzzy to me, but first off, you don't need to calculate the derivative of the activation separately; you just need to keep the outputs of the units (which you'll need anyway, and that's where your problem is). For a sigmoid unit the derivative is output * ( 1 - output ), so forget the explicit gradients.

Then, w += grad w * learning is not complete, should be

w += grad w * learning * error * input

The error is determined by the outputs (that's why you need to keep them). For the output layer it is (desired output - output), and for the hidden layer(s) it is calculated backward through the network (the reverse of how you calculated the outputs going forward).

In linear algebra terms it's like this:

going forward, for all layers

sigmoid( weight-matrix * input ) = output (the input of the next layer)

at the output layer

output error = ( desired output - output )

then backwards, for all layers

delta = output error * output * ( 1 - output ) (element-wise)

weight-matrix-transposed * delta = input error (the output error of the previous layer)

and

weight adjustment = learning * delta * input (outer product)

This is probably still not very clear, but looking at it as matrix/vector products simplifies your implementation tremendously, and you might be able to use an efficient linear algebra package. If you use matrix-matrix multiplies with the same scheme, you can process several inputs at the same time (for offline training), which is more efficient. Linear algebra packages usually do a better job of optimising one matrix-matrix multiply than consecutive matrix-vector multiplies. Speed is really the crux!
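The matrix/vector view can be sketched per layer as below. This is a minimal illustration under stated assumptions, not anyone's actual implementation: the error is taken as (desired output - output) so that the adjustment is added, the activation is a sigmoid, biases are left out for brevity, and Layer, forward and backward are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Row-major weight matrix: rows = units in this layer, cols = inputs to it.
struct Layer {
    std::size_t rows, cols;
    Vec w;  // rows * cols weights (biases omitted in this sketch)
};

// Forward: output = sigmoid( W * input )
Vec forward(const Layer &L, const Vec &input)
{
    assert(input.size() == L.cols);
    Vec out(L.rows);
    for (std::size_t r = 0; r < L.rows; ++r) {
        double sum = 0.0;
        for (std::size_t c = 0; c < L.cols; ++c)
            sum += L.w[r * L.cols + c] * input[c];
        out[r] = 1.0 / (1.0 + std::exp(-sum));
    }
    return out;
}

// Backward: takes this layer's output error, returns the error for the
// previous layer (W^T * delta) and applies w += learning * delta * input.
Vec backward(Layer &L, const Vec &input, const Vec &output,
             const Vec &error, double learning)
{
    assert(input.size() == L.cols);
    assert(output.size() == L.rows && error.size() == L.rows);
    Vec prev_error(L.cols, 0.0);
    for (std::size_t r = 0; r < L.rows; ++r) {
        const double delta = error[r] * output[r] * (1.0 - output[r]);
        for (std::size_t c = 0; c < L.cols; ++c) {
            // propagate with the weight as it was before this update
            prev_error[c] += L.w[r * L.cols + c] * delta;
            L.w[r * L.cols + c] += learning * delta * input[c];
        }
    }
    return prev_error;
}
```

For a two-layer network, call forward on the hidden layer and then the output layer, compute the output error, and call backward on the output layer first, feeding its return value in as the hidden layer's error.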







Quote:
Then, w += grad w * learning is not complete, should be

w += grad w * learning * error * input


Page 163 of Neural Networks (Simon Haykin, 2nd Edition) shows

del w(i,j) = -eta * grad    (eqn 4.12)

where grad = partial derivative of the error function wrt w(i,j)

This equation should translate into del w = learning rate * grad, where the minus sign has already been folded into grad (so grad here is the negative gradient).
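The two sign conventions can be compared on a single sigmoid unit. The sketch below assumes the error function E = 0.5 * (t - o)^2 with o = sigmoid(w * x); sigmoid, step_minus_gradient and step_plus_negative_gradient are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>

double sigmoid(double s) { return 1.0 / (1.0 + std::exp(-s)); }

// Convention A: compute dE/dw explicitly, then step with w - eta * dE/dw.
double step_minus_gradient(double w, double x, double t, double eta)
{
    const double o = sigmoid(w * x);
    const double dE_dw = -(t - o) * o * (1.0 - o) * x;
    return w - eta * dE_dw;
}

// Convention B: fold the minus sign into the error term (t - o), then step
// with w + eta * (negative gradient).
double step_plus_negative_gradient(double w, double x, double t, double eta)
{
    const double o = sigmoid(w * x);
    const double neg_grad = (t - o) * o * (1.0 - o) * x;
    return w + eta * neg_grad;
}
```

Both functions return the same new weight; which form appears in a given text depends only on whether the error is written as target minus output or output minus target.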
Well, I don't want to argue with you. I looked it up in my handbook, R. Rojas, Neural Networks: A Systematic Introduction; pages 165-167 give the above information. All I can say is that my networks work and you seem to have a problem. Have a look at this link: http://www.dontveter.com/bpr/public2.html. There's a detailed numerical example, so that should help.
