Using GAs to find initial weights

Started by
5 comments, last by kirkd 18 years, 3 months ago
When we use a GA to find the initial weights of a NN, I understand that we start with a random population of weight sets and then, using the number of epochs required for convergence as the fitness function, evolve/breed a new set of weights. So now we have a second population, but we still don't know what the optimal (or near-optimal) weights are. Repeating the cycle to generate another population, and then another, and so on, will only waste time, given that we are also usually running a gradient-descent method inside the fitness function. So is the price one needs to pay to find a global minimum really so computationally expensive, or have I got something wrong? Such a process, which involves (1) generate a population, (2) see the number of epochs taken to converge, (3) crossover, (4) mutate, (5) get a new generation, (6) repeat... is too much.

I am sure there is a gap in my understanding; please help me out. What I mean is that I must have got the method wrong: GAs must be used in some other way, one which is a nice trade-off between the extra computation and the benefit of achieving a global minimum of the error function. But I just can't seem to find out how.
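To make the cost concrete, here is a minimal, hypothetical sketch of the loop described above. A single linear neuron stands in for the network, and all data and parameter values are made up; the point is that every single fitness evaluation is a complete gradient-descent training run.

```python
import random

random.seed(0)

# Toy stand-in for a network: one linear neuron learning y = x1 + x2
# (the data and all parameters here are hypothetical).
DATA = [((x1, x2), x1 + x2) for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 0.0, 1.0)]

def epochs_to_converge(w, lr=0.1, tol=1e-3, max_epochs=500):
    """Run gradient descent from initial weights w; the epoch count is the
    fitness (lower = fitter). Note this is a FULL training run per
    individual -- exactly the cost being questioned."""
    w = list(w)
    for epoch in range(1, max_epochs + 1):
        sse = 0.0
        for (x1, x2), y in DATA:
            err = w[0] * x1 + w[1] * x2 - y
            sse += err * err
            w[0] -= lr * err * x1
            w[1] -= lr * err * x2
        if sse < tol:
            return epoch
    return max_epochs

def evolve(pop_size=20, generations=10):
    pop = [[random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=epochs_to_converge)        # (2) evaluate fitness
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2.0 for x, y in zip(a, b)]        # (3) crossover
            child = [w + random.gauss(0.0, 0.1) for w in child]  # (4) mutation
            children.append(child)
        pop = parents + children                # (5) new generation
    return pop[0]                               # best initial weights found

best = evolve()
print(epochs_to_converge(best))
```

Even in this tiny example, the GA runs pop_size x generations full training sessions, which illustrates why the overhead feels excessive.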
If I read this correctly, your fitness function is the number of epochs needed to converge and you want to minimize this, correct? I would guess that the result would be that your "starting weights" would effectively tend toward your final weights. The weights of the network would essentially tend toward a minimum. The end result is that you're evolving the gradient descent determined weights.

The question still remains, is this a global minimum or a local? Under this scheme (as with most any other) you won't know.

-kirk

I've never heard of using GAs to find initial weights for a network in the way you're describing.

Usually the fitness function is just based on the error from the network, or on the performance of the network at whatever it is supposed to do. The GA is used as the learning strategy by itself, without backpropagation or other methods. I guess that after getting a satisfactory solution with a GA, a gradient-descent algorithm like backpropagation could be used, but I think the GA is usually considered capable of the fine-tuning needed anyway.

Using a genetic algorithm to set the weights and using the number of training epochs needed for the network to converge is nonsense. You should just use the magnitude of the errors as the fitness function directly.
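A minimal sketch of that idea, on a made-up toy problem (a single linear neuron stands in for the network; all names and values are hypothetical): the GA is the only learning strategy, and the raw error is the fitness — no backpropagation anywhere.

```python
import random

random.seed(1)

# Hypothetical toy data: learn y = x1 - x2 with one linear neuron.
DATA = [((x1, x2), x1 - x2) for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 0.0, 1.0)]

def mse(w):
    """Mean squared error of the net -- used directly as (inverse) fitness."""
    return sum((w[0] * x1 + w[1] * x2 - y) ** 2 for (x1, x2), y in DATA) / len(DATA)

def evolve(pop_size=30, generations=40):
    pop = [[random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=mse)                       # lower error = fitter
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2.0 for x, y in zip(a, b)]        # crossover
            child = [w + random.gauss(0.0, 0.1) for w in child]  # mutation
            children.append(child)
        pop = parents + children
    pop.sort(key=mse)
    return pop[0]

best = evolve()
print(mse(best))
```

Each fitness evaluation here is a single cheap forward pass over the data, rather than a whole training run.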
Thanks.
That sounds very reasonable: using the GA to find weights that minimize the error function directly. This would mean I wouldn't need any back-propagation to fine-tune it any further. When I suggested the idea, I was actually referring to the paper "Feedforward Neural Network Initialization: an Evolutionary Approach". Find the paper here. You could take a look and may be able to tell me what they actually meant.
For my purpose, however, I shall use the approach outlined in one of the replies: a GA with a fitness function inversely proportional to the mean squared error, and no back-prop.
But I think using simulated annealing would mean that I still have to use back-prop. Or is simulated annealing done without any back-prop too? I mean, at least as a program that sounds easier than interleaving it with back-prop.
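Simulated annealing can indeed be run on the weights alone, with no back-prop involved. A hypothetical toy sketch (the same made-up single-linear-neuron problem standing in for a real network):

```python
import math
import random

random.seed(2)

# Hypothetical toy target: y = x1 + 2*x2, one linear neuron.
DATA = [((x1, x2), x1 + 2.0 * x2)
        for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 0.0, 1.0)]

def mse(w):
    return sum((w[0] * x1 + w[1] * x2 - y) ** 2 for (x1, x2), y in DATA) / len(DATA)

def anneal(steps=2000, t0=1.0, cooling=0.995):
    """Simulated annealing over the weights -- no gradients anywhere."""
    w = [random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0)]
    err = mse(w)
    best, best_err = list(w), err
    t = t0
    for _ in range(steps):
        cand = [wi + random.gauss(0.0, 0.2) for wi in w]  # random perturbation
        cand_err = mse(cand)
        # Metropolis rule: always take improvements, occasionally take
        # uphill moves so the search can escape local minima.
        if cand_err < err or random.random() < math.exp((err - cand_err) / t):
            w, err = cand, cand_err
            if err < best_err:
                best, best_err = list(w), err
        t *= cooling  # cooling schedule
    return best, best_err

best, best_err = anneal()
print(best_err)
```

Because it only perturbs weights and compares errors, it is a single-solution cousin of the GA approach and is indeed simpler to program than interleaving with back-prop.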

I read the paper and the approach seems somewhat reasonable, but it still seems unnecessary. I would think that a properly implemented GA would find an equally good solution without an excessive computational cost.

It would be a good experiment to compare the two methods and see which gives a better cross-validated result in terms of error on a test set of data.

-Kirk

Yes, Kirk, I think some experimentation could lead to more concrete conclusions on this. But all said and done, I think the computational overhead is a bit excessive. GAs + Back Prop for each chromosome in the population!

And yes, if I wanted to actually try this out, what kind of data do you think would be most appropriate? This is a question my textbooks never answer. I might, let's say, try to filter noisy signals and obtain some results on the NN's effectiveness/performance (trained once with only a GA, and a second time with GA + back-propagation). But someone else might try it on time-series data, or for pattern recognition. Is it possible that this other person might end up with *very* different results (regarding the performance of the trained nets)? The answer to this will be very helpful for my project work, since in the limited time I have, I can test and draw conclusions on maybe just one form of data (or one application of the NN).

Thanks.
I'm biased toward pattern recognition, but that's just my bent. If you wanted to go that route, you can easily find tons of data at the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). There are a large number of data sets there with predetermined features, and some even give splits into training/test sets. Even better, the data sets are free to use.

I agree that the computational overhead of the combined system is pretty big. But, is there a balance? Using GA alone would take longer to find the minimum if it actually found it at all, hence more generations/evaluations. On the other hand, gradient methods will find the local minimum quickly, but you'd have to use it on every member of the population. Don't you love the No Free Lunch Theorem?

-Kirk
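The balance Kirk describes — a few gradient steps refining every population member each generation (sometimes called a memetic or Lamarckian scheme) — might look like this toy, hypothetical sketch (again a single linear neuron and made-up data standing in for a real network):

```python
import random

random.seed(3)

# Hypothetical toy target: y = 2*x1 + x2, one linear neuron.
DATA = [((x1, x2), 2.0 * x1 + x2)
        for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 0.0, 1.0)]

def mse(w):
    return sum((w[0] * x1 + w[1] * x2 - y) ** 2 for (x1, x2), y in DATA) / len(DATA)

def gradient_steps(w, k=5, lr=0.1):
    """A few gradient-descent epochs -- the per-member cost Kirk mentions."""
    w = list(w)
    for _ in range(k):
        for (x1, x2), y in DATA:
            err = w[0] * x1 + w[1] * x2 - y
            w[0] -= lr * err * x1
            w[1] -= lr * err * x2
    return w

def memetic(pop_size=10, generations=5):
    pop = [[random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop = [gradient_steps(w) for w in pop]  # local search on EVERY member
        pop.sort(key=mse)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            children.append([(x + y) / 2.0 + random.gauss(0.0, 0.1)
                             for x, y in zip(a, b)])
        pop = parents + children
    pop.sort(key=mse)
    return pop[0]

best = memetic()
print(mse(best))
```

The GA supplies global exploration while the gradient steps supply fast local refinement — at the price of running gradient descent on every member, which is exactly the trade-off under discussion.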

