• ### Announcements

#### Archived

This topic is now archived and is closed to further replies.

# Chromosome encoding

## Recommended Posts

edotorpedo    198
Simple question My chromosome is made up of a string which can contain 4 different symbols. Should I encode these symbols by 00 01 10 11 or is it better like this: 0001 0010 0100 1000 (I think this approach is better, but I don''t know why) Thanks, Edo

#### Share this post

##### Share on other sites
Guest Anonymous Poster
Well, if you use integers instead, the second case make it easier to use bitmasks. Better still would be to use an enum - typically evaluated as an int but with the easier to remember name scheme
enum chromosomes { a = 2 b = 4 c = 8 d = 16};

#### Share this post

##### Share on other sites
Timkin    864
The only key issue you need to consider is that if you choose a binary encoding (be it in integer form, or packed bit string form) you need to ensure that any binary string that can be generated via the genetic operators (applied to the population) decodes to a valid value of the appropriate state variable.

If you were to use the second option you listed for your encoding, you would need to restrict crossover to limited places in the chromosome (allele boundaries). You would also need to implement an unusual mutation operator that was more complex than a bit-flip.

You could use the first encoding in a packed bit string and use bit masks to perform crossover. Mutation would be a simple bit flip.

Cheers,

Timkin

#### Share this post

##### Share on other sites
alexjc    457
Representation is KEY! A well designed model can outperform a GA even with a random search (ok, not always, but you get the point

If they''re four different concepts, then use one gene for each - single a bit will do fine. It''s the same variable with different values, then I find the integer approach is fine... as Timkin said, just make sure your crossover and mutation operators do actually search all the options!

Artificial Intelligence Depot - Maybe it''s not all about graphics...

#### Share this post

##### Share on other sites
polly    532
What the hell is a chomosome? I know what one is in biology, but whats the application in programming...?
Jon

#### Share this post

##### Share on other sites
edotorpedo    198
Thanks for all the replies.

I''m using this scheme to perform some kind of path finding (to solve slide puzzles). I therefore use a chromosome made up of 20 genes (if evaluated it can perform 20 moves). Using 0010,0001 etc, I neglect wrong genes and move on to the next. But if the solution requires only 5 moves to be made, wouldn''t the chromosome evolve to a solution in which the first 5 genes are the moves to a correct solution, and the other 15 genes are all incorrect (for example 1111) ?

#### Share this post

##### Share on other sites
alexjc    457
jonpolly:

Chromosome is not really the correct terminology for AI, genotype is prefered...

It''s basically just an array storing lots of genes (data), that can be artificially evolved - using genetic algorithms.

Artificial Intelligence Depot - Maybe it''s not all about graphics...

#### Share this post

##### Share on other sites
Guest Anonymous Poster
You have a bit more than 4 bits - thats what makes life fun.

Your point is crossing to an arena which in the ''molecular computing soup'' is getting rather warm on - perhaps you read about it DNA computing.

For the moment, the concept as your encoding idea sound quite nifty and are well broadcasted. However life - often likes error more than codes. How would or could you program around it. And would you even want to perhaps you could program on genetic error.

JL
www.bionicbros.de

#### Share this post

##### Share on other sites
Timkin    864
Just a note on terminology...

Early literature in the field of genetic computing made liberal use of the parallels between natural genetics and computational genetics when determining appropriate names for data structures and operators.

As in natural genetics, a chromosome in a GA is a collection of genes. The chromosome represents the genotype of the individual (its genetic make up) and when expressed in the environment (decoded) we obtain the phenotype (point in state space). Each gene, being a unit of genetic information, can take on one of a finite set of values, called alleles.

Some people - like myself - still tend to use such terminology when discussing genetic programming techniques, as it leaves little confusion for what is meant. Others though tend to shy away from such strong representational links between natural and artificial computation. I guess I still like to remain true to John Holland''s view of the field (he invented the canonical 3-operator genetic algorithm).

Timkin

#### Share this post

##### Share on other sites
Timkin    864
quote:
Original post by edotorpedo
Thanks for all the replies.

I'm using this scheme to perform some kind of path finding (to solve slide puzzles). I therefore use a chromosome made up of 20 genes (if evaluated it can perform 20 moves).

...so your chromosome represents a plan of action...

quote:
Original post by edotorpedo
Using 0010,0001 etc, I neglect wrong genes and move on to the next.

...do you mean you ignore actions in the plan that cannot be done (like trying to slide a tile into a wall)?

quote:
Original post by edotorpedo
But if the solution requires only 5 moves to be made, wouldn't the chromosome evolve to a solution in which the first 5 genes are the moves to a correct solution, and the other 15 genes are all incorrect (for example 1111) ?

Yes. In this case, you need to interpret your plan as being complete when after any step in the plan, the goal is found. Just perform a quick check on the state of the puzzle - after each action is performed - to ensure that it is, or is not, in the completed state.

Cheers,

Timkin

Edited by - Timkin on December 11, 2001 7:53:42 PM

#### Share this post

##### Share on other sites
edotorpedo    198

Yes,

wrong genes are just ignored. I thought about it, and I think the chromosomes will evolve to a situation in which 15 of 20 genes are all wrong when there are only 5 moves to be made. Especially when such a gene is evaluated with a greater fitness. That''s why I''m going for the encoding scheme using 4 bits in stead of 2, because genes with 2 bits are always correct.

By wrong genes I mean genes that perform illegal moves (into a wall or outside the field), or genes that do not represent a move(1111, 1101, 0011 etc).

Edo

#### Share this post

##### Share on other sites
Timkin    864
Ah, I see what you mean now...

There's absolutely no reason to NOT use a 2-bit encoding. In fact, the only thing a 4-bit encoding will do for you is slow your algorithm down by a very large factor(~n*2number of wasted bits where n is the number of genes in your chromosome) as the mutation and crossover operators are going to waste a lot of time working on these areas. You'll also need more population members to effectively search the larger domain. Additionally, there is unlikely to be a reasonable way to evaluate the fitness of a 'bad allele' as it doesn't map into your objective functions space (which is why you have chosen to ignore them). Stick with the 2-bit encoding for N,S,E,W movement actions. That way, every bit of the chromosome contributes to the solution.

Another question: What are you using as an objective function (fitness evaluation method)?

Timkin

Edited by - Timkin on December 12, 2001 8:38:33 PM

#### Share this post

##### Share on other sites
edotorpedo    198

Well,

I hate to admit it, but you are right, 2 bit encoding can achieve the same goal and is quicker. The fitness function I am using for my slide puzzle is very simple but also very effective (I already solved the slide puzzle problem using backtracking).

The original puzzle that must look like this:
1 2 3
4 5 6
7 8 * (* = emtpy cell in puzzle)
But looks like this:
3 8 1
7 5 4
2 * 6
Now for every cell I calculate how many moves it must perform to reach its original spot in the puzzle. So:
Cell 0,0: 2 moves to the right = +2
Cell 0,1: 2 moves down = +2
Cell 0,2: 2 moves to the left = +2
.
.
Cell 2,1: 1 move to the right, 2 up = +3
I add all those number and the puzzle that has the least number of points (closest to the solution) has the greatest fitness. You can see the original puzzle has 0 number of points, because every cell is on its original spot.

Edo

#### Share this post

##### Share on other sites
kirkd    505
I agree with Timkin that the 2 bit encoding is the best approach. Way too much time will be wasted on invalid 4 bit results. Just my 2-bits..uh....I mean...my 2 cents. 8^)

Another question, in your last post you describe your fitness function, and it appears that what you will evolve is the solution to a specific puzzle, correct? In other words, if I present the resulting solution with a randomly scrambled puzzle, it will likely fail to produce the solved puzzle. Have you considered ways to get around this or to work toward a general solution? Just curious.

-Kirk

#### Share this post

##### Share on other sites
Timkin    864
Just a couple of points that may (or may not) help you improve your solver...

Using the Manhattan (block) distance as you are is a good heuristic for the tile-puzzle. However, it is a heuristic estimate of the cost of completing the puzzle, rather than of the fitness of any particular state of the game board, or fitness of any plan to achieve the goal state.

If there is a 1-1 mapping from cost space to objective space then we may be able to find an objective function to evaluate fitness based on the cost function. Indeed, for many domains, taking the objective function as F(x)=1/C(x), (where C(x) is the cost function and F(x) our objective function) is a good enough mapping to solve the search/optimisation problem.

The benefit of using the above cost function rather than F(x) = MaxValue - C(x) is that you are guaranteed never to have a negative fitness value; which doesnt really make much sense and causes problems for your algorithm... i.e., you need to continually rescale the objective space).

Onto another point...

There is another way of dealing with the fact that a given population member may solve the puzzle using less than the number of steps encoded into its chromosome. Include the null action. Obviously this increases the search space because the branching factor of the search tree is increased form 4 to 5, however it reduces the need to check whether the puzzle is solved after every single step in the plan is evaluated. You can determine which is computationally cheaper.

Finally, if after you get your solver working you would like some optimisations to get your GA to find the solution faster, give me a holler. Many years ago I developed a method to get around the GAs nature of asymptotic convergence to the optimal population member. The speed increases are fairly dramatic.

Cheers,

Timkin

#### Share this post

##### Share on other sites
edotorpedo    198

Well, I''ll try to comment on questions/remarks.

I don''t really know what KirkD means, because I do randomly scramble my puzzle, and let the GA solve it.

I do think this is a real fitness function, because it achieves higher fitness (I already thought of 1/C(x), because I really wouldn''t know what MaxValue in a puzzle like this is) as it becomes closer to the solution. I don''t know if you guys have this in mind, but the moves I refer to are the moves of the blank space in the puzzle.

I originally thought of implementing the null action in my plan of a 4 bit encoded gene. I cannot implement the null action in a 2 bit encoded gene. I do however think that if the puzzle is one move away from being solved, the solution will first bump 19 times (in a 20 gene chromosome) into a wall before making the right move, simply because this solution will have the highest fitness.

About the optimizations, I think I use pretty fast code, while still obtaining OO features. I don''t know how to code inline assembly, but for now I think it will run fine on my 1.2 GHz

#### Share this post

##### Share on other sites
Timkin    864
quote:
Original post by edotorpedo
I originally thought of implementing the null action in my plan of a 4 bit encoded gene. I cannot implement the null action in a 2 bit encoded gene. I do however think that if the puzzle is one move away from being solved, the solution will first bump 19 times (in a 20 gene chromosome) into a wall before making the right move, simply because this solution will have the highest fitness.

Not necessarily. Any chromosome that ends in the final solved state would presumably have the same fitness. This is where the null action has benefit because you could penalise a chromosome for using more moves than necessary to achieve the goal state. Let''s say that you start with a puzzle that has only 1 piece out of place. The optimal plan would be to simply move that piece into its correct place. However there are many plans that move more pieces that arrive at the goal state... but they are not optimal. If you don''t care about the optimality of your plan, then don''t worry about this. If you do, then you need to incorporate the notion of an optimal plan into your solver.

quote:
Original post by edotorpedo
About the optimizations, I think I use pretty fast code, while still obtaining OO features. I don''t know how to code inline assembly, but for now I think it will run fine on my 1.2 GHz

Oh, I didn''t mean code optimisations, I meant algorithmic optimisations. I can guarantee that I can write un-optimised to solve a problem with a GA that will find the solution at least an order of magnitude faster than most optimised code people write. It stems from research I have done into GAs.

Cheers,

Timkin

rtr1129    122
-

Merge2    122
--!

#### Share this post

##### Share on other sites
edotorpedo    198

Well, you''re right (again, grrr.. ) about the different paths to go to a final position. I think I can avoid this by substacting a small number (say 0.0001) from the fitness value for every (legal) move it makes. That way the shorter path of two similar paths (ending in the same final state) will have a higher fitness.

I''m quite interested in your algorithm optimizations. How does it (globally) work. Do you choose 4 in stead of 2 *best* chromosomes from your population. That''s about the only way I can think of on how to achieve a solution by a magnitude faster.

Edo

edotorpedo

#### Share this post

##### Share on other sites
Timkin    864
The way to obtain order of magnitude speed-ups in GAs (or even better) is to understand why GAs have an asymptotic convergence rate. It was generally thought that the selection operator weakens as the population nears convergence to a single - most fit - chromosome and that this slows down the search. Others suggested that poor performing bits were hitchhiking on the backs of very strong bits and that the GA was poor at breaking such correlations once they were significantly strong. Back in 1995 I showed that it was in fact the crossover operator that was at fault.

As the population converges the probability that two parents will produce cloned offspring (i.e., identicical to the parents) increases. It is obvious that when the population has fully converged to a single string that the probability of cloning is 1. What surprises most people is that on a random population before any iterations of the GA the probability of cloning can be as high as 50%! It''s a function of the length of the string.

Clearly then, if we can avoid clones then the search will progress further. The naive approach might be to simply reject clones as they are produced and perform crossover again on the parents. Unfortunately, this doesn''t actually prevent asymptotic convergence - it merely alters its rate - as a lot of time will be spent finding and rejecting clones. You might get a factor of 2 increase in performance of the algorithm, but thats about it.

The solution then is to prevent the population from converging! But, you say, this is going against the whole point of the algorithm! And you''d be right! I don''t have the space here for a full explanation but basically you prevent the population from converging by removing individual bits (from every population member) that have already converged (over a reasonable threshold: like 95%). Chromosomes will converge from most significant bit (MSB: the bit that contributes the most to the fitness evaluation) down to the least significant bit (LSB: which contributes the least). To maintain the effectiveness of the algorithm, you stick a new bit (randomly selected) into a new position in the string such that the new position is now the least significant bit. Ie, you add as little perturbation to the string as possible while changing the information it encodes.

The bits that you pull out you ''lock'' into your solution string.

The beauty of this approach - which I called Bit Stripping - is that you can run it for as long as you like to obtain whatever level of accuracy you desire for your optimisation problem.

I did lots of tests against standardised sets of test problems and found that my GAs outperformed standard GAs over the whole suit of problems. Performance benefits varied from several orders of magnitude on uni-model problems to same order of magnitude on multimodal stochastic functions. I certainly proved that you were no worse off for employing my method and that it was fairly easy to do so.

Additionally, I re-derived Holland''s Schema Theorem to include the effects of cloning. The resulting formula was as elegant as Holland''s original and more accurately explained the asymptotic convergence rate of Holland''s 3 operator GA.

If you have any specific questions, just holler.

Timkin

#### Share this post

##### Share on other sites
Geta    136
quote:
Original post by Timkin

I did lots of tests against standardised sets of test problems and found that my GAs outperformed standard GAs over the whole suit of problems. Performance benefits varied from several orders of magnitude on uni-model problems to same order of magnitude on multimodal stochastic functions. I certainly proved that you were no worse off for employing my method and that it was fairly easy to do so.

Additionally, I re-derived Holland''s Schema Theorem to include the effects of cloning. The resulting formula was as elegant as Holland''s original and more accurately explained the asymptotic convergence rate of Holland''s 3 operator GA.

If you have any specific questions, just holler.

Timkin

Do you have a pointer to your published paper on this?

Thanks,

Eric

#### Share this post

##### Share on other sites
Timkin    864
quote:
Original post by Geta

Do you have a pointer to your published paper on this?

On the (mis-guided) advice of my then supervisor I did not publish this work externally (only internally). This was based on the fact that two other researchers (Shraudolph & Belew) had devised a similar, but computationally more complex method that achieved similar results. In hindsight, their method was not based on the same understanding as mine and I probably could have published... maybe I still should!

Anyway, I only have hardcopies of the thesis. If you''re desperate for a copy, email me a snail-mail address and I will dispatch you a copy.

Cheers,

Tim