Representing decision trees as 'genes'

choffstein    1090
I have been working on a design that uses decision trees built from nodes of logical operators (and, or, not), specific actions, and descriptors. At the end of the day, everything evaluates to 'true' or 'false' to determine whether action should be taken. The plan was to create this tree randomly, choosing ands/ors/nots at random to link together the specific actions and descriptors. The tree size would be arbitrary -- random, with the probability of further growth decreasing (exponentially, in fact) as the tree gets bigger.

Now my issue is this: each of these decision trees represents a possible thought process, and I am trying to optimize that thought process. My first thought was genetic algorithms. I figured I could use constant values to represent the specific actions and logical operators to create a gene sequence (simply storing the binary tree in an array for mating and mutations). The only issue is that, with the trees being arbitrarily long, mating would be impossible -- unless I capped the size at N levels deep and just made an array long enough, using the value 0 for children of a leaf.

The other possibility was this: for each tree created, parse all subtrees into parts and store them in a 'parts' drawer. Each part's weight would be the aggregate fitness of all the trees it appears in, and the higher the weight on a part, the better its chance of being selected for the next generation. This implementation seems far more costly memory-wise than the first, and a bit harder to introduce 'mutations' into (though I suppose it could be done).

Does anybody have any ideas on how to tackle this beast? I know it is a bit obscure, but for the sake of the project I am working on, I can't really say much more. Thanks!
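
To give a rough idea of the random-growth part, here is a minimal Python sketch. The particular leaf tests, the 0.9 base probability, and the 0.6 decay constant are all placeholders I made up for illustration, not part of my actual design:

```python
import random

# Leaf tests the tree can choose from; these predicates just stand in
# for the real "actions and descriptors".
LEAF_TESTS = [
    ("x > 0", lambda x: x > 0),
    ("x < 5", lambda x: x < 5),
    ("x % 2 == 0", lambda x: x % 2 == 0),
]

class Node:
    def __init__(self, op, children=(), test=None, label=""):
        self.op = op              # 'and', 'or', 'not', or 'leaf'
        self.children = list(children)
        self.test = test          # callable for leaves
        self.label = label

    def evaluate(self, x):
        if self.op == "leaf":
            return self.test(x)
        if self.op == "not":
            return not self.children[0].evaluate(x)
        results = [c.evaluate(x) for c in self.children]
        return all(results) if self.op == "and" else any(results)

def random_tree(depth=0, grow_prob=0.9, decay=0.6):
    """Grow a random tree; the chance of adding another operator node
    falls off exponentially (grow_prob * decay**depth) with depth."""
    if random.random() > grow_prob * (decay ** depth):
        label, test = random.choice(LEAF_TESTS)
        return Node("leaf", test=test, label=label)
    op = random.choice(["and", "or", "not"])
    arity = 1 if op == "not" else 2
    kids = [random_tree(depth + 1, grow_prob, decay) for _ in range(arity)]
    return Node(op, children=kids)

if __name__ == "__main__":
    tree = random_tree()
    print([tree.evaluate(x) for x in [1, 5, -11, 3, 7, 9]])
```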

Vorpy    869
Well, the first thing I'd do is search Google for terms like "genetic algorithm" and "decision tree". That turns up several papers, available online, on approaches that have been tried for optimizing decision trees with genetic algorithms.

Usually with genetic algorithms on graphs or trees, the mutation and crossover operators are designed to act on entire subtrees. Altering a single node allows that particular node to be optimized, but performing crossover on nodes that merely share the same position in the binary tree is almost pointless, since different trees may be using that part of the tree for entirely different purposes. I think I've seen papers that track how well particular subtrees or subgraphs perform, in the context of evolving neural network topologies.

Timkin    864
I seem to recall that one of our members (kirkd) has done some work on evolving decision trees. You might consider sending him a pm.

Cheers,

Timkin

AngleWyrm    554
Boolean Algebra (pdf) might be useful for simplifying random expressions.

Summary: ( OR = ∨, AND = ∧, NOT = ¬ )

  • OR is closed: ∀ x, y ∈ B, x ∨ y ∈ B
  • 0 is the identity for OR: x ∨ 0 = 0 ∨ x = x
  • OR is commutative: x ∨ y = y ∨ x
  • OR distributes over AND: x ∨ (y ∧ z) = (x ∨ y) ∧ (x ∨ z)
  • AND is closed: ∀ x, y ∈ B, x ∧ y ∈ B
  • 1 is the identity for AND: x ∧ 1 = 1 ∧ x = x
  • AND is commutative: x ∧ y = y ∧ x
  • AND distributes over OR: x ∧ (y ∨ z) = (x ∧ y) ∨ (x ∧ z)
  • Complement (a): x ∨ ¬x = 1
  • Complement (b): x ∧ ¬x = 0

Theorems:
  1. x ∨ 1 = 1
  2. x ∧ 0 = 0
  3. ¬0 = 1
  4. ¬1 = 0
  5. x ∨ x = x
  6. x ∧ x = x
  7. ¬(¬x) = x
  8. x ∨ x ∧ y = x
  9. x ∧ (x ∨ y) = x
  10. x ∨ ¬x ∧ y = x ∨ y
  11. x ∧ (¬x ∨ y) = x ∧ y
  12. x ∨ (y ∨ z) = (x ∨ y) ∨ z
  13. x ∧ (y ∧ z) = (x ∧ y) ∧ z
  14. ¬(x ∨ y) = ¬x ∧ ¬y
  15. ¬(x ∧ y) = ¬x ∨ ¬y
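
For instance, theorem 10 follows directly from the axioms:

  x ∨ (¬x ∧ y)
    = (x ∨ ¬x) ∧ (x ∨ y)    [OR distributes over AND]
    = 1 ∧ (x ∨ y)           [complement (a)]
    = x ∨ y                 [1 is the identity for AND]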

MDI    266
How is your data represented? Do you have a set of pairs, say (X, True), (Y, True), (Z, False) where X, Y, Z are inputs and the second component of your pair is the output?

There's probably a better way to do this than using GAs.

choffstein    1090
Well, because I am utilizing this method for pattern matching (in a broad sense of the word). For example, I want to be able to evolve a tree that can decide whether or not a certain number is within a range. So let's say I give it the range -10 to 10, and then an array like [1, 5, -11, 3, 7, 9]. I want it to go through those numbers and give back a true/false (1/0) for each, based on whether or not it believes the number to be within the range. The fitness function of the tree is basically how many it got right, combined with the size of the tree (with much higher weighting on the number correct).
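
Roughly, the kind of fitness function I have in mind looks like this Python sketch -- the weights and the hand-written "tree" are pulled out of thin air just for illustration:

```python
def fitness(predict, tree_size, samples, lo, hi,
            correct_weight=10.0, size_penalty=0.1):
    """Score a candidate tree: reward correct in-range decisions heavily,
    penalize tree size lightly.  predict(x) is the tree's boolean output."""
    correct = sum(predict(x) == (lo <= x <= hi) for x in samples)
    return correct_weight * correct - size_penalty * tree_size

# Example: a hand-written "tree" that checks x > -10 AND x < 10.
candidate = lambda x: (x > -10) and (x < 10)
print(fitness(candidate, tree_size=3,
              samples=[1, 5, -11, 3, 7, 9], lo=-10, hi=10))
```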

So when you say 'how is your data represented' I am a bit confused. I know you probably think a neural network is more appropriate for this type of pattern classification and learning, but I have more than oversimplified what I am doing (because I cannot say much more).

Thanks.

MDI    266
No, I was thinking of building the tree directly, not having to evolve it, using the decision tree learning algorithm. There's probably an outline of it in AI:AMA.

However, if you're sure you want to use GAs, I can see possible ways of getting around the problem of differently sized trees. One obvious way would be to cap the depth of the tree and keep it constant. Of course, the complexity of your boolean function will determine the size of your trees to some extent. I think this is the method Koza used when evolving Lisp programs, although I may be misremembering.

kirkd    505
Thanks for remembering, Timkin!

I have indeed worked on evolving decision trees. I won't go into the extreme details of it all here (unless you ask), but the general idea I used was to represent the tree as a linked structure of nodes. Each node in the tree has three links - parent, left_child, and right_child. The root node has parent set to NULL, and each terminal (leaf) has left_child and right_child set to NULL. This type of structure makes parsing the tree very easy. As for crossover, or mating of two trees, all you have to do is choose a node from each tree at random and then swap their parent linkages (updating the parents' child links to match). This effectively swaps the two subtrees.
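
In rough Python, the structure and the swap look something like this -- a sketch of the idea, not my actual code:

```python
import random

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.parent = None
        self.left = None
        self.right = None
        if left is not None:
            self.attach("left", left)
        if right is not None:
            self.attach("right", right)

    def attach(self, side, child):
        """Set a child link and keep the child's parent link consistent."""
        setattr(self, side, child)
        if child is not None:
            child.parent = self

    def nodes(self):
        """All nodes in this subtree (root included)."""
        out = [self]
        for child in (self.left, self.right):
            if child is not None:
                out.extend(child.nodes())
        return out

def crossover(tree_a, tree_b, rng=random):
    """Pick a random node in each tree and swap the subtrees rooted there
    by rewiring the parent links (and the parents' child links)."""
    node_a = rng.choice(tree_a.nodes())
    node_b = rng.choice(tree_b.nodes())
    parent_a, parent_b = node_a.parent, node_b.parent

    for parent, old, new in ((parent_a, node_a, node_b),
                             (parent_b, node_b, node_a)):
        if parent is not None:
            side = "left" if parent.left is old else "right"
            parent.attach(side, new)
        else:
            new.parent = None   # the swapped-in node becomes a new root

    # Return the (possibly new) roots of the two offspring.
    root_a = tree_a if parent_a is not None else node_b
    root_b = tree_b if parent_b is not None else node_a
    return root_a, root_b
```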

I'm more than happy to answer any questions you may have, if I can. As I mentioned, I've been doing decision tree evolution for a long time, so I've probably bumped up against some of the obstacles you'll be bumping up against.

-Kirk

Just in case you're interested, here's a link to the publication describing this work: http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2004/44/i03/abs/ci034188s.html

Fisco    122
I would have a good look at first order logic. You could also build decision trees directly from an example set. Both methods, along with other algorithms for constructing good decision trees, can be found in "Artificial Intelligence: A Modern Approach" by Russell and Norvig.

AngleWyrm    554
Quote:
Original post by visage
The other possibility was this: for each tree created, parse all subtrees into parts and store them in a 'parts' drawer. Each part's weight would be the aggregate fitness of all the trees it appears in.

The concept of a gene within the DNA strand could be modelled using the Boolean theorems, and may produce better 'parts' than a purely random approach.

For instance, a section of one tree might read:
"...x OR NOT x AND y..." which could be swapped with:
"...x OR y..." because they are functionally identical.
sunny OR NOT sunny AND hot = sunny OR hot.

Then the concept of parts makes some sense, so swapping can be done by gene-phrases. The "x OR y" gene could be inherited in place, or it could be exchanged for a different gene, like maybe "x AND NOT y".
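
A quick sketch of that kind of rewrite, with expressions as nested Python tuples and only theorem 10 implemented, just to show the idea:

```python
# Expressions are nested tuples: ('or', a, b), ('and', a, b), ('not', a),
# or a plain string leaf like 'sunny'.
def simplify(expr):
    """Apply theorem 10, x OR (NOT x AND y) = x OR y, bottom-up.
    A real simplifier would use the whole theorem list."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]
    if op == "or" and len(args) == 2:
        x, rhs = args
        if (isinstance(rhs, tuple) and rhs[0] == "and"
                and rhs[1] == ("not", x)):
            return ("or", x, rhs[2])
    return (op, *args)

print(simplify(("or", "sunny", ("and", ("not", "sunny"), "hot"))))
# -> ('or', 'sunny', 'hot')
```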

kirkd    505
I just wanted to pitch in my $0.02 on standard decision tree induction vs evolutionary decision tree induction.

Standard induction is very simple and involves looking for the best possible split at each node. Starting at the root, we ask the question: which split of the data gives me the best purity in the child nodes? Purity here is related to accuracy - how can I best split the data so that the resulting partition gives me the best possible grouping of classes in the child nodes? This is repeated at each node until you reach a termination criterion - not enough observations left to split, the tree is at maximum depth, the node is pure (all one class), etc.
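
In sketch form, that greedy step looks something like this (Python; the `impurity` callback and the tiny example data just stand in for whichever splitting rule and dataset you use):

```python
def best_split(rows, labels, impurity):
    """Greedy step: try every (feature, threshold) split of the data and keep
    the one that most reduces the weighted impurity of the two child nodes."""
    parent = impurity(labels)
    best = None                      # (impurity reduction, feature, threshold)
    for f in range(len(rows[0])):
        for threshold in sorted(set(r[f] for r in rows)):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= threshold]
            right = [lab for r, lab in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            w = len(left) / len(labels)
            child = w * impurity(left) + (1 - w) * impurity(right)
            gain = parent - child    # how much impurity the split removes
            if best is None or gain > best[0]:
                best = (gain, f, threshold)
    return best

if __name__ == "__main__":
    rows = [[1.0], [2.0], [3.0], [4.0]]
    labels = ["a", "a", "b", "b"]
    # Misclassification rate as a stand-in impurity measure.
    err = lambda ls: 1.0 - max(ls.count(c) for c in set(ls)) / len(ls)
    print(best_split(rows, labels, err))   # -> (0.5, 0, 2.0)
```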

The problem I have with this is that it is a strictly greedy induction. The question I asked was: what if choosing a suboptimal split at some early point in the tree gives me a better set of options later in the tree? You could address this with look-ahead techniques, but that becomes very computationally expensive. I chose evolutionary methods because I am effectively evaluating the whole tree at once rather than a single split at a time. This lets us capture interactions within the data that may not be obvious from the greedy approach.

-Kirk

Steadtler    220
Quote:
Original post by kirkd
I just wanted to pitch in my $0.02 on standard decision tree induction vs evolutionary decision tree induction.

Standard induction is very simple and involves looking for the best possible split at each node. Starting at the root, we ask the question: which split of the data gives me the best purity in the child nodes? -Kirk


That was a long time ago, but I remember studying decision tree building based on information theory (ID3). If I remember correctly, by choosing the split that maximizes the information gain, you tend to build an optimally balanced tree -- optimal in the Occam's razor sense. While it does not always yield the optimal tree, it's a pretty damn good heuristic. Is information gain (difference in entropy) what you defined as "purity", or are you talking about Gini impurity? I'm guessing the latter.

Unfortunately, access to your article is protected. Exactly what decision tree classifier did you compare it against? The abstract only mentions "recursive partitioning", which can mean a bunch of algos. I'm guessing CART, but did you compare it to ID3, C4.5 or C5? (God, I hate when reviewers ask me those questions.)

Also, the abstract only mentions classification accuracy. What about the depth of the produced trees? The whole purpose of using decision trees is to reduce the number of features you have to consider to classify an example...

Forgive my curiosity, I'd very much like to read the article.

kirkd    505
Steadtler,

Thanks for the interest and the questions. I'm happy to provide answers as best as I can.

Quote:

That was a long time ago, but I remember studying decision tree building based on information theory (ID3). If I remember correctly, by choosing the split that maximizes the information gain, you tend to build an optimally balanced tree -- optimal in the Occam's razor sense. While it does not always yield the optimal tree, it's a pretty damn good heuristic. Is information gain (difference in entropy) what you defined as "purity", or are you talking about Gini impurity? I'm guessing the latter.


I left my original reference to "purity" fairly vague due to the enormous number of possibilities here. Unfortunately, smart folks like you pick up on those things and start asking questions. 8^) Seriously though, the most common purity measures I've seen are accuracy, Gini impurity, and the twoing rule. For those not familiar with these: accuracy simply tries to get as many observations of the same class as possible into each child node. It is very straightforward and easy to understand, but in my experience trees trained with accuracy as the splitting rule tend to do poorly on external data. Gini impurity, if I remember correctly, has the effect of trying to separate the largest class from all the others at each split. The twoing rule tries to divide the dataset into equal halves with as little class mixing as possible. You also mentioned information gain, which I believe is the rule used in C4.5 and C5 from Quinlan.

I've seen various papers comparing splitting rules, and it seems the no-free-lunch theorem reigns supreme: no one splitting rule always performs better than the others across datasets. Gini, twoing, and information gain seem to have emerged as the pick of the litter, and I almost never see other options.
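
For reference, the per-node measures boil down to something like this sketch (I'm glossing over details, and the twoing rule is defined over a whole split rather than a single node, so it's left out):

```python
from math import log2

def class_fractions(labels):
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]

def misclassification(labels):
    """'Accuracy' splitting: fraction of the node not in the majority class."""
    return 1.0 - max(class_fractions(labels))

def gini(labels):
    """Gini impurity: chance that two random draws from the node disagree."""
    return 1.0 - sum(p * p for p in class_fractions(labels))

def entropy(labels):
    """Entropy in bits; information gain is the drop in weighted entropy
    from parent to children, which is what C4.5/C5 maximize."""
    return -sum(p * log2(p) for p in class_fractions(labels) if p > 0)

labels = ["a", "a", "b"]
print(misclassification(labels), gini(labels), entropy(labels))
```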

For my evolutionary programming method, each tree is given a Minimum Description Length score. This is equivalent to what Quinlan uses in C4.5 for pruning and, without going into too much detail, is a measure of accuracy on the training set and size of the tree. More accurate, smaller trees give lower scores, so I minimize the MDL score.

Quote:

Unfortunately, access to your article is protected. Exactly what decision tree classifier did you compare it against? The abstract only mentions "recursive partitioning", which can mean a bunch of algos. I'm guessing CART, but did you compare it to ID3, C4.5 or C5? (God, I hate when reviewers ask me those questions.)


I had forgotten that you need a subscription to get to my article. Send me a PM with your e-mail address and I'll send you a PDF reprint.

The comparison I used was CART as implemented in S-PLUS v7. I did not do a direct comparison against the others. I made the assumption that the various methods would likely perform similarly. Maybe not a valid assumption, but... Generally speaking, the evolutionary programming method seems to outperform most other methods I've compared it to across all datasets I've worked with, including SVM, NN, k-NN, etc.

Quote:

Also, the abstract only mentions classification accuracy. What about the depth of the produced trees?


Very good point. The way I did my comparisons was to impose different stopping criteria - 5 observations per node, 5% of the training set size, and 10% of the training set size. The CART trees do well when you grow them down to 5 observations per node and then prune back. They do less well when I force the leaves to have no less than 5% of the training set size, and they are almost useless with 10% of the training set size per leaf. Evolutionary programming almost always gets the same accuracy for 5% and 10% stopping criteria, and this is usually better than for any of the CART methods. When I use 10% training set size as the minimum leaf size, the trees tend to have a depth of around 3-5 with about 6-8 splits total.

I should also mention that the accuracy I'm referring to is on an external test set. I did all the original studies using 5-fold cross-validation, in which I held out a random 20% of the data, induced the trees on the remaining 80%, and then evaluated accuracy on the 20% held out. The results are then averaged over the 5 folds. I typically also repeat the 5-fold cross-validation 5 or 10 times just to make sure the averages aren't affected by a really bad split of the data. When I do this 5-fold cross-validation repeated 10 times, the average performance is almost identical to the median, and it always beats or ties the other methods on the external prediction sets. At least in my hands, that's how it works. (How's that for an academic reply??) 8^)
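
The resampling scheme itself is nothing exotic -- roughly this Python sketch, with `train` and `accuracy` standing in for the actual tree induction and scoring:

```python
import random

def repeated_kfold_cv(data, labels, train, accuracy, k=5, repeats=10, seed=0):
    """k-fold cross-validation repeated several times: each repeat shuffles,
    splits into k folds, holds each fold out in turn, trains on the rest,
    and the held-out accuracies are averaged over everything."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in idx if i not in held]
            model = train([data[i] for i in train_idx],
                          [labels[i] for i in train_idx])
            scores.append(accuracy(model,
                                   [data[i] for i in held_out],
                                   [labels[i] for i in held_out]))
    return sum(scores) / len(scores)
```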

Quote:

The whole purpose of using decision trees is to reduce the number of features you have to consider to classify an example...


This brings up another very interesting point. I looked at feature usage over a large set of trees and saw some very interesting results. The procedure was to grow a large number of trees by evolutionary programming, say 100, and then count the number of times each feature occurs across the whole set. The idea was that the stochastic nature of evolutionary programming leads to different trees from run to run, but if some features are truly more important than others, they should be selected more often. From the usage frequencies and the total number of features used, I compute the probability that a feature would occur that many times by chance; with n features, the expected selection probability is 1/n if features were being chosen purely at random. What I end up with is a small set of features that occur so often that the probability of random occurrence is effectively 0. Not too surprising, but what I didn't anticipate was that a very large set of features occur zero times. These features were effectively selected against, as they did not contribute to small, accurate trees. If I then go back and reduce the dataset to only the features that occurred frequently, I get another boost of about 2-3%. One dataset I remember specifically had 197 features, and only 19 of those occurred more often than would be expected by chance. Not bad for feature reduction, I think.
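
The bookkeeping for that is nothing fancy -- roughly this sketch, where the uniform 1/n null model is the assumption described above and the tiny example data is made up:

```python
from math import comb
from collections import Counter

def feature_usage_pvalues(trees_features, n_features):
    """trees_features: one list of feature indices per evolved tree.
    For each feature, compute the binomial probability of it being picked
    at least as often as observed if every pick were uniform (p = 1/n)."""
    picks = [f for tree in trees_features for f in tree]
    counts = Counter(picks)
    total = len(picks)
    p = 1.0 / n_features
    pvals = {}
    for feature in range(n_features):
        k = counts.get(feature, 0)
        # Upper-tail binomial probability P(X >= k).
        pvals[feature] = sum(comb(total, i) * p**i * (1 - p)**(total - i)
                             for i in range(k, total + 1))
    return pvals

# Tiny example: 3 "trees" drawing from 5 features; feature 0 dominates.
print(feature_usage_pvalues([[0, 0, 1], [0, 2], [0, 0]], n_features=5))
```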

And, just to make this post even longer (!!!), I have been working on combining trees into an ensemble or forest. From comparisons across 6 datasets (for which I have access to comparisons with other methods like SVM, NN, k-NN, CART, Random Forests, and stochastic gradient boosting), my evolutionary programming method is usually within 1-2% of the best method, Random Forests. Sometimes a little higher, sometimes a little lower, so effectively the same. The nice part is that my forests tend to be around 15-20 trees, while Random Forests usually need ~100 or more.

If you're still awake, congratulations! I've been working on this stuff for quite a while now, so I tend to get passionate when people ask questions. If you have the strength, feel free to ask questions.

-Kirk

Steadtler    220
Quote:
Original post by kirkd
If you're still awake, congratulations! I've been working on this stuff for quite a while now, so I tend to get passionate when people ask questions. If you have the strength, feel free to ask questions.

-Kirk


I'm still awake, what with coffee and all. Thanks for the answers, you can soak your bloody fingers in rose-perfumed water now :) Looks very interesting, though I am always sceptical of evolutionary methods (which are essentially a gradient search with a lot of randomness involved, IMO). Learning methods are a b*tch to study, as people tend not to know very well the methods they compare their own research against*, so thanks a lot, you just made me a better researcher ;)


*I've seen a paper comparing a NN with a SVM using a -linear- kernel... sad.


kirkd    505
SVM with linear kernel vs NN? Wow. That is sad.

I agree that evolutionary computation seems to get applied to everything in the world when it probably should only be applied to a few specific cases. In my case, I think EC takes some of the greedy out of tree induction. Just to test the randomness of it all, I set up a very long run of purely random tree induction with no EC steps. The purpose was to see if I could generate trees at random and eventually find one as good as those that were evolved. Even after letting it run for days (hundreds of millions of random trees evaluated), I never found one that matched the evolved trees.

As a general modeling procedure, this one seems to work very well, and I have yet to find anything that consistently outperforms it. That being said, it is probably very dependent on my domain of research. Computational chemistry tends to require models of very non-linear, multiple mechanism-based, strongly interdependent systems with features that never occur in "nice" distributions. ugh. Why do I do this stuff??

-Kirk

Timkin    864
Quote:
Original post by kirkd
For my evolutionary programming method, each tree is given a Minimum Description Length score.


Why, particularly, did you use MDL as opposed to MML?

As for the chicks digging this stuff... who? Where? How do I meet them? My wife just frowns and walks away whenever she stops to peer over my shoulder at what I'm working on. She doesn't even bother to ask 'what does that all mean' any more! 8(

Cheers,

Timkin

kirkd    505
Yeah, I get a similar response from my wife. Usually it's more of a general glazing of the eyes, and occasionally I have to wake her up after I've been talking for a while. Ah, well...

As for MML vs MDL: the formulation for MDL uses a binomial distribution over the predictions, which I can reasonably estimate with combinatorics. This gives a handle on the probability of having m incorrect predictions out of n observations (n choose m). The length of the model is very straightforward and can ultimately be condensed down to the number of non-leaf nodes in the tree. I can do this because every tree has access to the same data, so the cost of selecting features cancels out between trees - every tree has the same selection opportunity.
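
The shape of the score, very roughly, is something like this sketch -- the log2(n+1) term for encoding the error count and the per-node bit cost are stand-ins, not my actual constants:

```python
from math import comb, log2

def mdl_score(n_observations, n_errors, n_internal_nodes, bits_per_node=4.0):
    """Rough shape of an MDL-style score: bits to say which of the n
    observations the tree gets wrong (a binomial/combinatorial code),
    plus a cost per internal node for the tree itself."""
    error_bits = log2(n_observations + 1) + log2(comb(n_observations, n_errors))
    model_bits = bits_per_node * n_internal_nodes
    return error_bits + model_bits

# Smaller, more accurate trees score lower:
print(mdl_score(100, 5, 4), mdl_score(100, 5, 12), mdl_score(100, 20, 4))
```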

The only formulation I've seen for MML is based on Bayesian inference and probability distributions of classes in the feature space, so I need to have an expected distribution for the data. The features I use in computational chemistry are never Gaussian and many are binary features (is this chemical substructure present or not?). I could easily get away from Gaussian requirements by using a different distribution as the expected distribution and keeping things non-parametric (naive Bayes, for example), but not all features have the same distribution and worse yet, the distributions are quite often very ugly.

Timkin    864
OT:

Quote:
Original post by Steadtler
Computer science graduate labs? That's where I usually meet them :) You'd be surprised... and no, engineering doesn't work.


You mean you've tried engineering them yourself? I think I saw a movie about that once that had Kelly LeBrock in it! ;)

I must have been unlucky... the IT grad-chicks that I knew were nothing special and couldn't talk about anything beyond their thesis... now, the girls in atmospheric science and astrophysics, where I started out... they were true geek-girls (and cute)!

Steadtler    220
Quote:
Original post by Timkin
OT:

Quote:
Original post by Steadtler
Computer science graduate labs? That's where I usually meet them :) You'd be surprised... and no, engineering doesn't work.


You mean you've tried engineering them yourself? I think I saw a movie about that once that had Kelly LeBrock in it! ;)

I must have been unlucky... the IT grad-chicks that I knew were nothing special and couldn't talk about anything beyond their thesis... now, the girls in atmospheric science and astrophysics, where I started out... they were true geek-girls (and cute)!


Haha, no, I meant that most women in engineering faculties are bitter and resentful toward humanity in general, and men in particular. More than usual, I mean. And it's mostly the guys' fault...

Yes, physics girls are hot, but usually too weird for me.

Back to decision trees. Say I have a problem with a lot of items that have the same variable values but belong to different classes. Most DTs would treat that as noise. Now I want to consider my set as a distribution instead. I guess what I'm thinking about is a Bayesian decision tree, but I'm not sure it's the same thing. Initially, all variables are free. I want to build the tree by finding, at each node, the variable that, by fixing its value, has the greatest effect on the posterior distribution. The stopping criteria would be 1) negligible effect on the posterior distribution, or 2) only one class in the posterior distribution.

Does that make sense? Is that what they call a Bayesian DT? Know of any work similar to this?

thanks guys.

kirkd    505
I haven't done anything with Bayesian decision trees, but what you describe sounds like standard decision tree induction to me. At each node the induction algorithm tries to find the "best" possible split, and how you define "best" is up to you - so my guess is that you could easily create a split criterion tied to the distributional effect you're after.
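
For instance, one easy-to-compute stand-in for "effect on the posterior distribution" is the total variation distance between the class distribution before and after fixing a variable's value -- something like this sketch (total variation is just one choice; KL divergence or anything similar would slot in the same way):

```python
def class_distribution(labels):
    n = len(labels)
    return {c: labels.count(c) / n for c in set(labels)}

def distribution_shift(parent_labels, child_labels):
    """Total variation distance between the class distribution before and
    after conditioning on a variable's value."""
    p = class_distribution(parent_labels)
    q = class_distribution(child_labels)
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)

# Split on the variable with the largest shift; stop when the best shift is
# negligible (criterion 1) or only one class remains (criterion 2).
print(distribution_shift(["a", "a", "b", "b"], ["a", "a", "a", "b"]))  # 0.25
```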

-kirk

