I'm working on yet another probability problem which I've just begun to tackle. I thought I would try to describe it here to help get me started in the right direction. Here's a general depiction of the problem.
I have a collection of decision trees that were developed to model a particular dataset. The decision tree generation algorithm uses evolutionary computation to develop the trees, so there is a heavy random component as well as selection pressure for the trees to find variables that are useful in modeling. I control the size of the trees so that they don't become excessively large, so presumably the chosen variables are useful rather than just arbitrarily splitting the dataset, as might happen with excessively large trees.
If I run this algorithm 1000 times, I get 1000 different trees due to the random nature. Occasionally I might see the exact same tree, some very similar trees, or even very different trees. The first question I asked was: if I take these 1000 trees and look at the variables that were chosen over the whole population, can I find variables that occur more often than would be expected by chance? The way I did this is as follows:
n = number of variables to choose from in the original dataset
n_dt = total number of variables chosen across all decision trees - essentially the sum of all nodes across trees
f = frequency of choosing a particular variable
In words: what is the probability, under random choice, that a particular variable was chosen f times, given that I chose n_dt total times from n possible variables? This boils down to this type of calculation:
P = nCx * p^x * q^(n-x)   (the generic binomial form: n trials, x successes, success probability p, q = 1 - p)
or in this case:
P = (n_dt)C(f) * (1/n)^f * (1 - 1/n)^(n_dt - f)
This equation works perfectly in simulations with large enough sampling.
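For reference, here is a minimal Python sketch of this calculation. The function names and the example numbers are made up for illustration, and it assumes every variable is equally likely to be picked at each node (probability 1/n), as in the formula above. It works in log space because the binomial coefficient gets too large for floating point when n_dt is in the thousands. It also computes the upper tail (f or more occurrences), which may be closer to the "more often than expected by chance" question than the probability of exactly f.

```python
from math import exp, lgamma, log, log1p

def binom_pmf(k, trials, p):
    """Probability of exactly k successes in `trials` draws, for 0 < p < 1."""
    log_coeff = lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
    return exp(log_coeff + k * log(p) + (trials - k) * log1p(-p))

def binom_upper_tail(k, trials, p):
    """Probability of k or more successes in `trials` draws."""
    return sum(binom_pmf(i, trials, p) for i in range(k, trials + 1))

# Hypothetical numbers, not taken from the real dataset.
n = 50        # variables available in the original dataset
n_dt = 5000   # total nodes summed over all trees
f = 130       # times one particular variable was chosen

print("P(exactly f) =", binom_pmf(f, n_dt, 1 / n))
print("P(f or more) =", binom_upper_tail(f, n_dt, 1 / n))
```

With these numbers the expected count per variable is n_dt / n = 100, so the tail value indicates how unusual a count of 130 would be under purely random choice.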
Not surprisingly, I've noticed that some variables often occur together, showing the co-operative and non-linear nature of the data. So now I ask a new question: given my 1000 evolved trees, what is the probability that two variables occur in the same tree in the same relative position to each other? This second question seems quite complex and will need information on the size of the tree in which they occur, the sizes of all other trees in the sample, and all the possible ways these two variables could be positioned relative to each other in all possible trees. To simplify the question and the calculations, I think I can instead ask: what is the probability that two variables occur in the same tree, regardless of relative position? This would effectively give a maximum probability value for the original question.
I'm just starting to contemplate this one, so any advice/insight is appreciated.
Edit: After thinking about it, is it really this simple?
np = total number of possible pairs from the original dataset (n^2)
tp_o = total pairs observed across all decision trees
p = total times this particular pair was observed
P = (tp_o)C(p) * (1/np)^p * (1 - 1/np)^(tp_o - p)
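A rough Python sketch of this pair calculation follows. It is only an illustration under stated assumptions: each tree is represented simply as the list of variable names it uses (a hypothetical representation), pairs are treated as unordered and position is ignored, and the helper names are invented. One thing the sketch makes visible is that the count of possible pairs depends on the convention: n^2 counts ordered pairs including a variable paired with itself, while unordered pairs of distinct variables number n(n-1)/2, which is what the example uses. Also, pairs taken from the same tree share variables and are not independent draws, so the binomial value is at best an approximation.

```python
from collections import Counter
from itertools import combinations
from math import exp, lgamma, log, log1p

def count_pairs(trees):
    """Count, for each unordered pair of variables, the trees containing both."""
    counts = Counter()
    for variables in trees:
        # set() drops repeats within a tree; sorted() gives each pair one canonical order
        counts.update(combinations(sorted(set(variables)), 2))
    return counts

def pair_probability(p_obs, tp_o, n_pairs):
    """Binomial probability of seeing one pair exactly p_obs times (n_pairs > 1)."""
    q = 1 / n_pairs
    log_coeff = lgamma(tp_o + 1) - lgamma(p_obs + 1) - lgamma(tp_o - p_obs + 1)
    return exp(log_coeff + p_obs * log(q) + (tp_o - p_obs) * log1p(-q))

# Hypothetical toy population of four trees over n = 4 variables.
trees = [["x1", "x2", "x3"], ["x1", "x2"], ["x2", "x4"], ["x1", "x2", "x4"]]
n = 4
counts = count_pairs(trees)
tp_o = sum(counts.values())      # total pairs observed across all trees
n_pairs = n * (n - 1) // 2       # unordered distinct pairs (assumption, see above)
p_obs = counts[("x1", "x2")]     # times this particular pair was observed

print(p_obs, tp_o, pair_probability(p_obs, tp_o, n_pairs))
```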
Now, how about getting the specific relative position involved?