OK, when I said, "knows nothing about the particular game," I was only partially right. The Brain engine itself knows nothing about what game it plays; rather, it consults an outside source to find a legal move and to see if it's won (basically, it calls GameGetValidMove() and GameCheckGameOver(), where these functions are defined outside the scope of TheBrain).
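For what it's worth, here is a minimal sketch of how that separation might look in C. Only the names GameGetValidMove() and GameCheckGameOver() come from the post above; the signatures and return conventions are my own assumptions for illustration:

    /* These two are defined by the game module, outside TheBrain's scope
       (the names come from the post; the signatures are an assumption). */
    extern int GameGetValidMove(const int *state);  /* returns a legal move, or -1 if none */
    extern int GameCheckGameOver(const int *state); /* returns 1 if the game is over, else 0 */

    /* TheBrain's turn logic never inspects the game's rules itself,
       so the engine stays completely game-agnostic. */
    int BrainTakeTurn(const int *state)
    {
        if (GameCheckGameOver(state))
            return -1; /* game already over, nothing to do */
        return GameGetValidMove(state);
    }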
This gets back to a basic question about AI. Would we be at all intelligent if we didn't have sensors? I'm not going to divert this thread by following on with such a discussion, although maybe I will start another thread.
It seems to me that there is a flaw in your model.
quote:Original post by BeerNutts Basically, TheBrain plays a game by consulting its past game experiences, and if it has been in the same state as the current game state, it makes a decision based on what move it made next, and what it resulted in. Once the game is over, TheBrain traverses each move in the current game, and marks the state based on whether it won or lost. This way, all the agent needs to know is what moves are valid, and if the game has been won or lost. It does not need to know any strategies related to a particular game, only how to move, and what constitutes winning.
This ignores the fact that there is another player involved. It is certainly possible that state XYZ occurs in games that I win and in games that my opponent wins. Let's assume that XYZ occurs more often in games I win than in games I lose. Now suppose I have seen this state only once, took the optimal action, but still lost the game due to my opponent's later actions; this is a false-negative training case. That is, it suggests to the brain that I will lose from this state when, based on likelihoods computed from the full game tree, I should probably win.
Now, I'm certainly no AI expert by any stretch of the imagination, but I can't help but think that the point you make, Timkin, applies equally to absolutely any learning system that models non-deterministic state transitions. This isn't unique to computerised learning; human and animal learning work this way too. False associations are a common theme in many phobias, etc. I expect the idea is, as with any statistical method, that repeated use will create a more representative data set, which should 'iron out' any blips in the data.
quote:Original post by Kylotan Now, I'm certainly no AI expert by any stretch of the imagination, but I can't help but think that the point you make, Timkin, applies equally to absolutely any learning system that models non-deterministic state transitions.
Well, certainly the issue of false positives and false negatives is relevant to any classification/learning problem, and yes, primarily those based on probabilities. What I was suggesting, though, was that simply choosing an action based on whether the game was won or lost the last time that action was performed is not really an indicator of the action's viability. This gets back to the problem of payoff in games. Payoff only occurs when a final state is reached (win or loss). We need some way of determining which states really led to the final state and which states were merely detours of no consequence. By marking all states visited (and all associated actions) as forming either a win or a loss strategy, we unduly reward or penalise those detour states (which may indeed be an important part of a different win/loss strategy).
Wonderful points brought up by both of you. Here's what I was thinking when I created TheBrain. It is a crude attempt at modelling (at a very low level) the experiences we as humans face. If we touch a hot stove, we get burned, so we know not to touch it again. That was exactly what I wanted to model, except that in any board game the brain doesn't care about anything except winning and losing (like you said). But realize the brain doesn't just use the "last action" from a state. It uses all the actions performed from that state, and uses all the past moves as decisions. If all the actions from that state resulted in an undesirable end, then it will choose a new, unexplored action (if available). So, like Kylotan said, if it plays enough games, it should come to pick the optimal solution.
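A rough sketch of what that kind of memory lookup could look like in C; none of these names or types come from the real TheBrain code, they are just one plausible shape for "prefer the best previously tried action, otherwise try something unexplored":

    /* Hypothetical memory entry: one (state, action) pair with win/loss tallies. */
    typedef struct {
        int state_id; /* hashed/encoded board state */
        int action;   /* move taken from that state */
        int wins;     /* games won after taking it  */
        int losses;   /* games lost after taking it */
    } Experience;

    /* Pick the action with the best win record from this state, or an
       unexplored valid action if every known one has ended badly. */
    int BrainChooseAction(const Experience *mem, int n, int state_id,
                          const int *valid, int num_valid)
    {
        int best = -1, best_score = -1000000;
        for (int i = 0; i < n; i++) {
            if (mem[i].state_id != state_id) continue;
            int score = mem[i].wins - mem[i].losses;
            if (score > best_score) { best_score = score; best = mem[i].action; }
        }
        /* If everything tried from here has net losses (or nothing has
           been tried), prefer a valid move with no memory entry yet. */
        if (best_score < 0) {
            for (int v = 0; v < num_valid; v++) {
                int seen = 0;
                for (int i = 0; i < n; i++)
                    if (mem[i].state_id == state_id && mem[i].action == valid[v])
                        seen = 1;
                if (!seen) return valid[v];
            }
        }
        return best;
    }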
Currently, with the tic-tac-toe game, starting from an empty brain (one that has never played before), it takes about 400 games against a "random" computer opponent (one that makes random moves, except when it can win) before the brain picks the optimal actions to win.
I haven't thought about finding ways to remove "detour moves", but I really don't care about it either. Like I said earlier, all the brain cares about is winning and losing, and if it takes a few detour moves that lead to a win, then it (and I) will be happy.
My question is: can't you weight just the path from state to state and then "pay off" all paths that generate the same win? More to the shortest path, less to the longer one, on the theory that longer paths are more likely to contain mistakes or allow a comeback?
Talking more about the issue of pay-off (reward) - as this is fundamental to the performance of any learning system - here are two strategies you can try out.
Both are applied after a result is obtained.
The first pays off each action at an exponentially declining rate based on its distance from the final game state. This basically means that actions that set the stage for a win are given a small reward while actions that actually achieve the win are given a larger reward.
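To make the first strategy concrete, here is a minimal sketch in C; the decay rate of 0.5 is an arbitrary assumption of mine, not something prescribed above:

    #include <math.h>

    /* Pay off one action from the finished game, decaying exponentially
       with its distance from the final state. moves_from_end == 0 is the
       move that ended the game; base_payoff would be negative for a loss. */
    double ActionPayoff(double base_payoff, int moves_from_end)
    {
        const double decay = 0.5; /* arbitrary rate; tune per game */
        return base_payoff * exp(-decay * moves_from_end);
    }

With base_payoff = 1.0, the winning move gets 1.0, the move before it about 0.61, the one before that about 0.37, and so on.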
The second method requires computing the minimal plan (set of actions) from the start to the particular end state (once that state is known), and then rewarding the actual plan using a base amount that is reduced by how far the actual plan was from the optimal plan. For instance, if the ratio of the number of actions in the actual plan to the number in the optimal plan is 'x', then the payoff might be (for an exponentially declining payoff):
payoff = base_payoff * e^(1 - x)
Note: 'x' is always >= 1, so the maximum multiplier on the base is 1 and the minimum is zero (when the ratio of moves is infinite).
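In code, assuming you already have the two plan lengths (computing the optimal plan is the hard, game-specific part, which this sketch takes as given):

    #include <math.h>

    /* Payoff for an entire plan, scaled by how close it was to optimal.
       actual_len and optimal_len are action counts, actual_len >= optimal_len. */
    double PlanPayoff(double base_payoff, int actual_len, int optimal_len)
    {
        double x = (double)actual_len / (double)optimal_len; /* x >= 1 */
        return base_payoff * exp(1.0 - x); /* 1.0 when optimal, -> 0 as x grows */
    }

So optimal play earns the full base payoff, a plan twice as long as optimal earns about 37% of it, and very long-winded wins earn almost nothing.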
Thanks for the suggestion. I particularly like the idea of giving more points to states closer to the win. I wanted to keep the AI somewhat simple, and this option is still simple, but should prove more fruitful in the final product.
Guys, from what I've read so far, you are talking about a well-established learning model named 'reinforcement learning'. Its main features are that the agent has no model of the world and that the reward is 'delayed' (i.e. it arrives only at the end of the match). A famous agent for playing backgammon, Gerald Tesauro's TD-Gammon, used a technique of reinforcement learning known as temporal difference learning to learn to play backgammon well enough to defeat world-class players.
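For reference, the core of temporal difference learning in its simplest form (TD(0)) is just the following update, applied after every state transition. This is a generic textbook sketch, not TD-Gammon's actual code (TD-Gammon also used a neural network as its value function rather than a table):

    /* One TD(0) update: nudge the value estimate of the old state toward
       the reward received plus the discounted value of the new state.
       V is a table of value estimates indexed by state id. */
    void TDUpdate(double *V, int s, int s_next, double reward,
                  double alpha /* learning rate */, double gamma /* discount */)
    {
        double td_error = reward + gamma * V[s_next] - V[s];
        V[s] += alpha * td_error;
    }

With reward equal to 0 on every move except the last (+1 for a win, -1 for a loss), value gradually propagates backwards from the final states, which is one principled answer to the "detour move" problem discussed earlier in this thread.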
Yes, we are talking about reinforcement learning. In particular, we are discussing 'supervised learning'. Supervised learning is applicable to many different areas of AI, not just game playing.
Just to sort out terms, supervised learning is not the same as reinforcement learning, nor is one distinctly a subset of the other (though they are related).

Supervised learning occurs when a right answer is known and can be presented to the system so the system can adjust itself towards that answer; i.e., the system answers 7, the right answer is 9, and the system adjusts itself so that next time it will answer closer to 9 than 7.

Reinforcement learning occurs when the exact answer isn't known, only a general sense of how right the system's answer is; i.e., the system answers 7 and the feedback is "not bad, 8.5 out of 10".

Unsupervised learning occurs when the system is set up so that each part of it adjusts itself according to the values of the system around it, without any kind of explicit feedback from an outside teacher. An example is Hebbian learning in an ANN, where the synchronicity of the firing of two connected neurons affects the weighting of the connection between them.
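As a tiny illustration of that last (unsupervised) case, a basic Hebbian update needs no teacher and no reward at all; this is the textbook rule, not any particular library's API:

    /* Hebbian learning: strengthen the connection between two neurons in
       proportion to how strongly they fire together. Note there is no
       target value and no reward signal anywhere in the update. */
    void HebbianUpdate(double *weight, double pre_activity,
                       double post_activity, double learning_rate)
    {
        *weight += learning_rate * pre_activity * post_activity;
    }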