To Timkin and anybody else interested in Reinforcement Learning in Pacman
I was going to write a nice email asking some interesting questions, but I see I've been beaten to the punch!
On the issue of training multiple ghosts, I agree with Alex on this one. In one iteration of your algorithm you actually have four instances (if you have four ghosts) of a state transition and reward. If the ghosts have no information about the state of the other ghosts, then training four ghosts is equivalent to training a single ghost and using the resulting policy for any new ghosts added to the game. If, however, each ghost has information about the state of the other ghosts, then you are in a better situation. You do need to make sure that the reward offered to each ghost takes into account the value of that ghost driving the pacman toward another ghost. This is not trivial to evaluate: how do you correctly deduce which ghost caused the pacman to head in that direction? Reasonable approximations could be made, though. Once you have this, training four ghosts together should deduce an optimal policy that incorporates cooperation between the ghosts.
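To make the distinction concrete, here's a rough Python sketch of the two training setups; all the names here (q_update, shared_Q and so on) are invented for illustration, not taken from any library:

from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9          # learning rate and discount factor
ACTIONS = ["up", "down", "left", "right"]

def q_update(Q, state, action, reward, next_state):
    # Standard one-step Q-learning backup.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Case 1: ghosts see only the pacman. One shared table means the four
# transitions produced per game tick all train the same single policy:
shared_Q = defaultdict(float)
# for g in ghosts:  # 'ghosts' stands for whatever your game loop provides
#     q_update(shared_Q, g.state, g.action, g.reward, g.next_state)

# Case 2: each ghost's state includes the other ghosts' positions, so
# each ghost keeps its own table and can learn a distinct role:
per_ghost_Q = [defaultdict(float) for _ in range(4)]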
Timkin
Couldn't the reward function just be a constant negative value for any state where pacman is not caught? Then for any given episode, to try to maximize reward the ghosts would want to catch pacman as quickly as possible.
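In code, that reward scheme is just something like this (the magnitudes are arbitrary):

# Constant penalty every step pacman remains free; bonus on capture.
def reward(pacman_caught):
    return 100.0 if pacman_caught else -1.0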
Alex,
Yeah, I did have to code the whole GUI myself (unfortunately), which meant that I had far less time to spend on the AI.
quote:Original post by Anonymous Poster
Couldn't the reward function just be a constant negative value for any state where pacman is not caught? Then for any given episode, to try to maximize reward the ghosts would want to catch pacman as quickly as possible.
Not really. What we're discussing here is a form of the credit assignment problem: how does one reward the actions that set the stage for the winning action? Since this is a Markov decision problem, the sequence of decisions taken is as important as the last decision taken, although obviously, due to the number of paths that might lead to the successful goal state, the further back one goes in the sequence of actions, the less important each move was.
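(Formally, this is just the usual discounted return, where a discount factor \gamma weights later rewards less:

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma < 1

so a reward arriving k steps after an action contributes only \gamma^k of its value to that action.)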
But then this is only generally true, not always. What if there is an opportunity to force the pacman into a corner that has only one entry and exit point? The move that forces the pacman through that state would be considered more important than some of the moves within that enclosed area, because it created a scenario where it was easier to attain the goal (by fencing in the set of possible states the pacman could choose from).
In some approaches to determining the value of an action at the current time, the future is predicted either to the end of the game or to some finite horizon. For example, in tic-tac-toe you can easily evaluate all of the possible endings given a current move; the proportion of those that lead to a win gives the relative value of the current move. In chess it's much harder, since you cannot evaluate all possible endings (except for endgames, which are worked out offline), so you need to do something else. I'm not an expert on chess evaluation functions, although I know we have a few people around here who might like to chime in if more info is desired.
Choosing the action that maximises this local value approximation is a good strategy to follow. However, a better one is to choose the action that also minimises the opponent's expected value. This equates to reducing their access to winning states by fencing them into a section of the state space with low success density.
Okay, I've rambled a bit too much now... there are some interesting (attempted) solutions to the credit assignment problem, and mathematically at least the optimal solution is possible... but it involves computing the Kolmogorov complexity of the state space, and that is not tractable.
Cheers,
Timkin
Isn't avoiding the credit assignment problem one of the benefits of reinforcement learning? If all states where pacman is not caught are given a negative reward and the state where pacman is caught is given a positive one, then the learning algorithm should be able to determine that a state preceding the capture, even though its immediate reward is negative, leads on to the capture and the positive reward that comes with it. States or actions that almost always lead to the capture of pacman would therefore end up with high values despite their negative immediate rewards.
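To make that concrete, here's a tiny made-up example: a three-state chain s0 -> s1 -> s2 -> caught, with -1 per step and +100 for the capturing move. Repeated temporal-difference backups push the terminal reward back through the states that only ever pay -1 themselves:

ALPHA, GAMMA = 0.5, 0.9
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0, "caught": 0.0}

def backup(s, r, s_next):
    # One-step temporal-difference update of the state value.
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

for episode in range(200):
    backup("s0", -1.0, "s1")
    backup("s1", -1.0, "s2")
    backup("s2", 100.0, "caught")

print(V)  # s0 and s1 converge to strongly positive values despite their -1 rewards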
Trying to arbitrarily assign credits to states that the programmer thinks are good might result in faster learning, but then the agent isn't really learning as much because the programmer has already given it a partial solution.
quote:Original post by Anonymous Poster
Isn't avoiding the credit assignment problem one of the benefits of reinforcement learning?
My apologies... you are correct. I'm rather sleep deprived at the moment (actually, continuously), so I wasn't thinking very clearly when I wrote that yesterday and I was thinking of the wrong problem. I'm not convinced, though, that a constant negative value is appropriate in the situation of multiple ghosts, since the actions of one ghost - while not giving it reward - could result in the success of another ghost, even though that ghost's action wouldn't normally catch the pacman. I guess the solution to that would be to share the reward between all ghosts... but then that's not individual learning and cooperative behaviour, that's just a single ghost in four places at once.
I'm going to have to give this more thought...
Timkin
To get the ghosts to learn to cooperate, they'd need to know each other's positions, just like they know the position of pacman. And their goal is to catch pacman as fast as possible, so the only really relevant thing to the reward of any ghost is that not having caught pacman is bad and catching pacman is good, so they should all be rewarded the same way. Each ghost could have its own internal representation of the world, kept independent of the representations built by the other ghosts, so that the ghosts can develop individual personalities that work together to catch pacman. But now the task to be learned is much more difficult, since each ghost has to learn to act based on the positions of the other ghosts as well. Getting the learning to generalize could also be very difficult, but it usually is, so that's nothing new. And there is the possibility that two ghosts learn to cooperate while the other two wander around randomly. I think that would be an interesting result... social loafing in pacman ghosts...
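For the state representation, each ghost would then be learning over something like this (illustrative Python, not from any particular codebase):

def ghost_state(ghost_id, ghost_positions, pacman_pos):
    # A ghost's state: its own position, pacman's position, and the
    # positions of all the other ghosts.
    others = tuple(p for i, p in enumerate(ghost_positions) if i != ghost_id)
    return (ghost_positions[ghost_id], pacman_pos, others)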
Actually, for true cooperation, each ghost would not only need to know the state of the other ghosts, but also the decision made by the other ghosts in the current iteration... and you can see how this is a catch-22 scenario: I can't decide until you've decided, but you can't decide until I've decided. There are solutions, but they generally involve a deliberative negotiation between the agents...
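One simpler alternative to full negotiation is to impose a fixed decision order within each tick, so each ghost conditions on the moves already chosen before it. A sketch, with a placeholder Ghost class standing in for a real learner:

class Ghost:
    def choose(self, state, decided):
        # A real ghost would pick a move conditioned on 'decided',
        # the moves already chosen by earlier ghosts this tick.
        return "up"

ghosts, state = [Ghost() for _ in range(4)], None
decided = []
for ghost in ghosts:                  # fixed ordering breaks the deadlock
    decided.append(ghost.choose(state, decided))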
Timkin