
pars

To Timkin and anybody else interested in Reinforcement Learning in Pacman


pars    122
I have been working on an implementation of Reinforcement Learning in Pacman for my final year project. You might remember that I asked a few questions on this topic a few months ago. Since Timkin was interested to know whether RL in Pacman works or not, I decided to post this topic.

I can say that my project was a success. As you can imagine, at the beginning the ghosts are dumb and just move around randomly, but after going through a series of training episodes their behaviour starts to improve, and if enough training is done they manage to catch Pacman efficiently.

The training, however, took ages: training the ghosts for 1 million episodes took around 20 hours. However, I saved all the data in a file so that it could be loaded into memory the next time the game is run, so there is no need to repeat the training.

In the training I only used one ghost - not because it was impossible to include all four ghosts, but because for RL to be effective the training would then have to run for a much higher number of episodes, and that would take ages. I also omitted the power capsules, because when they are eaten by Pacman the ghosts' assumptions about their world change drastically, which would make the learning much more complicated. Frankly, if I had enough time, I would investigate this issue further.

If anyone is interested, I can put the results for this project on the web, together with all the details, but only after my exams (they start tomorrow).
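For a rough idea of the training setup, here is a minimal sketch of an episodic training loop with save/load persistence; the file name and the train_episode() function are hypothetical placeholders, not the actual project code.

```python
# Minimal sketch of an episodic training loop with persistence.
# train_episode() is a hypothetical placeholder, not the project's code.
import os
import pickle

SAVE_FILE = "ghost_qtable.pkl"  # learned values are saved so training need not be repeated

def load_qtable():
    """Load previously learned values, or start from scratch (a 'dumb' ghost)."""
    if os.path.exists(SAVE_FILE):
        with open(SAVE_FILE, "rb") as f:
            return pickle.load(f)
    return {}

def train(train_episode, num_episodes=1_000_000):
    """Run training episodes, checkpointing periodically (full training takes hours)."""
    qtable = load_qtable()
    for episode in range(1, num_episodes + 1):
        train_episode(qtable)            # one full game; updates qtable in place
        if episode % 10_000 == 0:
            with open(SAVE_FILE, "wb") as f:
                pickle.dump(qtable, f)
    return qtable
```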

pars    122
Oh, and I would like to thank everyone who answered my questions when I had problems.


kirkd    505
I for one would like to see your results. Of course I'm sure we'll all wait patiently while you concentrate on exams... are they up yet?... how about now?... now?...

-Kirk

Kylotan    9982
I'd be interested too, especially in the data representation (since I'd argue that 90% of the battle with creating any AI system is finding a decent representation of what you're modelling).

Mr Nonsense    126
quote:
Original post by Kylotan
(since I'd argue that 90% of the battle with creating any AI system is finding a decent representation of what you're modelling).



Nah...more like 99%


alexjc    457
I'm curious too. Two things:

1) Do you think the slow learning is due to the fact that Pacman is not a static MDP, depending on how you model it? How much of the dynamics of the game did your representation manage to "factor out"?

2) What algorithm are you actually using to compute the state values?

Alex



pars    122
alexjc,

I used the Sarsa algorithm.

I don't fully understand your first question. I suppose the slow learning is probably because of the high number of states in the game: it has a 19x19 maze, and if we don't count the walls there are around 60,000 possible states with Pacman and only one ghost present in the maze. Each state has 4 possible Q-values, so there are around 240,000 Q-values in total. For the learning to be effective each value needs to be updated a few times, so I suppose the ghost has to be trained for around a million episodes.
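For concreteness, a tabular Sarsa update for this kind of setup looks roughly like the sketch below; the environment interface (reset/step) and the parameter values are illustrative assumptions, not the actual project code.

```python
# Minimal tabular Sarsa sketch. The env object (with reset()/step()) and the
# parameter values are illustrative assumptions, not the actual project code.
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = defaultdict(float)  # (state, action) -> value; state = (ghost_pos, pacman_pos)

def choose_action(state):
    """Epsilon-greedy selection over the four moves."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def run_episode(env):
    """One training episode: on-policy Sarsa backup after every step."""
    state = env.reset()
    action = choose_action(state)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = choose_action(next_state)
        target = reward if done else reward + GAMMA * Q[(next_state, next_action)]
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state, action = next_state, next_action
```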

As for training all four ghosts together, well, I think you can guess how long that would take.

alexjc    457
pars,

For the algorithm, I'd suggest trying the Monte Carlo approach. TD(0) algorithms like Sarsa take a very long time to propagate the state values, as they use a backup of depth 1. With a Monte Carlo algorithm you can do arbitrarily deep backups and propagate the reward much more quickly. Pacman seems well suited to the episodic learning scheme anyway. As a bonus, you can give hints to the episode-generation policy to improve the learning.
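As a rough illustration of that episodic backup (placeholder names, not a specific library), every visited state-action pair is updated toward the full discounted return of the episode rather than a one-step target:

```python
# Every-visit Monte Carlo backup sketch: walk a finished episode backwards and
# update each (state, action) toward the full discounted return, so the reward
# for catching Pacman reaches every state visited on the way there.
from collections import defaultdict

GAMMA = 0.9
Q = defaultdict(float)
visit_counts = defaultdict(int)

def monte_carlo_update(episode):
    """episode: list of (state, action, reward) triples in the order they occurred."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + GAMMA * G                       # return from this step onwards
        visit_counts[(state, action)] += 1
        # Incremental average of the returns observed for this pair.
        Q[(state, action)] += (G - Q[(state, action)]) / visit_counts[(state, action)]
```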

As for the first question, I wasn't sure if you'd modelled ALL the states, but it seems you have. If you hadn't, I was wondering whether the problem was actually "solvable" using RL algorithms (i.e., would the transition probabilities be stationary?).


I count 130,321 states (19x19 positions for the ghost * 19x19 for Pacman) ignoring obstacles: more complex than I would have thought! That said, you don't really need the Q-values, as you know the neighbouring states. Since you have part of the world model (the state transitions), learning the state values V(s) is possible, and it would be at least 4x quicker.
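A sketch of what working with state values V(s) could look like, given that the legal moves and their successor states are known (legal_moves() is a hypothetical placeholder):

```python
# Learning and using state values V(s) instead of Q-values, exploiting the known
# state transitions. legal_moves(state) is a hypothetical placeholder returning
# a dict of {action: successor_state}.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
V = defaultdict(float)  # one value per state instead of four Q-values

def greedy_action(state, legal_moves):
    """Pick the move whose successor state has the highest learned value."""
    successors = legal_moves(state)
    return max(successors, key=lambda action: V[successors[action]])

def td0_update(state, reward, next_state, done):
    """TD(0) backup on state values."""
    target = reward if done else reward + GAMMA * V[next_state]
    V[state] += ALPHA * (target - V[state])
```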

As for training multiple ghosts, I think it would in fact be easier. The ghosts in my Pacman don't seem to pay attention to each other, so it's the same state space. In effect it would be like having multiple agents exploring the same state space, so it would learn more quickly too.


Anyway, sounds like great fun! Did you code the pacman yourself or use an existing codebase?



Timkin    864
I was going to write a nice email asking some interesting questions, but I see I've been beaten to the punch!

On the issue of training multiple ghosts, I agree with Alex on this one. In one iteration of your algorithm you actually have 4 instances (if you have 4 ghosts) of a state transition and reward. If the ghosts have no information about the state of the other ghosts, then training 4 ghosts is equivalent to training a single ghost and using the resulting policy for any new ghost added to the game.

If, however, each ghost has information about the state of the other ghosts, then you are in a better situation. However, you need to make sure that the reward offered to each ghost takes into account the value of that ghost driving Pacman toward another ghost. This is not a trivial thing to evaluate, because how do you correctly deduce which ghost caused Pacman to head in that direction? Reasonable approximations could be made, though. Once you have this, training the four ghosts together should yield an optimal policy that incorporates cooperation between the ghosts.
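Purely as an illustration of the kind of approximation meant here (not a recommendation; the reward values and maze_distance() are hypothetical placeholders), a ghost's reward could include a small shaping term for pushing Pacman closer to one of the other ghosts:

```python
# Hypothetical shaping term for cooperative ghosts: reward a ghost's move if it
# pushed Pacman closer to one of the other ghosts. maze_distance() is a
# placeholder for a shortest-path distance through the maze.
CAPTURE_REWARD = 100.0
STEP_PENALTY = -1.0
SHAPING_WEIGHT = 0.1

def ghost_reward(caught, pacman_before, pacman_after, other_ghost_positions, maze_distance):
    if caught:
        return CAPTURE_REWARD
    before = min(maze_distance(pacman_before, g) for g in other_ghost_positions)
    after = min(maze_distance(pacman_after, g) for g in other_ghost_positions)
    # Positive if Pacman ended up closer to some other ghost after this ghost's move.
    return STEP_PENALTY + SHAPING_WEIGHT * (before - after)
```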

Timkin

Guest Anonymous Poster
Couldn't the reward function just be a constant negative value for any state where Pacman is not caught? Then, to maximize reward in any given episode, the ghosts would want to catch Pacman as quickly as possible.
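In code form, that reward scheme is just the following (the numbers are arbitrary examples); maximizing the total reward of an episode then amounts to minimizing the number of steps before the capture:

```python
# Reward scheme described above: a constant penalty for every step Pacman is
# still free, and a positive reward on capture. The numbers are arbitrary examples.
def reward(pacman_caught):
    return 100.0 if pacman_caught else -1.0
```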

pars    122
Alex,

Yeah, I did have to code the whole GUI myself (unfortunately), which meant that I had far less time to spend on the AI.

Timkin    864
quote:
Original post by Anonymous Poster
Couldn't the reward function just be a constant negative value for any state where Pacman is not caught? Then, to maximize reward in any given episode, the ghosts would want to catch Pacman as quickly as possible.


Not really. What we're discussing here is a form of the credit assignment problem: how does one reward the actions that set the stage for the winning action? Since this is a Markov decision problem, the sequence of decisions taken is as important as the last decision taken, although obviously, due to the number of paths that might lead to the successful goal state, the further back one goes in the set of actions, the less important each move was.

But then this is not always true either, only generally true. What if there is an opportunity to force Pacman into a corner that has only one entry and exit point? The move that forces Pacman through that state would be considered more important than some moves within that enclosed area, because it created a scenario where it was easier to attain the goal (by fencing in the set of possible states Pacman could choose from).

In some approaches to determining the value of an action at the current time, the future is predicted either to the end of the game or to some finite horizon. For example, in tic-tac-toe you can easily evaluate all of the possible endings given a current move; the proportion of those that lead to a win gives the relative value of the current move. In chess it's much harder, since you cannot evaluate all possible endings (except for endgames, which are worked out offline), so you need to do something else. I'm not an expert on chess evaluation functions, although I know we have a few people around here who might like to chime in if more info is desired.

Choosing the action that maximises this local value approximation is a good strategy to follow. However, a better one is to choose the action that also minimises the opponent's expected value. This equates to reducing their access to winning states, by fencing them into a section of the state space with low success density.

Okay, I've rambled a bit too much now... there are some interesting (attempted) solutions to the credit assignment problem, and mathematically at least the optimal solution is possible... but it involves computing the Kolmogorov complexity of the state space, and that is not tractable.

Cheers,

Timkin

Guest Anonymous Poster
Isn't avoiding the credit assignment problem one of the benefits of reinforcement learning? If all states where Pacman is not caught are given a negative value and the state where Pacman is caught is given a positive value, then the reinforcement learning algorithm should be able to determine that a state that came before the capture, even though it had a negative reward value, can lead to the capture of Pacman and the positive reward that comes with it. States or actions that almost always lead to the capture of Pacman would therefore almost always lead to the positive reward, even though they themselves carry a negative reward, and so the algorithm could determine that these states have a high value.
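A tiny worked example of that propagation under discounting (illustrative numbers): even though every pre-capture step carries a negative reward, the discounted returns grow as the capture gets closer, so states that reliably lead to a capture end up with high values.

```python
# Discounted returns along one successful episode: -1 per step, +100 on capture.
# States nearer the capture get higher returns, so the 'negative' states that
# reliably lead to a capture still end up valued highly.
GAMMA = 0.9
rewards = [-1, -1, -1, -1, 100]   # five steps, capture on the last one

returns = []
G = 0.0
for r in reversed(rewards):
    G = r + GAMMA * G
    returns.append(G)
returns.reverse()
print(returns)  # roughly [62.2, 70.2, 79.1, 89.0, 100.0] - increasing toward the capture
```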

Trying to arbitrarily assign credit to states that the programmer thinks are good might result in faster learning, but then the agent isn't really learning as much, because the programmer has already given it a partial solution.

Timkin    864
quote:
Original post by Anonymous Poster
Isn't avoiding the credit assignment problem one of the benefits of reinforcement learning?


My apologies... you are correct. I'm rather sleep-deprived at the moment (actually, continuously), so I wasn't thinking very clearly when I wrote that yesterday and I was thinking of the wrong problem. I'm not convinced, though, that a constant negative value is appropriate in the situation of multiple ghosts, since the actions of one ghost - while not giving it reward - could result in the success of another ghost, even though that ghost's action wouldn't normally catch Pacman. I guess the solution to that would be to share the reward between all ghosts... but then that's not individual learning and cooperative behaviour, that's just a single ghost in 4 places at once.

I'm going to have to give this more thought...

Timkin

Guest Anonymous Poster
To get the ghosts to learn to cooperate, they'd need to know each other's positions, just as they know the position of Pacman. Their goal is to catch Pacman as fast as possible, so the only thing really relevant to any ghost's reward is that not having caught Pacman is bad and catching Pacman is good; they should all be rewarded the same way. Each ghost could keep its own internal representation of the world, independent of the representations built by the other ghosts, so that the ghosts can develop individual personalities that work together to catch Pacman.

But now the task to be learned is much more difficult, since each ghost has to learn to act based on the positions of the other ghosts as well. Getting the learning to generalize could also be very difficult, but it usually is, so that's nothing new. And there is the possibility that two ghosts learn to cooperate while the other two just wander around randomly. I think that would be an interesting result... social loafing in Pacman ghosts...
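A sketch of what that enlarged state might look like - each ghost would key its own table on a state that includes the other ghosts' positions as well as its own and Pacman's (names and coordinates are placeholders):

```python
# Hypothetical joint-state encoding for a cooperating ghost: its own position,
# Pacman's position, and the (sorted) positions of the other ghosts.
# Each ghost would keep its own Q-table keyed on states like these.
def make_state(own_pos, pacman_pos, other_ghost_positions):
    return (own_pos, pacman_pos, tuple(sorted(other_ghost_positions)))

# Example: a 19x19 maze position is just a (column, row) pair.
state = make_state((3, 5), (10, 7), [(1, 1), (17, 1), (17, 17)])
```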

Timkin    864
Actually, for true cooperation, each ghost would not only need to know the state of the other ghosts, but also the decisions made by the other ghosts in the current iteration... and you can see how this is a catch-22 scenario: I can't decide until you've decided, but you can't decide until I've decided. There are solutions, but they generally involve a deliberative negotiation between the agents...

Timkin
