#### Archived

This topic is now archived and is closed to further replies.

# URGENT: Need Help with Reinforcement Learning

## Recommended Posts

I am badly stuck, and if anyone can help, I'll be eternally grateful. I am working on a pacman game. I have almost finished the GUI (interface), and now I am working on the AI. I am using reinforcement learning for the ghosts: at the beginning the ghosts are dumb and just wander around the maze randomly, but as they are trained they become cleverer and learn how to catch pacman.

I am using the SARSA algorithm for learning, which states:

Q(s, a) <-- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]

where Q(s, a) is the state-action value, Q(s', a') is the new state-action value, alpha is the learning rate, gamma is the discount rate, and r is the reward received.

My questions are (let's pretend we are using a very simple 4x4 maze):

a) What would the values of Q(s, a) and Q(s', a') be at the very beginning, when the game starts?

b) The first time that pacman is caught, how would the values in this formula change?

Thank you in advance

[edited by - pars on March 5, 2003 3:15:19 PM]
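A minimal Python sketch of that update rule, in case it helps make the formula concrete. The alpha and gamma values are placeholder assumptions, and keying the Q-table by hypothetical (state, action) tuples is just one possible choice:

```python
from collections import defaultdict

# Hypothetical parameters -- tune these for your own game.
ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount rate

# Q-table: maps (state, action) pairs to values, defaulting to 0.0.
Q = defaultdict(float)

def sarsa_update(s, a, r, s_next, a_next):
    """One SARSA step: Q(s,a) <- Q(s,a) + alpha*[r + gamma*Q(s',a') - Q(s,a)]."""
    td_target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])
```

After a ghost takes action a in state s, observes reward r and new state s', and picks its next action a', you would call `sarsa_update(s, a, r, s_prime, a_prime)`.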

##### Share on other sites
quote:
Original post by pars
My questions are (and lets just pretend we are using a very simple 4x4 maze):

a) What would the value of Q(s, a) and Q(s', a') be at the very beginning when the game starts?

You should initialise your value functions, Q(s,a), to arbitrary values for all s and a.

quote:
Original post by pars
b) The first time that pacman is caught, how would the values in this formula would change?

You need to offer a reward at each iteration of the algorithm, even if it is zero or negative. This will ensure that inappropriate actions are forgotten and good actions are promoted.
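As a hedged illustration of offering a reward at every iteration, a per-step reward function for a ghost might look like the following; the reward magnitudes and the position arguments are assumptions, not anything from the original post:

```python
def reward(ghost_pos, pacman_pos):
    """Hypothetical per-step reward for a ghost: a large bonus for a catch,
    and a small negative step cost so that aimless wandering is discouraged."""
    if ghost_pos == pacman_pos:
        return 100.0   # ghost caught pacman
    return -1.0        # ordinary, uneventful step
```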

As to how the values change, well, that seems quite obvious from the formula! Just plug in the values, which should all be known to you, and apply the result as the new value for the action Q(si, aj) (where action aj was taken in state si at this iteration of the algorithm).

Just a point though... I'm not convinced that you are going to get good results using this algorithm, because at each iteration the goal state is moving, and hence the value functions are changing relative to the states. You would need a LOT of training to ensure that you had sufficient trials for each possible goal state.

If you do get good results, please share them with us as I would be very interested to view the learning statistics.

Cheers,

Timkin

##### Share on other sites

quote:
Original post by Timkin
You should initialise your value functions, Q(s,a), to arbitrary values for all s and a.

Does this also apply to Q(s', a')? At the very beginning, what is this value initialised to? Zero? Also, the first time pacman is caught, how does this value change? (I hope what I just said makes sense.)

quote:
Original post by Timkin
If you do get good results, please share them with us as I would be very interested to view the learning statistics.

Sure, I will let you know the results, hopefully in 3 or 4 weeks' time. The thing is, I have seen SARSA work before on a cat-and-mouse game. I am just hoping that everything will work out smoothly.

Thanks again

[edited by - pars on March 5, 2003 7:11:13 PM]

##### Share on other sites
quote:
Original post by pars
Does this also apply to Q(s', a')? At the very beginning, what is this value initialised to? Zero? Also, the first time pacman is caught, how does this value change? (I hope what I just said makes sense.)

No offense intended, but your response leads me to question whether you understand what Q() is! In case you are uncertain, Q() is a scalar function of two variables: the possible states of the domain and the possible actions of the agent (in this case a ghost, since the ghosts are the learners). The value of Q for any given action a and state s is a number that represents the value of executing action a in state s.

Since this is a learning scenario, we obviously don't know these values a priori (beforehand). Within the SARSA algorithm, the agent actually performs two actions. The first, a, is executed in state s. This causes the agent to arrive in state s' and receive reward r. The agent then performs action a' from state s'. Both the states s and s' have associated with them a value, given by Q(s,a) and Q(s',a'). At the beginning of the algorithm, these values are arbitrary. As the agent moves around the environment, these values will either be reinforced or will decay.

Just another point (to ensure clarity). Q() is used to select the action a in state s (or a' in state s') according to some policy, p. So, given a starting state, a reward function and a policy, the agent can learn a value function for state-action pairs which enables it to maximise its reward relative to p.
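One common choice for such a policy is epsilon-greedy: usually take the action with the highest Q-value, but occasionally explore at random. A minimal sketch, assuming the Q-table is a dict keyed by (state, action) pairs and EPSILON is a tunable assumption:

```python
import random

EPSILON = 0.1  # exploration probability -- an assumed value, tune as needed

def choose_action(Q, state, actions):
    """Epsilon-greedy policy: with probability EPSILON pick a random action,
    otherwise pick the action with the highest Q(state, action)."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```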

Cheers,

Timkin

##### Share on other sites
quote:
Original post by Timkin
No offense intended
Timkin

None taken. I understand what SARSA is and what each element in the algorithm means. I have found a lot of sources that explain exactly the same thing, but they fail to provide any practical example. It's not that easy (at least for me) to use algorithms like this unless I see an example of how they are used.
But anyway thanks for the reply. Just one more question:

What do you think is the best (and simplest) way to represent each state in the game? I mean, let's say for a simple maze like this:

+--------+--------+--------+
| ghost  | empty  | empty  |
+--------+--------+--------+
| empty  | wall   | empty  |
+--------+--------+--------+
| empty  | empty  | pacman |
+--------+--------+--------+

Thanks

[edited by - pars on March 8, 2003 1:15:46 PM]

[edited by - pars on March 8, 2003 1:17:13 PM]

[edited by - pars on March 8, 2003 1:17:41 PM]

##### Share on other sites
I mean, should a string array, for example, be used to represent the state s in the code?

For instance in the above example something like this: [ghost,_,_,_,wall,_,_,_,pacman]

thanx


##### Share on other sites
That's not such an easy question to answer, since there are quite a lot of representations that would work. The most obvious that I can think of off the top of my head would be to store the level as an array, where each cell of the array represents a tile of the board and the contents of the cell is a structure defining the contents of the tile (ghost, pacman, wall, empty). Presumably you should permit ghost and pacman to occupy the same state so that you can easily check for a kill condition (any state in which they occupy the same tile)!
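A minimal sketch of this array representation for the 3x3 maze above; the tile symbols and the kill-check helper are arbitrary assumptions:

```python
# Tile contents as simple symbols (any enum-like encoding would do).
EMPTY, WALL, GHOST, PACMAN = ".", "#", "G", "P"

# The 3x3 example maze, one list per row.
maze = [
    [GHOST, EMPTY, EMPTY],
    [EMPTY, WALL,  EMPTY],
    [EMPTY, EMPTY, PACMAN],
]

def is_kill(ghost_pos, pacman_pos):
    """Kill condition: ghost and pacman occupy the same tile."""
    return ghost_pos == pacman_pos
```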

An alternative would be to use a structure that includes four pointers and a cell ID to represent a tile. Each pointer points to the structure of a neighbouring cell. If there is no neighbouring cell in that direction (because of a wall) then set the pointer to null. That way you only define the cells in the game that the pacman and ghosts can enter. A state of the game would be the union of this set of structures and the known states of the agents in the game.
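A rough sketch of this pointer-based representation; the `Cell` name and the four direction links are assumptions, not anything prescribed above:

```python
class Cell:
    """A walkable tile, linked to its reachable neighbours.
    A link is None where a wall (or the board edge) blocks movement."""
    def __init__(self, cell_id):
        self.id = cell_id
        self.north = self.south = self.east = self.west = None

# Two adjacent cells, open to each other east/west:
a, b = Cell(0), Cell(1)
a.east, b.west = b, a
```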

A third alternative - and one that you might later reuse in other games - is to design a quad tree structure where the leaves of the tree are cells of the board that can be entered. If the game board doesn't change shape then, as above, the only thing that differentiates states of the game are the states of the agents in the game.

This would mean that you could represent a game state by the cell ID and facing of the pacman and the cell IDs and facings of each of the ghosts. You would have to allow for an indicator that identified dead ghosts so that you could maintain a constant-length state vector.
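A hedged sketch of such a fixed-length state vector; the `DEAD` sentinel value and the `encode_state` helper are hypothetical names, not part of the post above:

```python
# Sentinel cell ID for dead ghosts, so the state vector keeps a constant length.
DEAD = -1

def encode_state(pacman_cell, pacman_facing, ghosts):
    """Build a fixed-length game state.
    ghosts: list of (cell_id, facing) pairs; a dead ghost is (DEAD, DEAD).
    Returns a tuple, which is hashable and so can key a Q-table."""
    state = [pacman_cell, pacman_facing]
    for cell_id, facing in ghosts:
        state.extend([cell_id, facing])
    return tuple(state)
```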

I hope this helps.

Cheers,

Timkin
