Jump to content
  • Advertisement
Sign in to follow this  
cossie

Question on states / state values in Q-Learning (and RL in general)

This topic is 3579 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hi, I'm trying to get to grips with the basics of RL, I've been doing my best to get through Sutton and Barto's "Reinforcement Learning: An Introduction" online book. But I'm having a bit of difficulty with states. One of the examples they go into is Tic-Tac-Toe it's more a comparison of RL vs Search vs Evolutionary Algorithms but I've been trying to work through that. Are all the states at the start of the game "empty" states because they are neither an 'X' nor an 'O'? And a state could be consider a position in the board? And each state has a Q-value for the possible actions that could be taken from that state right? Does that mean in a game like Tic-Tac-Toe a possible Q-value table would be something like:
State | Possible Action 
 (0,0)| Place X / Leave Blank 
 (1,1)| Place X / Leave Blank
 (1,2)| Place X / Leave Blank
 (1,3)| ...
  ... | ...
or is it:
State | Possible Action 
 (0,0)| (1,1), (1,2), (1,3), (2,1), (2,2), ...
 (1,1)| (1,2), (1,3), (2,1) ...
 (1,2)| (1,1), (1,3), (2,1) ...
 (1,3)| ......basically all the empty squares but the current one
And so, (I apologise if I make a mess of my attempt at an explanation, I try hard but I'm not that smart!) if 'X' plays first, it would be in the state (0,0) (an initial state, i.e. not being on the board) and then selecting an action ("place an X in (1,1)), which leads to:
  1  2  3
1 _X|__|__ 
2 __|__|__ 
3   |  | 
Would it be correct to say that the new state (s') for X is now (1,1)? And because they haven't lost or won the reward = zero? So when updating the Q-value The next move is then made by 'O:
   1  2  3
1 _X|__|__ 
2 __|_0|__ 
3   |  | 
Do the states for 'O' contain information about 'X' positions or is that sort of thing taken care of because it will either make an illegal move in the game (trying to go place an 'O' in an occupied 'X' square) or it will have lost the game and through the reward value (-1) will be probabilisically less likely to chose a losing move in future games? I'd be really grateful for some advice on this and I'd also like to thank anyone who manages to reads this humungous post!

Share this post


Link to post
Share on other sites
Advertisement
The state is the entire state of the board, not the state of an individual square. There is only one starting state, which is the empty board. There are 9 valid moves from this state.

Share this post


Link to post
Share on other sites
Thanks for your reply Vorpy, I can see I had a hole in my fundamental understanding. And it's actually raised another question, but this is about the number of possible Q values that would be needed.

I understand a Q value to be a value associated with a state-action pairing, but I'm unsure as to how to calculate the number of values you'd need for a game of tic-tac-toe.

I've never been great with permutations, so I'd appreciate a bit more help if someone's kind enough to help me?

Thanks,

Cossie

Share this post


Link to post
Share on other sites
Each square has 3 possible values and there are nine squares. There are 3^9 or 19683 permutations. Many of these are not reachable in an actual game (for example, boards where both X and O have 3 in a row).

Since the number isn't that high, you could probably just make a table of that size and you'll just have entries that are never visited. Or use a hash table/map/associative container of your choice.

You can encode the entire state of a tic tac toe board into a 16 bit integer. A simple way to do this is to assign empty, X, and O the values 0, 1, and 2 (in any order you'd like), then set an accumulator to 0. For each space in the board, multiply the accumulated value by 3 and add the value of that space. Or even encode the entire board into a 32 bit integer by simply using 2 independent bits to store the state of each square.

There are far fewer possible states if you account for symmetry. With symmetry, there are only 3 states after the first move (corner, side, and center). But if the goal here is to find weaknesses in an imperfect player, then you wouldn't want to exploit symmetry because the imperfect player may have asymmetric weaknesses.

Share this post


Link to post
Share on other sites
Thanks a million Vorpy. I had never considered the fact there may be states that are not reachable.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!