DeepButi

Reinforcement learning



I have a bot playing a local trick-taking card game (see here -> Botifarra for the exact rules, or play it on-line with humans and/or bots here). Once a game has been played, I can examine it as a perfect-information zero-sum game with a relatively small (3.3E16) game space. So I can a) evaluate each card played (against the optimum move), b) find which one would have been the optimum move, and c) evaluate all the actual strategies.

In other words, once all cards have been played, and assuming perfect opponents and a perfect partner, I can say for every card the bot played:
a) the bot played card A, which puts it on a branch n points below the optimum;
b) the best move would have been card B (not necessarily unique);
c) of the actual programmed policies, P (not necessarily unique) would have selected card B.

I'm now planning to add a learning module using Reinforcement Learning techniques to make the bot better. The first step would be ordering the policies according to the values obtained in c) for several typified classes of games. What I'm looking for is: how do I create new policies? How do I explore? As of now, the policies are hard-coded; they stand for good players' best practices. Even if I can combine the existing policies better, I will never find better ways to play.

I can find lots of AI research on learning, but I've not been able to find literature about how to define the policies (probably because it's very game-dependent). Also, game forums are mainly devoted to graphics, (MMO)RPGs and the like, not to card/classic board games.

Any help, links, ... would be greatly appreciated.
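A minimal sketch of the bookkeeping described in c), assuming hypothetical names (solver.best_value, solver.value_after, policy.select) for the perfect-information solver and the hard-coded policies:

```python
from collections import defaultdict

# Hypothetical interfaces, named only for illustration: solver.best_value(state)
# is the perfect-information optimum for the side to move, solver.value_after(state, card)
# the optimum still reachable after playing that card, and each policy
# exposes select(state) -> card.

def rank_policies(played_games, solver, policies):
    """Order the hard-coded policies by how little they lose vs. the optimum."""
    points_lost = defaultdict(float)   # policy name -> total points below optimum
    found_best = defaultdict(int)      # policy name -> decisions where it was optimal
    decisions = 0

    for game in played_games:                 # games already played, all cards known
        for state in game.bot_decisions():    # every state where the bot had to play
            best = solver.best_value(state)
            for policy in policies:
                card = policy.select(state)
                value = solver.value_after(state, card)
                points_lost[policy.name] += best - value
                if value == best:
                    found_best[policy.name] += 1
            decisions += 1

    # Rank by average points lost per decision; found_best gives the companion
    # statistic "how often would this policy have picked an optimal card".
    return sorted(policies, key=lambda p: points_lost[p.name] / max(decisions, 1))
```

Ranking by average points lost per decision is only one reasonable ordering; the found_best counts give the alternative "how often was this policy optimal" view.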

My only suggestion would be to look into value iteration reinforcement learning instead of policy iteration reinforcement learning. In value iteration, your policy is typically picking the action with the highest utility value, then updating your utility values based on experience. As your utility estimation becomes more accurate, your policy approaches optimal.

One additional caveat is that you need to pick a random action every now and then to make sure your whole action space is explored adequately. This is called an epsilon-greedy policy.
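A minimal tabular sketch of that combination (per state-action utility values updated from experience, plus an epsilon-greedy choice); the state encoding, reward signal and constants are assumptions, not anything specified above:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # learning rate, discount, exploration rate

Q = defaultdict(float)                    # (state, action) -> utility estimate

def choose(state, legal_actions):
    """Epsilon-greedy: usually the highest-utility action, sometimes a random one."""
    if random.random() < EPSILON:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, next_legal_actions):
    """One-step update toward the reward plus the discounted best next value."""
    best_next = max((Q[(next_state, a)] for a in next_legal_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```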

OK, you generate a giant tree with all possible moves, and each node in it has a uv (utility value).

At the end of the game, start with the end value (the spot on the tree where you ended up at the end of the game); you then modulate that (i.e. make it lower) according to how good the final state is.

You then go up the tree, diluting how much you change each value based on how far it is from the end (i.e. the end is n generations away), all the way to the top.

When making a decision, use the uv of each choice to affect the probability of that choice being chosen. This will eventually cause all the bad choices' probabilities to fall through the floor (so they are never picked again), and all the good choices' probabilities to go through the roof (so they are picked very often).
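A rough sketch of that scheme; the decay factor, the tree node representation and the "probability proportional to uv" selection rule are all assumptions about details the description leaves open:

```python
import random

DECAY = 0.9   # assumption: how much the end-of-game value is diluted per generation

def backup(path_nodes, final_score):
    """Walk from the leaf back to the root, nudging each node's uv by the
    final score, diluted the further the node is from the end of the game.
    Nodes are assumed to carry .uv and .children attributes."""
    weight = 1.0
    for node in reversed(path_nodes):          # leaf first, root last
        node.uv += weight * final_score
        weight *= DECAY

def pick_move(node):
    """Pick a child with probability proportional to its uv, so bad choices
    eventually stop being picked and good ones dominate."""
    children = node.children
    floor = min(c.uv for c in children)
    weights = [c.uv - floor + 1e-6 for c in children]   # keep weights positive
    return random.choices(children, weights=weights, k=1)[0]
```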

From,
Nice coder

Quote:
Original post by cwhite
My only suggestion would be to look into value iteration reinforcement learning instead of policy iteration reinforcement learning. In value iteration, your policy is typically picking the action with the highest utility value, then updating your utility values based on experience. As your utility estimation becomes more accurate, your policy approaches optimal.

That's currently my approach for the existing policies. Since I can evaluate a policy's exact value for a played game, with a large number of games I have a good estimated utility, so I can pick the probably most valuable one at any given time.
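A minimal sketch of that selection step; the situation key, the utility table and the policy objects are placeholders for whatever the off-line evaluation actually produces:

```python
from collections import defaultdict

# utility[situation][policy_name] -> estimated value accumulated from the off-line
# evaluation of many played games; "situation" stands for whatever bucket is used
# (who trumps, trick number, ...). All names here are placeholders.
utility = defaultdict(dict)
counts = defaultdict(lambda: defaultdict(int))

def record(situation, policy_name, value):
    """Fold one evaluated game into the running average for that policy."""
    n = counts[situation][policy_name] + 1
    old = utility[situation].get(policy_name, 0.0)
    utility[situation][policy_name] = old + (value - old) / n
    counts[situation][policy_name] = n

def best_policy(situation, policies):
    """Pick the policy with the highest estimated utility for this situation."""
    return max(policies, key=lambda p: utility[situation].get(p.name, 0.0))
```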

Quote:
Original post by cwhite
One additional caveat is that you need to pick a random action every now and then to make sure your whole action space is explored adequately. This is called an epsilon-greedy policy.

Here I come into trouble. A random action is worthless unless I can categorize it and convert it into a policy. Playing a random card means nothing in terms of learning; I would have to use a random policy to select a card, but I don't know how to define policies.
It's not bridge, but let's use it as an example. In bridge some policies would be "try an impasse", "try an expasse", "play trump to clean out the opponents", "play a card of a suit in which my partner holds the strongest one" ... and so on.
Once these policies are already evaluated (the previous step) ... how do I try a "random policy"? How do I define new policies? If I have to hard-code them, then I can already evaluate them with the previous step, so random pick-ups will be useless. Take into account that I'm evaluating all policies, not only the ones actually played. The evaluation step is done in an off-line process specifically designed for learning, not during actual play.
(There is a collateral problem here: the time to evaluate the whole tree is sometimes too long. My alpha-beta usually performs reasonably well, but it is still too slow.)

Here is where I'm stuck: how do I create new policies without hard-coding them?

Nice Coder,
if I understand it correctly, with your approach I have exactly the same problem: only existing policies would be evaluated.

Thanks for all your answers.

What I was saying was: FOR EACH POSSIBLE COMBINATION. (Sorry about that, I'm just a little tired, see?) It would eventually make the best choices more often than average (and usually nearly exclusively, after long enough).

There are no policies; all it is is states and moves (and all possible ones at that).

From,
Nice coder

Nice Coder,
I must admit I cannot understand it. Perhaps I didn't explain the situation properly.

4 players; 48 cards (a full Spanish deck) are dealt. In real play each player only sees their own cards, but in the learning module all cards are known, and I apply NegaScout with all sorts of pruning, ordering, hash tables etc. to analyze the given hand. The game space of this complete-information, zero-sum game is approx. 3.3e16, so I usually get a good response time.
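For reference, the bare skeleton of a NegaScout (principal variation) search; generate_moves, apply and evaluate are placeholders for the real engine, and the four-seat, two-team structure of the game is glossed over here:

```python
def negascout(state, depth, alpha, beta):
    """Principal-variation search (NegaScout): full window for the first move,
    null-window probes for the rest, re-searching only on a fail-high.
    evaluate() is assumed to score the position from the side to move
    (negamax convention)."""
    if depth == 0 or state.is_terminal():
        return evaluate(state)
    first = True
    for move in generate_moves(state):            # good move ordering is what
        child = apply(state, move)                # makes the null windows pay off
        if first:
            score = -negascout(child, depth - 1, -beta, -alpha)
            first = False
        else:
            score = -negascout(child, depth - 1, -alpha - 1, -alpha)  # null window
            if alpha < score < beta:
                score = -negascout(child, depth - 1, -beta, -score)   # re-search
        alpha = max(alpha, score)
        if alpha >= beta:
            break                                 # beta cutoff
    return alpha
```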
For each time the robot would have played, I analyze all existing policies (max value before and after the play), so I have the real utility value of each policy for the GIVEN HAND.

I repeat the process with a lot of hands and end up with a good evaluation of each policy for different situations (who trumps, trick number, and others).

But obviously I cannot analyze every possible hand (48! different hands, each with a 3.3e16 game space, is well beyond any reachable analysis).

When analyzing a given hand, there is no point in knowing that playing card "A" is a good move, because that means nothing for a different hand. In other words, "states" and "moves" cannot be considered from the point of view of a single hand (or at least, I cannot see how to do it). I must evaluate "what led the robot to play card A" (a policy), so that in a different hand, if the conditions are met, this policy can be used according to the utility values from the previous analysis.

But if I create a new set of fixed conditions, I will be hard-coding them again, and I was looking for some general approach.

Mmm ... I'm really confused :( ... but anyway, thanks for your answer. I'm reading it again and again trying to understand ;).

Sutton, Barto & Nice Coder are my dream partners now :) :)

OK... I'm not sure what to make of dream partners...

Perhaps you could use RL per hand?

Or perhaps RL isn't the correct thing to use for this situation.

Here is how I would have done it (if you don't want to hear this, just skip to the end):

1. Calculate the probability of another hand being better than yours. The formula:

You have 48 cards.
You hold 5 of them, so you can take them out.
You now have 43 cards.

The probability of your opponent (always assume just one, because you don't know any more cards) holding a particular hand is 1/43 * 1/42 * 1/41 * 1/40 * 1/39, or 8.6571272050568696e-009.

Now, you precalculate how good each hand is (with your eval function).

Then you work out which hands beat each other hand (so e.g. this hand beats that hand, etc.).

Once you've got that, you calculate the probability of a hand beating your hand as
X/M
where X is the number of beating hands and M is the total number of hands.

Once you've got that, you multiply it by the constant above (the one that's about 8e-9), then multiply by the number of players (4 in this case), and you've got an approximation of how probable it is that a hand will beat your hand.

With that, you can make bets, or do other card-related things that exist in the game.
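The arithmetic above, transcribed literally (it keeps the 5-card-hand and ordered-draw assumptions exactly as written):

```python
def chance_of_being_beaten(beating_hands, total_hands, players=4):
    """Literal transcription of the recipe above: the per-hand draw constant,
    the X/M fraction of hands that beat yours, scaled by the number of players."""
    draw_constant = 1.0
    for cards_left in (43, 42, 41, 40, 39):
        draw_constant /= cards_left               # 1/43 * ... * 1/39 ~= 8.657e-9
    beat_fraction = beating_hands / total_hands   # X / M from the post
    return beat_fraction * draw_constant * players
```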

(The website has little information: how exactly do you win? And what do you win when you win?)

HTH
From,
Nice coder

Nice Coder,
it's a trick-taking game, similar to bridge (rules here). A hand is 12 cards per player. Comparing hands, betting and so on have no meaning here; what matters is which card you play and in which order, so as to maximize the points you take. With a hand worse than your opponents', your goal would be to lose as few points as possible.

Building all possible hands is not feasible (36! is 3.7e41 ... out of any consideration). As play goes on, the probability of each player holding each card can be estimated and expectiminimax/Monte Carlo methods used, but the numbers are still too huge (or my minimax is too bad :( ).

... or ... alternatively, my English is not good enough and I don't get the point.
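One common way to make the Monte Carlo part concrete is to sample hidden deals consistent with the play so far and run the perfect-information search on each sample; a rejection-sampling sketch, with can_hold standing in for whatever card-location constraints have been inferred (a general technique, not necessarily what is being done here):

```python
import random

def sample_deal(unseen_cards, hand_sizes, can_hold):
    """Rejection-sample one deal of the unseen cards into the hidden hands,
    keeping only deals consistent with what play has revealed so far
    (e.g. a player who failed to follow suit cannot hold that suit).
    can_hold(player, card) is a placeholder for those constraints."""
    players = list(hand_sizes)
    while True:
        cards = list(unseen_cards)
        random.shuffle(cards)
        hands, start, ok = {}, 0, True
        for p in players:
            hands[p] = cards[start:start + hand_sizes[p]]
            start += hand_sizes[p]
            if not all(can_hold(p, c) for c in hands[p]):
                ok = False
                break
        if ok:
            return hands
```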

Anyway, thanks a lot for your interest
