How to account for a position's history in transposition tables


I'm currently developing a solver for a trick-taking card game called Skat, in a perfect-information setting. Although most people may not know the game, please bear with me; my problem is of a general nature.

Short introduction to Skat:
Basically, the players take turns playing one card each, and three cards form a trick. Every card has a specific value. The score a player has achieved is the sum of the values of all cards contained in the tricks that player has won. I left out things that are unimportant for my problem, e.g. who plays against whom or how the winner of a trick is determined.
What we should keep in mind is that there is a running score, and that when investigating a certain position, who played what beforehand (i.e. its history) determines that score.

So that everyone can better imagine how the score develops throughout a Skat game until all cards have been played, here's an example: the game is played out one trick at a time, and after each trick the running score is updated, where +X is the declarer's score and -Y is the defending team's score (the latter is irrelevant for alpha-beta). As I said, the winner of a trick (declarer or defending team) adds the value of every card in that trick to their score.

The card values are:


Card  J  A   10  K  Q  9  8  7
Value 2  11  10  4  3  0  0  0

If you nonetheless have questions about Skat or its rules, I'd be happy to elaborate.

I have written an alpha-beta algorithm in Java which seems to work fine, but it's way too slow. The enhancement that seems most promising is a transposition table (TT): I read that when searching the tree of a Skat game, one encounters a lot of positions that have already been investigated.

In theory, I figured, when I find a position that has already been investigated, I need to substitute the "old" position's running score with the score of the current position. In other words: when I store a position in the TT, I take the optimal value that alpha-beta returns, subtract the node's running score from it, and associate the difference with the position. To my understanding, this difference represents the maximum points that the declarer can achieve in the subtree rooted at the current position. Analogously, when I find a previously investigated position, I return the sum of the previously stored value and the current position's running score.

Enough with the theory, here's how I implemented it in Java:

The storing routine of a node:


// "bestVal" is alpha or beta, depending on the player
// "tranpo" is a HashMap<Integer, int[]>
int val = bestVal - node.getScore();
int hash = logic.hashNode(node);
// "TT_VALID" is a flag constant
transpo.put(hash, new int[]{TT_VALID, val});

And the lookup:


int hash = logic.hashNode(node);
int[] sameState = transpo.get(hash);
if (sameState != null) {
   if (sameState[0] == TT_VALID) {
      // Re-attach the current history to the stored subtree value
      return node.getScore() + sameState[1];
   }
}

I'm only storing "VALID" nodes for now. Those are the nodes that haven't been pruned by alpha/beta.

The problem is that it doesn't return the same (correct) results that the normal alpha-beta algorithm returns. And the moves it suggests are obviously suboptimal, though not necessarily "absurd".

I've been debugging this for a while, without success. The hashing function seems to work correctly (I tested it by comparing allegedly identical nodes via their verbal representations). More importantly, I don't suspect any error in the alpha-beta algorithm itself, since it works perfectly without the TT. If there's a reason my assumptions could be wrong, please let me know, and I'll post more code.

Given that my assumptions are correct, the problem has to be in the storing or lookup routine. What am I missing? Is my whole approach logically flawed? Or is it something technical I've overlooked?

Since I don't expect a magical solution to appear, I'm mainly asking how to debug this. The algorithm makes millions of recursive calls, so I can't just step through it line by line and sit there with my calculator until I find the mysterious error. Where should I start debugging? Maybe some kind of back-to-back test against the working alpha-beta function? And what inconsistencies could I look for?
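To illustrate the kind of back-to-back test I have in mind, here is a minimal sketch; Deal.random, Solver.solvePlain and Solver.solveWithTT are placeholder names, not actual code from my project:

import java.util.Random;

// Differential test: the TT version must return exactly the same value as
// the plain alpha-beta version for every deal. On the first mismatch, the
// seed is printed so the failing deal can be replayed in a debugger.
Random rng = new Random(42);               // fixed seed: reproducible runs
for (int i = 0; i < 1000; i++) {
    long seed = rng.nextLong();
    Node root = Deal.random(seed);         // placeholder deal generator
    int plain = Solver.solvePlain(root);   // plain alpha-beta
    int withTT = Solver.solveWithTT(root); // alpha-beta with the TT
    if (plain != withTT) {
        System.out.printf("Mismatch for seed %d: plain=%d, TT=%d%n",
                          seed, plain, withTT);
        break;                             // replay this seed in the debugger
    }
}

Running this on endgames with only a few cards left would keep failing cases small enough to verify by hand.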

References:

There are almost no English sources out there that deal with AI for Skat, but I found this one: "A Skat Player Based on Monte Carlo Simulation" by Kupferschmid and Helmert. Unfortunately, the whole paper, and especially its treatment of transposition tables, is rather compact.

Again, if there's something you would like to have clarified, don't hesitate to ask!

I have played Skat before, actually. I don't see what part of Skat is a full-information game suitable for alpha-beta search, though...

Regarding transposition tables: they are very tricky to get right. Depending on whether the result of alpha-beta is alpha or below, beta or above, or between alpha and beta, the result is an upper bound, a lower bound, or the exact minimax score of the position, respectively. The transposition table should store which case we found, and when we get a hash hit that is not an exact score, we should only update alpha or beta (whichever is appropriate). Also, we should store the depth to which the node was searched, and only use the score if the depth from the hash table is sufficient.
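To make the three cases concrete, here is a minimal sketch of what such an entry and a probe might look like in your Java setting (the names Flag, Entry, Window and depthLeft are mine, purely for illustration):

enum Flag { EXACT, LOWER_BOUND, UPPER_BOUND }

// One transposition-table entry: which kind of bound we stored, the value
// itself, and how much depth was left below the node when it was searched.
class Entry {
    Flag flag;
    int value;
    int depth;
}

class Window {
    int alpha, beta;

    // Returns a value to cut off with, or null to continue searching.
    Integer probe(Entry entry, int depthLeft) {
        if (entry == null || entry.depth < depthLeft)
            return null;                // not stored, or not searched deeply enough
        if (entry.flag == Flag.EXACT)
            return entry.value;
        if (entry.flag == Flag.LOWER_BOUND)
            alpha = Math.max(alpha, entry.value);
        if (entry.flag == Flag.UPPER_BOUND)
            beta = Math.min(beta, entry.value);
        return (alpha >= beta) ? alpha : null;  // the bounds alone decide the node
    }
}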

I am sure the chess programming wiki has detailed information about all of this.

I don't see what part of Skat is a full-information game suitable for alpha-beta search, though...

Of course, a player can usually only see his own cards. I'm writing the program because I want to find the optimal strategy for a given deal. This comes in handy, for instance, for a post-mortem analysis of a game played earlier. Naturally, this isn't applicable to a real card-playing engine, where a player knows only his own cards. But that's not my goal anyway.


[...] the result is an upper bound, a lower bound, or the exact minimax score of the position [...]

Yes, I left out certain parts of my code. I do store the actual value of the subgame together with a flag. In my code it's TT_VALID when the value is between alpha and beta. When a node gets pruned, the flag is TT_LBOUND or TT_UBOUND, respectively. Right now I'm only storing valid nodes, but even that yields false results. Or are you suggesting that not working with bounds yet could be the cause of those results?


Also, we should store the depth to which the node was searched [...]

Actually, the depth is implicitly represented in the hash itself: I use a 32-bit integer in which 30 bits represent the remaining cards (30 cards are in play, 2 are in the Skat). The other 2 bits represent the player to move. This means that if two hashes are equal, every hand is the same; ergo, the depth must be the same, too. The way I see it, it only makes sense to hash those nodes that conclude a trick (whenever three cards are on the table), because that's the earliest situation in which two nodes with different histories can represent the same position. See the sketch below for the bit layout I have in mind.
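For illustration, the packing could look like this (assuming each of the 30 in-play cards has a fixed index 0..29 and the player to move is encoded as 0, 1 or 2; these conventions are just an example, not my actual code):

class Hashing {
    // Pack a position into a 32-bit int:
    // bits 0..29  - one bit per remaining card (1 = not yet played)
    // bits 30..31 - player to move (0, 1 or 2)
    static int hashNode(boolean[] cardRemaining, int playerToMove) {
        int hash = 0;
        for (int i = 0; i < 30; i++) {
            if (cardRemaining[i]) {
                hash |= 1 << i;         // set the bit for card index i
            }
        }
        hash |= playerToMove << 30;     // top two bits: whose turn it is
        return hash;
    }
}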

You basically explained the concepts behind transposition tables. Does this mean you suspect the problem is in the implementation?

The "depth" I am referring to here is the "depth left". Presumably you are not examining the whole tree exhaustively, but you have a depth limit and use a heuristic evaluation function at the leaves. If you are doing that, you might be using other common refinements, like extensions for forced moves, reductions for moves that show little promise, iterative deepening... All of these things can result in the same position being examined with different "depth left" at different times.

A partial implementation of transposition tables that only stores exact scores shouldn't break anything, but you have to be careful to store only at the right times: that is, when at least one of the successors has yielded a score better than alpha and none of them has yielded a score of beta or better. A sketch of that condition follows below.
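In code, that storing condition might look like this (a sketch using your own names from above; originalAlpha, meaning alpha as it was on entry to the node, is mine):

// After the move loop. "bestVal" is the best score found among the
// successors; "originalAlpha" is alpha as it was when we entered the node.
boolean failedLow  = bestVal <= originalAlpha; // all moves <= alpha: upper bound only
boolean failedHigh = bestVal >= beta;          // cutoff: lower bound only
if (!failedLow && !failedHigh) {
    // Only now is bestVal an exact minimax score and safe to store as TT_VALID.
    transpo.put(hash, new int[]{TT_VALID, bestVal - node.getScore()});
}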

Yes, I suspect there is a problem with the implementation. As I said earlier, it is very tricky to get transposition tables right, even when you fully understand the theory behind them.

I haven't considered any improvements other than a transposition table, and I'm not planning to consider them until the TT implementation works. I'm currently doing an exhaustive search, no heuristics involved. The same applies to your elaboration on "depth"; I'll consider that once I've got the TT working in the first place.


[...] when at least one of the successors has yielded a score better than alpha and none of them has yielded a score of beta or better.

Why is that? The way my implementation works (or should work) right now is returning the exact score achieved in the respective subgame. Why not let the algorithm decide which node to pick when we return that value? Moreover, if we find a node that has been investigated before, alpha and beta can vary tremendously depending on the history. Can we really make effective assumptions about every other position that could come up when storing a position in the TT?


Yes, I suspect there is a problem with the implementation.

What do you suggest doing? Is there something wrong with the code? How can I debug such things?

I haven't considered any improvements other than a transposition table, and I'm not planning to consider them until the TT implementation works. I'm currently doing an exhaustive search, no heuristics involved. The same applies to your elaboration on "depth"; I'll consider that once I've got the TT working in the first place.


If you are doing exhaustive search, everything I said about depth doesn't matter.



[...] when at least one of the successors has yielded a score better than alpha and none of them has yielded a score of beta or better.



Why is that? The way my implementation works (or should work) right now is returning the exact score achieved in the respective subgame.


If that's the case, you are not using alpha-beta search: You are doing straight minimax search. Alpha-beta doesn't always return exact scores.


Why not let the algorithm decide which node to pick when we return that value?



I don't know what that sentence means.


Moreover, if we find a node that has been investigated before, alpha and beta can vary tremendously depending on the history.


Yes, of course.


Can we really make effective assumptions about every other position that could come up when storing a position in the TT?


You can store what you have discovered, so it can be used in the future. You are not assuming anything.


Yes, I suspect there is a problem with the implementation.


What do you suggest doing? Is there something wrong with the code? How can I debug such things?


Start by only storing an exact score when it is actually the case that the value obtained by the search is an exact score.

Why are you storing the difference between bestEval and node.getScore() instead of just storing bestEval? What's the point of node.getScore() anyway if you are doing exhaustive search?

If that's the case, you are not using alpha-beta search: You are doing straight minimax search. Alpha-beta doesn't always return exact scores.

I am currently only storing nodes that haven't been pruned/cut off. In this case we do get the minimax value, don't we?


Why are you storing the difference between bestEval and node.getScore() instead of just storing bestEval? What's the point of node.getScore() anyway if you are doing exhaustive search?

I am doing the subtraction in order to "decouple" the subtree from its history. In my opinion, this difference will express: "How many points can the declarer achieve, starting from this position?"

If I don't do the subtraction, what am I supposed to do with the value I retrieve from the TT? Simply return it if it's valid? In all likelihood, the running scores of those nodes will differ because of their different histories, so the points the declarer achieved in the other situation won't be the points he achieves in this one. That's why I store the "points achievable by the declarer, starting from the current position"; when I later find this position in the TT, I take the current running score and add those achievable points, which should yield the minimax value if the flag is VALID. For example, if the running score at a node is 30 and the search returns 72, I store 42; if I later reach the same position with a running score of 25, I return 25 + 42 = 67.

Just to be sure, I tested it by storing only bestVal in the TT and returning that value in the lookup routine, and both the resulting values and the strategy were far off from the optimal course of play. More importantly, the final value that alpha-beta returned wasn't consistent with the actual score of the suggested strategy.

Storing routine:


int hash = logic.hashNode(node);
// No decoupling: store the raw alpha-beta value
transpo.put(hash, new int[]{TT_VALID, bestVal});

Lookup routine:


int hash = logic.hashNode(node);
int[] sameState = transpo.get(hash);
if (sameState != null) {
   if (sameState[0] == TT_VALID) {
      // Return the raw stored value, ignoring the current history
      return sameState[1];
   }
}

If that's the case, you are not using alpha-beta search: You are doing straight minimax search. Alpha-beta doesn't always return exact scores.


I am currently only storing nodes that haven't been pruned/cut off. In this case we do get the minimax value, don't we?


No, if you didn't find a successor that was better than alpha, you only know that the minimax score of this node is alpha or worse.


Why are you storing the difference between bestEval and node.getScore() instead of just storing bestEval? What's the point of node.getScore() anyway if you are doing exhaustive search?



I am doing the subtraction in order to "decouple" the subtree from its history. In my opinion, this difference will express: "How many points can the declarer achieve, starting from this position?"


Yes, you are right about storing the difference between the search result and the current score. Sorry, I forgot you had described this in your first post.


No, if you didn't find a successor that was better than alpha, you only know that the minimax score of this node is alpha or worse.

I taxed my brain, and I think I got it: just because the node itself wasn't pruned doesn't necessarily mean that nodes further down the tree haven't been pruned, right?

To your suggestion of only storing scores better than alpha: thanks! It works when I do it this way!

The function now returns the same value and strategy as plain alpha-beta, but nearly 2.5 times faster, and depending on the deal there are between 80 and 500 million positive lookups. I still don't quite get why that is. Could you please elaborate a bit on why I should only store values better than alpha?

Also: how would I go about storing pruned nodes? I understand the concept of bounds, but should I apply constraints similar to the ones you suggested for VALID nodes?

And last but not least, a basic question: as I said earlier, I only store nodes that conclude a trick, since that's the earliest moment at which I could possibly find a match. Do you agree with that? If so, can I make the same assumption for pruned nodes?

My intuition is that it would be better to store all nodes, not just the ones that conclude a trick. But these things can only be settled by testing.

If you want to store and use bounds, I'll try to explain the technique as simply as I can. Let's start with alpha-beta without transposition tables:
int negamax(Position P, int alpha, int beta) {
  if (P.is_terminal())
    return P.score_from_the_point_of_view_of_player_to_move();

  for (Move move : P.legal_moves()) {
    P.make_move(move);
    int score = -negamax(P, -beta, -alpha);  // negamax: flip sign and window
    P.undo_move(move);
    if (score > alpha) {
      alpha = score;     // raise the lower end of the window
      if (score >= beta)
        break;           // beta cutoff: the opponent won't allow this line
    }
  }

  return alpha;
}
Now we have to modify it in two ways: to retrieve and use data from the transposition table, and to store data in it.

int negamax(Position P, int alpha, int beta) {
  if (P.is_terminal())
    return P.score_from_the_point_of_view_of_player_to_move();

  // 1) Retrieve: tighten the window with previously stored bounds.
  int hash_lower_bound;
  int hash_upper_bound;
  bool hash_hit = TT.retrieve(P.get_hash(), &hash_lower_bound, &hash_upper_bound);
  if (hash_hit) {
    if (alpha < hash_lower_bound)
      alpha = hash_lower_bound;
    if (beta > hash_upper_bound)
      beta = hash_upper_bound;
    if (alpha >= beta)
      return alpha;          // the stored bounds alone decide this node
  }

  // Until a move improves alpha, the result is only an upper bound.
  BoundType bound_type = UpperBound;

  for (Move move : P.legal_moves()) {
    P.make_move(move);
    int score = -negamax(P, -beta, -alpha);
    P.undo_move(move);
    if (score > alpha) {
      alpha = score;
      bound_type = Exact;           // alpha improved, no cutoff (yet): exact
      if (score >= beta) {
        bound_type = LowerBound;    // cutoff: the true score is alpha or better
        break;
      }
    }
  }

  // 2) Store: record what kind of bound "alpha" now represents.
  TT.store_hash(P.get_hash(), bound_type, alpha);

  return alpha;
}
[EDIT: In case the pseudo-C++ above doesn't make it clear enough, I suggest you store a lower bound and an upper bound in the transposition table. These can possibly be -infinity or +infinity, of course.]
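Translated back to your Java setting, with your running-score decoupling folded in, an entry holding both bounds might look like the following sketch. Everything here is illustrative: the Node interface and the INF constant are mine, and I'm assuming getScore() returns the running score from the point of view of the player to move, matching the negamax formulation above.

import java.util.HashMap;

interface Node {
    boolean isTerminal();
    int getScore();              // running score, from the player to move's view
    Iterable<Node> successors();
    int hash();                  // e.g. 30-bit card mask + 2-bit player to move
}

class TTSolver {
    static final int INF = 1_000_000;

    static class Entry {
        int lower = -INF;        // lower bound on the history-free value
        int upper = +INF;        // upper bound on the history-free value
    }

    final HashMap<Integer, Entry> transpo = new HashMap<>();

    int negamax(Node node, int alpha, int beta) {
        if (node.isTerminal())
            return node.getScore();   // exhaustive search: the true final score

        Entry e = transpo.get(node.hash());
        if (e != null) {
            // Shift the stored (history-free) bounds back into this history.
            int lower = e.lower + node.getScore();
            int upper = e.upper + node.getScore();
            if (lower > alpha) alpha = lower;
            if (upper < beta)  beta  = upper;
            if (alpha >= beta) return alpha;
        }

        int originalAlpha = alpha;
        int bestVal = -INF;
        for (Node child : node.successors()) {
            int score = -negamax(child, -beta, -alpha);
            if (score > bestVal) bestVal = score;
            if (score > alpha)   alpha = score;
            if (alpha >= beta)   break;     // beta cutoff
        }

        // Store the result, decoupled from this node's running score.
        if (e == null) { e = new Entry(); transpo.put(node.hash(), e); }
        int decoupled = bestVal - node.getScore();
        if (bestVal <= originalAlpha)       // fail low: only an upper bound
            e.upper = Math.min(e.upper, decoupled);
        else if (bestVal >= beta)           // fail high: only a lower bound
            e.lower = Math.max(e.lower, decoupled);
        else                                // exact score
            e.lower = e.upper = decoupled;

        return bestVal;
    }
}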

Let me know if anything is not clear or doesn't make sense to you.

You should also spend some effort on move ordering. Trying promising moves first can effect a dramatic reduction in the time it takes to explore a tree with alpha-beta search. In particular, you should probably store the best move in the transposition table, so that you can try that move first the next time you visit the node, even if the stored bounds aren't good enough to stop the search. Some sort of killer heuristic is probably also very effective.
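As a small sketch of the TT-move idea (the helper below is illustrative; how you represent moves and entries is up to you):

import java.util.List;

class MoveOrdering {
    // Hoist the move remembered in the TT entry to the front of the move
    // list, so it is searched first on the next visit to this node.
    static <M> void tryFirst(List<M> moves, M remembered) {
        if (remembered != null && moves.remove(remembered)) {
            moves.add(0, remembered);
        }
    }
}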

