Suduko Pattern : magical id formula ?

Started by
19 comments, last by SamLowry 15 years, 8 months ago
That's very interesting ToohrVyk. I missed how 81 digits fits into 30 characters though?

Prinzessin, I believe what ToohrVyk is saying is that your string ID *is* the puzzle, and you can quickly compare 10,0000 (?) of them with using a trie. This would also be perfect, as opposed to rejecting identical hashes that might or might not actually represent the same sudoku puzzle (not that that was a bad idea or anything).

Cheers
-Scott

[Edit] 00 - 99, packed in 7 bits each? (~37 bytes)
Advertisement
The naïve way of fitting things is to store two digits in each byte (since you only need 4 bits to represent a value between 1 and 9). This means that 81 values are stored as 41 bytes.

But, you don't need to store the last number in a line. Meaning you can drop the 9 rightmost values (that's 72 values or 36 bytes). You also don't need to store the last number in a column (the remaining 8 values at the bottom, which leaves 64 values or 32 bytes). Finally, there are 4 full squares remaining where one of the values can be removed and inferred from the 8 others. That's 60 values or 30 bytes.

Even better is to pack the values tighter (basically, you use the 60 digits as the digits of a number in base 9, and then write that number in base 256) which only requires 24 bytes (that's merely 6 integers), but is a little harder to compute.

EDIT: even better, since you only need to represent eight permutations of [1-9], this only requires 19 bytes when fully packed. If you also consider that you don't take into account permutations of the numbers (that is, the first line is always sorted 1-to-9) then only seven permutations of [1-9] are required, then you can reduce this number to 17 bytes when fully packed Even better, if you notice that things aren't real permutations (because the rows above them may interfere), you can get down to a worst-case of 14 bytes (but the computational cost is starting to get a bit heavy).

[Edited by - ToohrVyk on August 11, 2008 8:24:24 AM]
My (elaborate) approach to generating unique IDs for puzzles would go like this: I would start off with a constraint solver, which keeps track of which numbers can go in which cells. An empty sudoku grid would have [1,2,3,4,5,6,7,8,9] attached to each cell. If I were to fix the upper-left cell to 1, the cells in the same row, column and square would be left with [2,3,4,5,6,7,8,9] etc.

I begin with the empty grid and consider the first cell: there are 9 possibilities to choose from. I take a pick and write that down as the first element in a list: (k, 9) where k is a number of 0 to 8 (the index of the chosen option from the list) and 9 stands for the number of possibilities I had. I would then let the solver propagate the constraints (i.e. it removes the chosen number from the appropriate lists). I then consider the second cell, which can be filled with 8 possible values. I pick one, and write down (m, 8), etc. I go on like this: going to the next cell which has not yet been fixed, pick one of the possible values, and write that down in the list. Cells which are fixed by constraint propagation can thus be skipped.

I end up with a list like [(3, 8), (4, 7), (0, 6), ...] which uniquely defines the sudoku puzzle since it contains all choices I made during its construction. Technically, I only need the first number of each pair, but the second is needed to encode it to a single number:
encode []          = 0encode ((k, n):xs) = encode xs * n + k

The resulting number uniquely identifies the grid, and I can rebuild the entire grid from that number.

This encoding should have low or perhaps even zero entropy. It would be interesting to compute the largest possible value for this ID, as it would represent the amount of information encoded in a sudoku grid (at least if the encoding has zero entropy...)
Thank you all !!! Your posts have been of great help. I am currently trying to add this to my algorithm. Will post my updates here...
Quote:Original post by SamLowry
My (elaborate) approach to generating unique IDs for puzzles would go like this: I would start off with a constraint solver, which keeps track of which numbers can go in which cells. An empty sudoku grid would have [1,2,3,4,5,6,7,8,9] attached to each cell. If I were to fix the upper-left cell to 1, the cells in the same row, column and square would be left with [2,3,4,5,6,7,8,9] etc.


Does this mean I should be doing this at first? I mean while actual generation of the puzzle? I'm not sure I understand how to generate the pairs...could you please help me out a little more? Thanks

I mean for the following puzzle and solution, which one would i be working on?


8 0 0 9 0 2 0 0 0
5 0 2 0 0 3 0 0 0
0 0 0 0 5 0 9 3 0
0 8 5 0 7 0 0 0 0
0 6 0 1 3 0 0 0 0
0 0 0 0 0 0 1 7 4
0 0 0 7 9 8 0 0 3
0 1 7 3 0 0 0 0 9
0 0 0 0 0 0 5 4 7


8 7 3 9 6 2 4 5 1
5 9 2 4 1 3 7 6 8
6 4 1 8 5 7 9 3 2
1 8 5 2 7 4 3 9 6
7 6 4 1 3 9 8 2 5
3 2 9 5 8 6 1 7 4
4 5 6 7 9 8 2 1 3
2 1 7 3 4 5 6 8 9
9 3 8 6 2 1 5 4 7
While I don't know how a Sudoku generator works, my estimate is that it might be something like "fill 2-3 random numbers into an empty grid and try to solve it". If it solves, output the desired amount of numbers, else repeat.

If it is anything like this, then a working solution might be to seed the random number generator with a known value as the very first thing, for every new Sudoku. If the Sudoku solves, remember that initialisation number.
That way, you have "encoded" every Sudoku with an unique number which is only the size of a single integer, and the processing overhead is zero.

On top, you can trade storage for computing time if you have millions of Sudokus to keep in your records. That is, you never need to store the whole Sudoku, because you know that you can always regenerate it from that one number.
Some good suggestions here.
For the love of god, please tell me that you've just omitted your error checking code for brevity, and you don't really assume that all those functions succeed.
I went with the Goedel formula. It was the simplest. I am calculating the Goedel number (http://en.wikipedia.org/wiki/G%C3%B6del_number)for each puzzle then adding the puzzle to a hash set.
Quote:Original post by prinzessin
Quote:Original post by SamLowry
My (elaborate) approach to generating unique IDs for puzzles would go like this: I would start off with a constraint solver, which keeps track of which numbers can go in which cells. An empty sudoku grid would have [1,2,3,4,5,6,7,8,9] attached to each cell. If I were to fix the upper-left cell to 1, the cells in the same row, column and square would be left with [2,3,4,5,6,7,8,9] etc.


Does this mean I should be doing this at first? I mean while actual generation of the puzzle? I'm not sure I understand how to generate the pairs...could you please help me out a little more? Thanks

I mean for the following puzzle and solution, which one would i be working on?


8 0 0 9 0 2 0 0 0
5 0 2 0 0 3 0 0 0
0 0 0 0 5 0 9 3 0
0 8 5 0 7 0 0 0 0
0 6 0 1 3 0 0 0 0
0 0 0 0 0 0 1 7 4
0 0 0 7 9 8 0 0 3
0 1 7 3 0 0 0 0 9
0 0 0 0 0 0 5 4 7


8 7 3 9 6 2 4 5 1
5 9 2 4 1 3 7 6 8
6 4 1 8 5 7 9 3 2
1 8 5 2 7 4 3 9 6
7 6 4 1 3 9 8 2 5
3 2 9 5 8 6 1 7 4
4 5 6 7 9 8 2 1 3
2 1 7 3 4 5 6 8 9
9 3 8 6 2 1 5 4 7

Neither :)

First of all, I'll repeat that my algorithm is very elaborate and rather complex, so you might want to go for ToohrVyk's approach.

Let's first state the exact problem: we want to define a one-to-one relationship between sudoku grids and naturals numbers, i.e. each grid is associated with exactly one number, and each number is associated with exactly one grid. Given the puzzle, you can compute the number, and vice versa. Also, the encoding must have zero entropy, i.e. if N is the largest possible ID, all numbers from 0 to N must represent a different grid. (Thus, this is not really what you need in order to quickly discover duplicates in sets... a solution to this problem is total overkill for your situation)

My approach is to generate a sudoku puzzle and to write down each choice you make. Using the seed of the random number generator used as ID (as mentioned in a previous post) is somewhat along these lines; however, it does not fully comply to the problem definition above (yet it certainly is a good approximative solution).

A sudoku grid consists of 81 cells. Each of these cells needs to contain a digit from 1 to 9. You could directly concatenate each of these digits into one big number, but that wouldn't satisfy the zero-entropy constraint, i.e. there would be plenty of invalid IDs, such as any beginning with "11..." as it would mean that the first row contains two 1's.
So, in order to prevent this from happening, we use a solver. The solver takes an incomplete grid and returns whether or not it can be solved, that is, there is at least one solution possible. E.g. if given the empty (=constraintless) grid, it will return true; if given a grid with 1 1 on one row, it will return false. The solver must not only check for the typical sudoku constraints (= no duplicate digits in the same row/column/square), but must fully solve the grid to make sure there is a possible solution. That's because you could have an incomplete grid that satisfies all sudoku-constraints at first sight, yet has no solution.

So, given our solver, we start constructing the sudoku grid. We start off with the empty grid, and consider the first cell. We want to know (let's assume we're really stupid) what possible digits can be put in this first cell. We just know the general rule that each cell eventually contains a digit from 1 to 9. So, we first fill in 1 in that first cell, and pass it to the solver. It will of course return true. So, we know that 1 is a possible value for the first cell. We do the same for 2, 3, 4, 5, 6, 7, 8 and 9, and thus build a list of possible values for that first cell. We already know that any of these digits can be filled in, so we'll end up with the list [1, 2, 3, 4, 5, 6, 7, 8, 9]. Now, we have to make a choice and write it down, as this part of the sudoku construction is nondeterministic. Let's say we choose 5, this is the option with (0-based) index 4 in our list. So, we write this information down somewhere, namely in a list of pairs: (4, 9), which means "we had 9 choices, of which we chose the 4th in the list".

So, now we have built ourselves a partial grid with 5 in the very first cell. We repeat the same process for each of the remaining 80 cells: we fill in the N-th cell with each of the digits 1..9, and ask the solver "could this lead to a solution?". We build a list of possible digits, make a random choice, and write down this choice as an extra item in our pairs-list.

Note: the order in which the cells are filled in must be fixed. I assume this happens row by row. It could work column by column, or any other seemingly random order, but the order has an effect on the ID-grid association: you can only decode an ID if you already know the order in which the cells were filled in, so it's best to choose one order (e.g. row-by-row, left-to-right) and stick by it.

Our second cell will have option list [1, 2, 3, 4, 6, 7, 8, 9], and let's pretend we choose 2: the pairs-list becomes [(4, 9), (2, 8)]. Next, the third cell will have option list [1, 3, 4, 6, 7, 8, 9], and so on. The ninth cell will be fully constrained, as there will be only one option left, so the 9th entry in our pairs list will be (0, 1), meaning "1 choice, so we went with the one with index 0". The first nine pairs in the list will probably look something like [(<9, 9), (<8, 8), (<7, 7), (<6, 6) (< 5, 5), (< 4, 4), (<3, 3), (< 2, 2), (0, 1)] where "< N" stands for "a number less than N". Note that there could be duplicates in those numbers, e.g. [(0, 9), (0, 8), (0, 7), (0, 6), (0, 5), (0, 4), (0, 3), (0, 2), (0, 1)] is a valid list for the first 9 choices, and represents the choosing of the first option each time, i.e. the grid's upper row will be filled with "1 2 3 4 5 6 7 8 9". Conversely, [(8, 9), (7, 8), (6, 7), (5, 6), (4, 5), (3, 4), (2, 3), (1, 2), (0, 1)] stands for always making the last choice, thus "9 8 7 6 5 4 3 2 1".

In the end, we'll end up with a list of 81 pairs, one for each pair. This list is thus some unique representation of the sudoku grid. As last step towards a unique ID, we need to convert this list to a number. This is done by using a "variable base encoding", or whatever you'd call it. An everyday number, e.g. 475, is written in base 10 (decimal), and a computer number (e.g. 10010110 or 5E12A) is often written in base 2 (binary) or base 16 (hexadecimal). These use a "fixed base", as every digit is multiplied by a power of the same base: 475 = 4 * 10^2 + 7 * 10^1 + 5 * 10^0. This way, given a base, the same value can be encoded in a unique manner. The problem is, one has to know beforehand which base is used. If I give you the number 1000, you technically can't know what value it represents: does it use base 10, base 2, base 5542? Context helps you determine which base is used.

I use variable base, i.e. each digit can be multiplied by a different base power, which only exacerbates the problem: in order to "decode" a number, we have to know beforehand which bases were used in its creation. Fortunately, this poses no problem, as I'll explain later.

So, let's first concentrate on how to convert the list to a unique natural number. Simple:
-- Single argument = pair listencode []             = 0encode ((n, b) :: xs) = encode xs * b + n

Examples:
[]               == 0[(5, 8)]         == 0 * 8 + 5           == 5[(2, 5), (3, 8)] == (0 * 8 + 3) * 5 + 2 == 17[(3, 7), (1, 2)] == (0 * 2 + 1) * 7 + 3 == 10


That's the encoding part :) I warned you it was elaborate. Now for the decoding part...

Well, as a first step, we need a way to convert a natural number to a choice list. As mentioned before, we need the actual bases used for that. Let's postpone that detail to later.
-- First argument = ID-- Second argument = list of basesdecode n []      = []decode n (b::bs) = (rest, b) :: decode ((n - rest) / b) bs                   where                     rest = mod n b               --- mod n b == n % b

Examples (inverse of previous ones)
decode 0 []      == []decode 5 [8]     == (5 % 8, 8) :: [] == [(5, 8)]decode 17 [5, 8] == (17 % 5, 5) :: decode ((17 - (17 % 5)) / 5) [8]                 == (2, 5) :: decode 3 [8]                 == (2, 5) :: (3 % 8, 8) :: []                 == [(2, 5), (3, 8)]decode 10 [7, 2] == (10 % 7, 7) :: decode ((10 - (10 % 7)) / 7) [2]                 == (3, 7) :: decode 1 [2]                 == (3, 7) :: (1 % 2, 2) :: []                 == [(3, 7), (1, 2)]

We won't be able to use this decode function, as we'll only discover the bases step-by-step, but it's proof that we can, given the bases, reconstruct our choice list from the ID.

Given an ID, how to rebuild our sudoku grid? Well, we start off the same way: we begin at the first cell, try to fill in all digits 1-9 and ask the solver which are valid choices, resulting in an option list for the first cell. The length L of this list is our first base. So, we can partially apply our decoder to find which choice we have to make for this cell: ID % L is this choice. We then "reduce" our ID: (ID - (ID % L)) / B is our new ID, describing the choices made in the next 80 cells.

Thus, we repeat the same step 80 more times: in the N-th cell, we fill in the digits 1-9, ask the solver if it leads to a solution, build our option list, compute its length, partially decode the ID yielding our next choice, reduce our ID, and move on to the next cell.

The nice thing about the encoding is that fully constrained cells (where there's only one option left, such as the 9th cell in a row of which the 8 first cells have already been decided) is that they don't add anything to the number: if we have no choice, our "choice pair" becomes (0, 1), and it has zero effect on our ID, as it multiplies with 1 and adds 0, meaning the skipping of fully constrained cells is done automatically and doesn't need special code.

That's it then. I believe this algorithm to be zero-entropy. If you can find the largest number this encoding can yield, be sure to let me know :)
Thank you SamLowry for taking the time and effort to explain all this. It's actually clear to me now. Thanks again for your help.

This topic is closed to new replies.

Advertisement