Sign in to follow this  
prinzessin

Suduko Pattern : magical id formula ?

Recommended Posts

Hello Everyone, I have recently implemented a Sudoku generator and solver. My question is: Is there some kind of magical number that uniquley represents a solved Sudoku-pattern? a mathematical formula that represents a unique id of the puzzle. Here is why: my program generates n different puzzles and solutions... i would like to compare these puzzles to check if the same one was generated twice (to eliminate it) without actually having to compare the values of the array one by one? In other words is there a number that identifies a sudoku ?

Share this post


Link to post
Share on other sites
Compare the grids: a grid is only 81 digits long, which (when you put two digits in every character) can easily fit in about 30 characters. Comparing 30 characters for equality is extremely fast, given that the probability of stopping after the first character is higher than 99%!

Share this post


Link to post
Share on other sites
Thanks but I'm afraid I cant do that. I am generating about 10,0000 text files each containing the puzzle followed by the solution.
The puzzle is identified by the file name (from 1 to 10,0000) and it is this number that i store in my database.
The idea was to add each puzzle's magic identifier (if it exists) to a hash set and therefore make duplication of puzzles impossible.

i cant compare the grids : what would I do ? Compare each grid with all the existing ones ??

Share this post


Link to post
Share on other sites
you can do something simple like, a godel numbering type trick (raise the primes to the appropriate powers and multiply)

but interestingly it might be tough if you really want the lowest entropy system. the minimum amount of information should be stored. so you might need to know the maximum amount of numbers that can be taken off the puzzle so that it is still solvable into only one final pattern. but then it might turn out that those numbers are related in yet another way outside the bounds of sodoku puzzle rules.

i dont know.. there is definitly easy ways to hash it so that you dont get the same puzzle twice, but now you have me curious about how much info is actually stored in sodokus... i wrote a generator for the mma math modeling competition and haven't touched soduku since (3 days straight of work)

Share this post


Link to post
Share on other sites
It is certainly possible to associate a unique natural number with each sudoku puzzle (as the number of sudoku puzzles is finite), but converting such an ID to puzzle or vice versa is probably nontrivial.

If your only concern is avoiding duplicates, you could get away with just using hashes. If the hashes are different, it is certain the puzzles are different. However, if the hashes are equal, the puzzles might be equal, but that doesn't really cause trouble: if they have the same hash, you can just assume they're the same puzzle, throw one away and generate a replacement. So, it is a conservative approach in the sense that you'll be rejecting sets of 50 puzzles with no duplicates, but at least you're sure you'll have no 50 puzzles with duplicates.

As a hash number, you could use some operation on the upper row of the grid. If [n] stands for the position of n in the first row, i.e. a number from 0 to 8, then you could create a hash as sum([i+1] * i ^ 9, i=0..8) or something. The hash fits in a 32-bit integer (as it could not be larger than 9^9 = 387,420,489), and you'll have 362880 different possible hash values, which is certainly enough to guarantee that you won't end up in a long loop when trying to generate a new sudoku puzzle different from 8 others as there is only 2.76E-6 chance of a collision.

Share this post


Link to post
Share on other sites
Quote:
Original post by SamLowry
It is certainly possible to associate a unique ID with each sudoku puzzle, but converting such an ID to puzzle or vice versa is probably nontrivial.

I dont want to convert back and forth... I will actually store this number in a hash set. But how do I reach this number?

Share this post


Link to post
Share on other sites
Quote:
Original post by prinzessin
Quote:
Original post by SamLowry
It is certainly possible to associate a unique ID with each sudoku puzzle, but converting such an ID to puzzle or vice versa is probably nontrivial.

I dont want to convert back and forth... I will actually store this number in a hash set. But how do I reach this number?


I updated my previous post. (Sorry, I tend to do that a lot, because often posters as not as quick as you :))

Share this post


Link to post
Share on other sites
That's very interesting ToohrVyk. I missed how 81 digits fits into 30 characters though?

Prinzessin, I believe what ToohrVyk is saying is that your string ID *is* the puzzle, and you can quickly compare 10,0000 (?) of them with using a trie. This would also be perfect, as opposed to rejecting identical hashes that might or might not actually represent the same sudoku puzzle (not that that was a bad idea or anything).

Cheers
-Scott

[Edit] 00 - 99, packed in 7 bits each? (~37 bytes)

Share this post


Link to post
Share on other sites
The naïve way of fitting things is to store two digits in each byte (since you only need 4 bits to represent a value between 1 and 9). This means that 81 values are stored as 41 bytes.

But, you don't need to store the last number in a line. Meaning you can drop the 9 rightmost values (that's 72 values or 36 bytes). You also don't need to store the last number in a column (the remaining 8 values at the bottom, which leaves 64 values or 32 bytes). Finally, there are 4 full squares remaining where one of the values can be removed and inferred from the 8 others. That's 60 values or 30 bytes.

Even better is to pack the values tighter (basically, you use the 60 digits as the digits of a number in base 9, and then write that number in base 256) which only requires 24 bytes (that's merely 6 integers), but is a little harder to compute.

EDIT: even better, since you only need to represent eight permutations of [1-9], this only requires 19 bytes when fully packed. If you also consider that you don't take into account permutations of the numbers (that is, the first line is always sorted 1-to-9) then only seven permutations of [1-9] are required, then you can reduce this number to 17 bytes when fully packed Even better, if you notice that things aren't real permutations (because the rows above them may interfere), you can get down to a worst-case of 14 bytes (but the computational cost is starting to get a bit heavy).

[Edited by - ToohrVyk on August 11, 2008 8:24:24 AM]

Share this post


Link to post
Share on other sites
My (elaborate) approach to generating unique IDs for puzzles would go like this: I would start off with a constraint solver, which keeps track of which numbers can go in which cells. An empty sudoku grid would have [1,2,3,4,5,6,7,8,9] attached to each cell. If I were to fix the upper-left cell to 1, the cells in the same row, column and square would be left with [2,3,4,5,6,7,8,9] etc.

I begin with the empty grid and consider the first cell: there are 9 possibilities to choose from. I take a pick and write that down as the first element in a list: (k, 9) where k is a number of 0 to 8 (the index of the chosen option from the list) and 9 stands for the number of possibilities I had. I would then let the solver propagate the constraints (i.e. it removes the chosen number from the appropriate lists). I then consider the second cell, which can be filled with 8 possible values. I pick one, and write down (m, 8), etc. I go on like this: going to the next cell which has not yet been fixed, pick one of the possible values, and write that down in the list. Cells which are fixed by constraint propagation can thus be skipped.

I end up with a list like [(3, 8), (4, 7), (0, 6), ...] which uniquely defines the sudoku puzzle since it contains all choices I made during its construction. Technically, I only need the first number of each pair, but the second is needed to encode it to a single number:

encode [] = 0
encode ((k, n):xs) = encode xs * n + k

The resulting number uniquely identifies the grid, and I can rebuild the entire grid from that number.

This encoding should have low or perhaps even zero entropy. It would be interesting to compute the largest possible value for this ID, as it would represent the amount of information encoded in a sudoku grid (at least if the encoding has zero entropy...)

Share this post


Link to post
Share on other sites
Quote:
Original post by SamLowry
My (elaborate) approach to generating unique IDs for puzzles would go like this: I would start off with a constraint solver, which keeps track of which numbers can go in which cells. An empty sudoku grid would have [1,2,3,4,5,6,7,8,9] attached to each cell. If I were to fix the upper-left cell to 1, the cells in the same row, column and square would be left with [2,3,4,5,6,7,8,9] etc.


Does this mean I should be doing this at first? I mean while actual generation of the puzzle? I'm not sure I understand how to generate the pairs...could you please help me out a little more? Thanks

I mean for the following puzzle and solution, which one would i be working on?


8 0 0 9 0 2 0 0 0
5 0 2 0 0 3 0 0 0
0 0 0 0 5 0 9 3 0
0 8 5 0 7 0 0 0 0
0 6 0 1 3 0 0 0 0
0 0 0 0 0 0 1 7 4
0 0 0 7 9 8 0 0 3
0 1 7 3 0 0 0 0 9
0 0 0 0 0 0 5 4 7


8 7 3 9 6 2 4 5 1
5 9 2 4 1 3 7 6 8
6 4 1 8 5 7 9 3 2
1 8 5 2 7 4 3 9 6
7 6 4 1 3 9 8 2 5
3 2 9 5 8 6 1 7 4
4 5 6 7 9 8 2 1 3
2 1 7 3 4 5 6 8 9
9 3 8 6 2 1 5 4 7

Share this post


Link to post
Share on other sites
While I don't know how a Sudoku generator works, my estimate is that it might be something like "fill 2-3 random numbers into an empty grid and try to solve it". If it solves, output the desired amount of numbers, else repeat.

If it is anything like this, then a working solution might be to seed the random number generator with a known value as the very first thing, for every new Sudoku. If the Sudoku solves, remember that initialisation number.
That way, you have "encoded" every Sudoku with an unique number which is only the size of a single integer, and the processing overhead is zero.

On top, you can trade storage for computing time if you have millions of Sudokus to keep in your records. That is, you never need to store the whole Sudoku, because you know that you can always regenerate it from that one number.

Share this post


Link to post
Share on other sites
I went with the Goedel formula. It was the simplest. I am calculating the Goedel number (http://en.wikipedia.org/wiki/G%C3%B6del_number)for each puzzle then adding the puzzle to a hash set.

Share this post


Link to post
Share on other sites
Quote:
Original post by prinzessin
Quote:
Original post by SamLowry
My (elaborate) approach to generating unique IDs for puzzles would go like this: I would start off with a constraint solver, which keeps track of which numbers can go in which cells. An empty sudoku grid would have [1,2,3,4,5,6,7,8,9] attached to each cell. If I were to fix the upper-left cell to 1, the cells in the same row, column and square would be left with [2,3,4,5,6,7,8,9] etc.


Does this mean I should be doing this at first? I mean while actual generation of the puzzle? I'm not sure I understand how to generate the pairs...could you please help me out a little more? Thanks

I mean for the following puzzle and solution, which one would i be working on?


8 0 0 9 0 2 0 0 0
5 0 2 0 0 3 0 0 0
0 0 0 0 5 0 9 3 0
0 8 5 0 7 0 0 0 0
0 6 0 1 3 0 0 0 0
0 0 0 0 0 0 1 7 4
0 0 0 7 9 8 0 0 3
0 1 7 3 0 0 0 0 9
0 0 0 0 0 0 5 4 7


8 7 3 9 6 2 4 5 1
5 9 2 4 1 3 7 6 8
6 4 1 8 5 7 9 3 2
1 8 5 2 7 4 3 9 6
7 6 4 1 3 9 8 2 5
3 2 9 5 8 6 1 7 4
4 5 6 7 9 8 2 1 3
2 1 7 3 4 5 6 8 9
9 3 8 6 2 1 5 4 7

Neither :)

First of all, I'll repeat that my algorithm is very elaborate and rather complex, so you might want to go for ToohrVyk's approach.

Let's first state the exact problem: we want to define a one-to-one relationship between sudoku grids and naturals numbers, i.e. each grid is associated with exactly one number, and each number is associated with exactly one grid. Given the puzzle, you can compute the number, and vice versa. Also, the encoding must have zero entropy, i.e. if N is the largest possible ID, all numbers from 0 to N must represent a different grid. (Thus, this is not really what you need in order to quickly discover duplicates in sets... a solution to this problem is total overkill for your situation)

My approach is to generate a sudoku puzzle and to write down each choice you make. Using the seed of the random number generator used as ID (as mentioned in a previous post) is somewhat along these lines; however, it does not fully comply to the problem definition above (yet it certainly is a good approximative solution).

A sudoku grid consists of 81 cells. Each of these cells needs to contain a digit from 1 to 9. You could directly concatenate each of these digits into one big number, but that wouldn't satisfy the zero-entropy constraint, i.e. there would be plenty of invalid IDs, such as any beginning with "11..." as it would mean that the first row contains two 1's.
So, in order to prevent this from happening, we use a solver. The solver takes an incomplete grid and returns whether or not it can be solved, that is, there is at least one solution possible. E.g. if given the empty (=constraintless) grid, it will return true; if given a grid with 1 1 on one row, it will return false. The solver must not only check for the typical sudoku constraints (= no duplicate digits in the same row/column/square), but must fully solve the grid to make sure there is a possible solution. That's because you could have an incomplete grid that satisfies all sudoku-constraints at first sight, yet has no solution.

So, given our solver, we start constructing the sudoku grid. We start off with the empty grid, and consider the first cell. We want to know (let's assume we're really stupid) what possible digits can be put in this first cell. We just know the general rule that each cell eventually contains a digit from 1 to 9. So, we first fill in 1 in that first cell, and pass it to the solver. It will of course return true. So, we know that 1 is a possible value for the first cell. We do the same for 2, 3, 4, 5, 6, 7, 8 and 9, and thus build a list of possible values for that first cell. We already know that any of these digits can be filled in, so we'll end up with the list [1, 2, 3, 4, 5, 6, 7, 8, 9]. Now, we have to make a choice and write it down, as this part of the sudoku construction is nondeterministic. Let's say we choose 5, this is the option with (0-based) index 4 in our list. So, we write this information down somewhere, namely in a list of pairs: (4, 9), which means "we had 9 choices, of which we chose the 4th in the list".

So, now we have built ourselves a partial grid with 5 in the very first cell. We repeat the same process for each of the remaining 80 cells: we fill in the N-th cell with each of the digits 1..9, and ask the solver "could this lead to a solution?". We build a list of possible digits, make a random choice, and write down this choice as an extra item in our pairs-list.

Note: the order in which the cells are filled in must be fixed. I assume this happens row by row. It could work column by column, or any other seemingly random order, but the order has an effect on the ID-grid association: you can only decode an ID if you already know the order in which the cells were filled in, so it's best to choose one order (e.g. row-by-row, left-to-right) and stick by it.

Our second cell will have option list [1, 2, 3, 4, 6, 7, 8, 9], and let's pretend we choose 2: the pairs-list becomes [(4, 9), (2, 8)]. Next, the third cell will have option list [1, 3, 4, 6, 7, 8, 9], and so on. The ninth cell will be fully constrained, as there will be only one option left, so the 9th entry in our pairs list will be (0, 1), meaning "1 choice, so we went with the one with index 0". The first nine pairs in the list will probably look something like [(<9, 9), (<8, 8), (<7, 7), (<6, 6) (< 5, 5), (< 4, 4), (<3, 3), (< 2, 2), (0, 1)] where "< N" stands for "a number less than N". Note that there could be duplicates in those numbers, e.g. [(0, 9), (0, 8), (0, 7), (0, 6), (0, 5), (0, 4), (0, 3), (0, 2), (0, 1)] is a valid list for the first 9 choices, and represents the choosing of the first option each time, i.e. the grid's upper row will be filled with "1 2 3 4 5 6 7 8 9". Conversely, [(8, 9), (7, 8), (6, 7), (5, 6), (4, 5), (3, 4), (2, 3), (1, 2), (0, 1)] stands for always making the last choice, thus "9 8 7 6 5 4 3 2 1".

In the end, we'll end up with a list of 81 pairs, one for each pair. This list is thus some unique representation of the sudoku grid. As last step towards a unique ID, we need to convert this list to a number. This is done by using a "variable base encoding", or whatever you'd call it. An everyday number, e.g. 475, is written in base 10 (decimal), and a computer number (e.g. 10010110 or 5E12A) is often written in base 2 (binary) or base 16 (hexadecimal). These use a "fixed base", as every digit is multiplied by a power of the same base: 475 = 4 * 10^2 + 7 * 10^1 + 5 * 10^0. This way, given a base, the same value can be encoded in a unique manner. The problem is, one has to know beforehand which base is used. If I give you the number 1000, you technically can't know what value it represents: does it use base 10, base 2, base 5542? Context helps you determine which base is used.

I use variable base, i.e. each digit can be multiplied by a different base power, which only exacerbates the problem: in order to "decode" a number, we have to know beforehand which bases were used in its creation. Fortunately, this poses no problem, as I'll explain later.

So, let's first concentrate on how to convert the list to a unique natural number. Simple:

-- Single argument = pair list
encode [] = 0
encode ((n, b) :: xs) = encode xs * b + n

Examples:

[] == 0
[(5, 8)] == 0 * 8 + 5 == 5
[(2, 5), (3, 8)] == (0 * 8 + 3) * 5 + 2 == 17
[(3, 7), (1, 2)] == (0 * 2 + 1) * 7 + 3 == 10


That's the encoding part :) I warned you it was elaborate. Now for the decoding part...

Well, as a first step, we need a way to convert a natural number to a choice list. As mentioned before, we need the actual bases used for that. Let's postpone that detail to later.

-- First argument = ID
-- Second argument = list of bases
decode n [] = []
decode n (b::bs) = (rest, b) :: decode ((n - rest) / b) bs
where
rest = mod n b --- mod n b == n % b

Examples (inverse of previous ones)

decode 0 [] == []
decode 5 [8] == (5 % 8, 8) :: [] == [(5, 8)]
decode 17 [5, 8] == (17 % 5, 5) :: decode ((17 - (17 % 5)) / 5) [8]
== (2, 5) :: decode 3 [8]
== (2, 5) :: (3 % 8, 8) :: []
== [(2, 5), (3, 8)]
decode 10 [7, 2] == (10 % 7, 7) :: decode ((10 - (10 % 7)) / 7) [2]
== (3, 7) :: decode 1 [2]
== (3, 7) :: (1 % 2, 2) :: []
== [(3, 7), (1, 2)]

We won't be able to use this decode function, as we'll only discover the bases step-by-step, but it's proof that we can, given the bases, reconstruct our choice list from the ID.

Given an ID, how to rebuild our sudoku grid? Well, we start off the same way: we begin at the first cell, try to fill in all digits 1-9 and ask the solver which are valid choices, resulting in an option list for the first cell. The length L of this list is our first base. So, we can partially apply our decoder to find which choice we have to make for this cell: ID % L is this choice. We then "reduce" our ID: (ID - (ID % L)) / B is our new ID, describing the choices made in the next 80 cells.

Thus, we repeat the same step 80 more times: in the N-th cell, we fill in the digits 1-9, ask the solver if it leads to a solution, build our option list, compute its length, partially decode the ID yielding our next choice, reduce our ID, and move on to the next cell.

The nice thing about the encoding is that fully constrained cells (where there's only one option left, such as the 9th cell in a row of which the 8 first cells have already been decided) is that they don't add anything to the number: if we have no choice, our "choice pair" becomes (0, 1), and it has zero effect on our ID, as it multiplies with 1 and adds 0, meaning the skipping of fully constrained cells is done automatically and doesn't need special code.

That's it then. I believe this algorithm to be zero-entropy. If you can find the largest number this encoding can yield, be sure to let me know :)

Share this post


Link to post
Share on other sites
Does this account for symmetries? For example, can't you take any particular solved puzzle and rotate it 90 degrees to get an apparently different puzzle which is actually just a rotation? I don't think the solutions presented thus far will work to eliminate these. Nor do I know whether such rotations need to be eliminated.

-Kirk

Share this post


Link to post
Share on other sites
Quote:
Original post by kirkd
Does this account for symmetries? For example, can't you take any particular solved puzzle and rotate it 90 degrees to get an apparently different puzzle which is actually just a rotation? I don't think the solutions presented thus far will work to eliminate these. Nor do I know whether such rotations need to be eliminated.

-Kirk

Neither of my algorithms take symmetries into account. How to deal with it in the case of the zero-entropy ID, I have no idea. With my hashes, ... well I could take a grid and all its symmetrical forms (horizontal/vertical flips, rotations, permutations, ...) and add them all together. This way, symmetrical forms will have the same hash and will be easily detected.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this