vaneger

How to 'guess' best?


OK, so I have a set of data, let's say integers ranging from 0-255. The set can be any size, but probably not larger than 512 values, and it will always contain an even number of values. The set will be reduced by half by averaging each pair of adjacent numbers and saving only the averaged value. Given the averaged data, what's the best method for having an AI figure out the original data? Before I get flamed: this is not homework, just something I'm trying to figure out.
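Roughly, the reduction I have in mind looks like this (truncating integer averages, just for the sake of the sketch):

// A minimal sketch of the reduction step: each adjacent pair of bytes is
// replaced by its truncated integer average. Truncation is an arbitrary
// choice here; any consistent rounding rule would do.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<uint8_t> halve_by_averaging(const std::vector<uint8_t>& src) {
    std::vector<uint8_t> out;
    out.reserve(src.size() / 2);
    for (std::size_t i = 0; i + 1 < src.size(); i += 2) {
        // Integer division truncates, so e.g. (2 + 5) / 2 stores 3.
        out.push_back(static_cast<uint8_t>((src[i] + src[i + 1]) / 2));
    }
    return out;
}

int main() {
    std::vector<uint8_t> data = {2, 6, 10, 200, 0, 255};
    for (uint8_t v : halve_by_averaging(data)) std::cout << int(v) << ' ';
    std::cout << '\n';  // prints: 4 105 127
}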

There is no way to solve this problem with any form of accurate guess. Every answer consistent with the averages is equally likely given the information you have. Why? Because you have N original numbers but only N/2 constraint equations (one per averaged pair). The system is under-determined, meaning there is an enormous number of solutions, so any guess is as good as any other and therefore worthless.
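To make that concrete, here is a throwaway count of how many byte pairs collapse onto each stored value (assuming truncating integer averages):

// Count every (x, y) byte pair whose truncated average equals a given stored
// value, to show how badly under-determined the reconstruction is.
#include <initializer_list>
#include <iostream>

int count_pairs_for_average(int avg) {
    int count = 0;
    for (int x = 0; x <= 255; ++x)
        for (int y = 0; y <= 255; ++y)
            if ((x + y) / 2 == avg) ++count;
    return count;
}

int main() {
    // A single stored byte stands in for anywhere from one to several hundred
    // original pairs, and those choices multiply across the whole set.
    for (int avg : {0, 4, 127, 255})
        std::cout << "average " << avg << ": "
                  << count_pairs_for_average(avg) << " candidate pairs\n";
}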

If the purpose of this is simply to take a whole bunch of integers and squish them down in size, and performance doesn't mean anything, and you really really really want to save the extra hundred bytes (don't see why)...

Then you could keep the averaged values and a separate vector holding the absolute value of the difference between one of the original values in each pair and the average itself. Basically, for every unsigned int pair:

unsigned int src: 2, 6
unsigned int avg: 4
unsigned char difference: 2

| 6 - 4 | == | 2 - 4 | == 2. That way you're saving a weeeee bit of storage space (could be more if you have a couple of hundred thousand ints, however...)

Then again, if everything is a value between 0 and 255, everything fits in a byte anyways which makes my entire explanation of my idea before entirely, uh, nullified. Err. Useless.
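Still, if the goal were just to make the averaging step invertible rather than to save space, the usual trick is to keep a signed difference next to the truncated average, roughly the integer "S transform" used by lossless wavelet coders. A rough sketch, not what the OP described, and the difference needs 9 bits, so it saves nothing:

// Sketch of a lossless average/difference pairing (not the OP's scheme):
// store s = floor((a + b) / 2) and d = a - b, then reconstruct exactly.
// d ranges over -255..255, i.e. 9 bits, so nothing actually shrinks.
#include <cassert>
#include <cstdint>

struct Pair { int s; int d; };  // s: truncated average, d: signed difference

Pair encode(uint8_t a, uint8_t b) {
    return { (a + b) / 2, a - b };
}

void decode(Pair p, uint8_t& a, uint8_t& b) {
    // a + b and a - b have the same parity, so the low bit lost by the
    // truncated average can be recovered from d.
    int parity = p.d & 1;                   // assumes two's complement ints
    int a_val = p.s + (p.d + parity) / 2;   // numerator is even, so this is exact
    a = static_cast<uint8_t>(a_val);
    b = static_cast<uint8_t>(a_val - p.d);
}

int main() {
    for (int a = 0; a <= 255; ++a)
        for (int b = 0; b <= 255; ++b) {
            uint8_t ra, rb;
            decode(encode(uint8_t(a), uint8_t(b)), ra, rb);
            assert(ra == a && rb == b);     // the round trip is exact for every pair
        }
}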

What's the purpose for this anyways?

When you generate the set, make the values randomly generated with a normal distribution (i.e. the middle values have a much higher chance of appearing). Then you can have your AI look for x candidate solutions within the main body of the distribution, x-1 within one standard deviation further out, x-2 at two standard deviations, etc.

Then you'll have a reasonable "guess" at the solution.
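As a sketch of that: for each stored average, score every consistent byte pair against an assumed normal distribution and keep the most likely one. The mean and sigma below are made up, and this is only a guess-ranking heuristic, not a recovery method:

// Sketch: given a stored truncated average, pick the byte pair that is most
// likely under an assumed normal prior. The mean and sigma are illustrative.
#include <cmath>
#include <iostream>
#include <utility>

double normal_weight(double x, double mean, double sigma) {
    double z = (x - mean) / sigma;
    return std::exp(-0.5 * z * z);          // unnormalised, fine for comparisons
}

std::pair<int, int> most_likely_pair(int avg, double mean, double sigma) {
    std::pair<int, int> best{0, 0};
    double best_score = -1.0;
    for (int x = 0; x <= 255; ++x)
        for (int y = 0; y <= 255; ++y)
            if ((x + y) / 2 == avg) {
                double score = normal_weight(x, mean, sigma) * normal_weight(y, mean, sigma);
                if (score > best_score) { best_score = score; best = {x, y}; }
            }
    return best;
}

int main() {
    // With values generated around 128, the top-ranked pair for each average
    // hugs the mean rather than the extremes.
    auto [x, y] = most_likely_pair(100, 128.0, 40.0);
    std::cout << "most likely pair for average 100: " << x << ", " << y << '\n';
}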

Well, I plan on using CRC hashes or some other error-checking codes to compare guess values against, so in theory only the correct guess would have the same hash as the original data.
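For a single pair the check would look something like this (a hand-rolled CRC-32 just for illustration, no particular library assumed); it clearly won't scale to hundreds of values, since the candidate space grows exponentially with the number of pairs:

// Sketch of the "guess until the CRC matches" idea for one averaged pair.
#include <cstddef>
#include <cstdint>
#include <iostream>

uint32_t crc32_of(const uint8_t* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

int main() {
    uint8_t original[2] = {37, 200};
    uint8_t avg = static_cast<uint8_t>((original[0] + original[1]) / 2);  // 118
    uint32_t target = crc32_of(original, 2);

    int matches = 0;
    for (int x = 0; x <= 255; ++x)
        for (int y = 0; y <= 255; ++y) {
            uint8_t cand[2] = { static_cast<uint8_t>(x), static_cast<uint8_t>(y) };
            if ((x + y) / 2 == avg && crc32_of(cand, 2) == target) ++matches;
        }
    std::cout << "candidates matching both the average and the CRC: " << matches << '\n';
}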

The idea is to compress data, yes, though decompressing the data might take a rather long time (which is what I'm trying to figure out a better way of doing).

Quote:
Original post by vaneger
Well, I plan on using CRC hashes or some other error-checking codes to compare guess values against, so in theory only the correct guess would have the same hash as the original data.

Hashes don't work that way - there will *always* be collisions between hashes (i.e. two or more pieces of input data which yield the same hash). A good hashing function can minimise collisions, particularly between similar data, but it can't eliminate them entirely.

Unfortunately, the checksum/hash idea doesn't hold water. For a message 2N bytes in length, you'll have N bytes after you've averaged the data and a checksum of, say, 128 bits (16 bytes). That leaves roughly 2^(8*N) possible original messages consistent with the averages but only 2^128 possible checksum values. I'm being a bit vague about this, but the point is that the amount of information you have matters. An MD5 checksum, for instance, is believed to require an average of 2^128 plaintexts to be tested to produce a given hash. Unfortunately you're going to have to try many, many more plaintexts than that, because for any decent kind of hash/checksum it isn't possible to isolate the effects of individual changes to the plaintext.

Compression generally works in one of two ways. The first is finding that some symbols (bytes, integers, whatever) occur more frequently than others and then encoding them using fewer bits (look up Huffman encoding) - the price you pay is that you have to store the code table as well. The other is the ZIP approach, where strings of symbols that occur more than once in your data are encoded as a single symbol. ZIP actually generates the symbol table on the fly and uses Huffman coding too, but that's the basic idea.
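If anyone wants to see the Huffman step concretely, here is a stripped-down sketch that only builds the tree and prints the codes; the frequencies are made up, and a real coder also needs bit packing plus a stored table:

// A bare-bones Huffman tree over made-up symbol frequencies, just to show the
// "frequent symbols get short codes" idea. No bit I/O, no stored table.
#include <iostream>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <vector>

struct Node {
    long freq;
    int symbol;                   // -1 for internal nodes
    std::unique_ptr<Node> left, right;
};

struct ByFreq {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

void collect_codes(const Node* n, const std::string& prefix,
                   std::map<int, std::string>& codes) {
    if (!n->left) { codes[n->symbol] = prefix.empty() ? "0" : prefix; return; }
    collect_codes(n->left.get(), prefix + "0", codes);
    collect_codes(n->right.get(), prefix + "1", codes);
}

int main() {
    std::map<char, long> freq = { {'a', 45}, {'b', 13}, {'c', 12},
                                  {'d', 16}, {'e', 9},  {'f', 5} };

    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (const auto& [sym, f] : freq) pq.push(new Node{f, sym, nullptr, nullptr});

    while (pq.size() > 1) {       // repeatedly merge the two rarest nodes
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1,
                         std::unique_ptr<Node>(a), std::unique_ptr<Node>(b)});
    }

    std::map<int, std::string> codes;
    collect_codes(pq.top(), "", codes);
    for (const auto& [sym, code] : codes)
        std::cout << char(sym) << " -> " << code << '\n';   // 'a' ends up with a 1-bit code

    delete pq.top();              // children are released recursively
}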

Your method is more useful for something like audio or video encoding. Some systems work by having a method that guesses what the next sample in the stream will be, and then encoding only the corrections (usually using Huffman encoding or similar). I guess it would be similarly possible to construct something like that for compressing English text, since as I understand it English has an entropy of about 1.5 bits per character (so you should be able to get about 5:1 compression in theory).
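Here is the predict-and-correct idea in miniature, using the previous sample as the prediction (real codecs use fancier predictors and then entropy-code the residuals):

// Sketch of predict-and-correct coding: predict each sample as the previous
// one and keep only the correction. Smooth signals give small residuals,
// which an entropy coder (Huffman etc.) can then store cheaply.
#include <iostream>
#include <vector>

std::vector<int> to_residuals(const std::vector<int>& samples) {
    std::vector<int> res;
    int prediction = 0;                       // decoder starts from the same prediction
    for (int s : samples) {
        res.push_back(s - prediction);        // store only the correction
        prediction = s;
    }
    return res;
}

std::vector<int> from_residuals(const std::vector<int>& res) {
    std::vector<int> out;
    int prediction = 0;
    for (int r : res) {
        out.push_back(prediction + r);
        prediction = out.back();
    }
    return out;
}

int main() {
    std::vector<int> samples = {100, 102, 103, 103, 101, 98, 97};
    for (int r : to_residuals(samples)) std::cout << r << ' ';  // 100 2 1 0 -2 -3 -1
    std::cout << '\n';
}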

To add a little more to ZQJ's post:

Simple table encoding works when fewer distinct symbols actually appear in the message than could appear in theory. For instance, if you are on a machine with 32-bit integers and only about 2000 of the possible 32-bit values ever appear in your data, you can store a 2000-entry table of those ints and then use sequences of 11 bits instead of 32 (2^11 = 2048). This is simple coding.
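As a sketch of that table idea (the indices here are stored in 16 bits for simplicity; 11 bits per index is the floor for 2000 distinct symbols):

// Sketch of simple table coding: collect the distinct values, then replace
// each value by its index into the table.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

void table_encode(const std::vector<uint32_t>& data,
                  std::vector<uint32_t>& table, std::vector<uint16_t>& indices) {
    std::unordered_map<uint32_t, uint16_t> lookup;
    for (uint32_t v : data) {
        auto it = lookup.find(v);
        if (it == lookup.end()) {
            it = lookup.emplace(v, static_cast<uint16_t>(table.size())).first;
            table.push_back(v);               // first time this value appears
        }
        indices.push_back(it->second);
    }
}

int main() {
    std::vector<uint32_t> data = {90000, 7, 90000, 7, 7, 123456};
    std::vector<uint32_t> table;
    std::vector<uint16_t> indices;
    table_encode(data, table, indices);
    std::cout << table.size() << " distinct values, "
              << indices.size() << " indices\n";  // 3 distinct values, 6 indices
}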

The first couple of paragraphs of ZQJ's post are another idea: at one or more levels of viewing, take the most common data and represent it in the least amount of space, using more space for less common data. This is basically an abstraction of Huffman encoding, and it can be applied at multiple levels (Huffman-encode bit patterns, byte patterns, ints, English words, whatever).

ZQJ's third paragraph is another idea, which is actually two ideas mixed together. The first is function modeling: if your data follows a function, describing the function is cheaper than describing the data. For instance, to play a certain game of FreeCell I type in a 32-bit number, and that number seeds a function which recreates the deal. If your data almost or only sometimes follows a function, storing the differences can help (as he suggests).

The more domain knowledge you can bring to bear, the more likely function modeling is to make sense. For instance, in 256-color graphics the same color is almost always next to itself, so run-length encoding makes sense. In photographs, similar color tones are almost always near each other, so delta (difference) encoding makes sense. Audio data generally follows a function over short stretches, so a function description may save space, and so on. Data analysis is the key to superior data compression.
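Run-length encoding in miniature, since it is the simplest of those to show; the pixel values are made up:

// Sketch of run-length encoding for a row of palette indices: store
// (value, count) pairs, which pays off when the same value repeats a lot.
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

std::vector<std::pair<uint8_t, uint32_t>> rle_encode(const std::vector<uint8_t>& pixels) {
    std::vector<std::pair<uint8_t, uint32_t>> runs;
    for (uint8_t p : pixels) {
        if (!runs.empty() && runs.back().first == p) ++runs.back().second;
        else runs.push_back({p, 1u});
    }
    return runs;
}

int main() {
    std::vector<uint8_t> row = {7, 7, 7, 7, 7, 3, 3, 200, 200, 200};
    for (const auto& [value, count] : rle_encode(row))
        std::cout << int(value) << " x" << count << "  ";  // 7 x5  3 x2  200 x3
    std::cout << '\n';
}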

There is theoretically no such thing as a universally beneficial encoding that can reduce the size of any set of data. It is fundamentally impossible to represent one million possibilities in fewer than 20 bits, and 10 yes-or-no choices will take 10 bits; nothing can be done about it. No amount of knowledge can store every arbitrary sequence of 10 boolean choices in fewer than 10 binary bits.
