How to 'guess' best?

OK, so I have a set of data, let's say integers ranging from 0-255. The set can be any size, but probably not larger than 512 values, and it will always contain an even number of values. The set is reduced by half by averaging each pair of adjacent numbers and saving only the averaged value. Given the averaged data, what's the best method for having an AI figure out the original data?

Before I get flamed: this is not homework, just something I'm trying to figure out.
You can't, if I'm understanding correctly.

What is the average of 4 and 6?
What is the average of 2 and 8?

Answer these and you will have your answer :)
There is no way to solve this problem with any form of accurate guess. Any answer you come up with is equally likely given the information. Why? Because you have N original numbers but only N/2 constraint equations (one per averaged pair). The system is under-determined, meaning there are a huge number of equally valid solutions, and hence any answer you guess is as good as any other, so your guess is worthless.
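To make that concrete, here is a minimal C++ sketch (the average value of 4 is just an example) that counts how many ordered byte pairs collapse to a given exact average; with truncating integer division there would be even more.

#include <cstdio>

int main() {
    const int avg = 4;                    // example averaged value
    int count = 0;
    for (int a = 0; a <= 255; ++a) {
        int b = 2 * avg - a;              // partner needed to sum to exactly 2*avg
        if (b >= 0 && b <= 255) ++count;  // (0,8), (1,7), ..., (8,0)
    }
    std::printf("%d ordered pairs average to %d\n", count, avg);
    return 0;
}

For an average of 4 that prints 9 candidate pairs, all equally consistent with the data.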
If the purpose of this is simply to take a whole bunch of integers and squish them down in size, and performance doesn't mean anything, and you really really really want to save the extra hundred bytes (don't see why)...

Then you could take those averaged values and keep a separate vector whose entries are the absolute difference between one of the original values in each pair and that pair's average. Basically, for every unsigned int pair:

unsigned int src: 2, 6
unsigned int avg: 4
unsigned char difference: 2

| 6 - 4 | == | 2 - 4 | == 2. That way you're saving a weeeee bit of storage space (could be more if you have a couple of hundred thousand ints, however...)
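For what it's worth, a minimal C++ sketch of that average-plus-difference idea (the names encodePair and decodePair are made up for illustration). It assumes each pair sums to an even number, as in the 2 and 6 example, so the integer average is exact, and it keeps the difference signed so the order of the pair survives the round trip.

#include <cstdint>
#include <utility>

struct AvgDiff {
    std::uint8_t avg;   // (a + b) / 2 -- exact under the even-sum assumption
    std::int8_t  diff;  // a - avg, signed so we know which value was which
};

AvgDiff encodePair(std::uint8_t a, std::uint8_t b) {
    std::uint8_t avg = static_cast<std::uint8_t>((a + b) / 2);
    return { avg, static_cast<std::int8_t>(a - avg) };
}

std::pair<std::uint8_t, std::uint8_t> decodePair(AvgDiff e) {
    return { static_cast<std::uint8_t>(e.avg + e.diff),
             static_cast<std::uint8_t>(e.avg - e.diff) };
}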

Then again, if everything is a value between 0 and 255, everything fits in a byte anyways which makes my entire explanation of my idea before entirely, uh, nullified. Err. Useless.

What's the purpose for this anyways?
When you generate the set, make the values randomly generated with a normal distribution (i.e. the middle values have a much higher chance of appearing). Then your AI can budget its guesses accordingly: spend the most attempts on candidates near the centre of the distribution, fewer on candidates one standard deviation out, fewer still at two standard deviations, and so on.

Then you'll have a reasonable "guess" at the solution.
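One way to read that suggestion, as a hedged C++ sketch: for a single averaged value, enumerate every pair that could have produced it (assuming an exact average) and sort the candidates by how likely they are under a normal prior. The mean of 128 and standard deviation of 40 are made-up parameters, not anything from the thread.

#include <algorithm>
#include <utility>
#include <vector>

std::vector<std::pair<int, int>> rankedCandidates(int avg,
                                                  double mean = 128.0,
                                                  double sigma = 40.0) {
    std::vector<std::pair<int, int>> pairs;
    for (int a = 0; a <= 255; ++a) {
        int b = 2 * avg - a;                      // partner for an exact average
        if (b >= 0 && b <= 255) pairs.emplace_back(a, b);
    }
    // Log-likelihood of a pair under independent normal priors on each value.
    auto logLikelihood = [&](const std::pair<int, int>& p) {
        auto term = [&](int x) { double z = (x - mean) / sigma; return -0.5 * z * z; };
        return term(p.first) + term(p.second);
    };
    std::sort(pairs.begin(), pairs.end(),
              [&](const std::pair<int, int>& x, const std::pair<int, int>& y) {
                  return logLikelihood(x) > logLikelihood(y);   // most likely first
              });
    return pairs;
}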
Well, I plan on using CRC hashes or some other error-checking codes to compare guess values against, so in theory only the correct guess would have the same hash as the original data.
What is the purpose of this, is it to compress data or is there another reason?
The idea is to compress data, yes, though uncompressing the data might take a rather long time (which is what I'm trying to figure out a better way of doing).
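Just to illustrate what's being proposed (and what the next replies object to), a hedged sketch using zlib's crc32(): brute-force candidate originals that are consistent with the averaged data and accept one whose CRC-32 matches the checksum stored for the original. The catch, as pointed out below, is that a matching CRC doesn't prove the candidate is the original.

#include <cstdint>
#include <vector>
#include <zlib.h>   // crc32(); link with -lz

// Returns true if the candidate reconstruction has the same CRC-32 as the
// checksum that was stored alongside the averaged data.
bool crcMatches(const std::vector<std::uint8_t>& candidate, std::uint32_t storedCrc) {
    uLong crc = crc32(0L, Z_NULL, 0);                           // initial CRC value
    crc = crc32(crc,
                reinterpret_cast<const Bytef*>(candidate.data()),
                static_cast<uInt>(candidate.size()));
    return static_cast<std::uint32_t>(crc) == storedCrc;
}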
Quote: Original post by vaneger
Well, I plan on using CRC hashes or some other error-checking codes to compare guess values against, so in theory only the correct guess would have the same hash as the original data.

Hashes don't work that way - there will *always* be collisions between hashes (i.e. two or more pieces of input data which yield the same hash). A good hashing function can minimise collisions, particularly between similar data, but it can't eliminate them entirely.
Unfortunately, the checksum/hash idea doesn't hold water. For a message 2N bytes in length, you'll have N bytes after you've averaged the data and a checksum of, say, 128 bits (16 bytes). You'll therefore have around 2^(8N) possible original messages but only 2^128 possible checksum values. I'm being a bit vague about this, but the point is that the amount of information you have matters. MD5, for instance, is believed to require an average of 2^128 plaintexts to be tested to produce a given hash. Unfortunately you're going to have to try many, many more plaintexts than this, because for any decent kind of hash/checksum it isn't possible to isolate the effects of changes to the plaintext.

Compression generally works in one of two ways: first, finding that some symbols (bytes, integers, whatever) occur more frequently than others and then encoding them using fewer bits (look up Huffman encoding) - however the price you pay is that you have to store the code as well. The other method is the ZIP method where strings of symbols are encoded as a single symbol if they occur more than once in your data. ZIP actually generates the symbol table on the fly and uses Huffman coding too, but that's the basic idea.
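As a rough illustration of the Huffman half of that (not a full encoder), here is a C++ sketch that counts byte frequencies, builds the Huffman tree with a min-heap, and prints the code length each symbol would get; a real encoder would also emit the actual bit codes and store the table with the compressed data.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Node {
    std::uint64_t freq;
    int symbol;            // -1 for internal nodes
    int left, right;       // indices into the node pool, -1 for leaves
};

void printHuffmanCodeLengths(const std::vector<std::uint8_t>& data) {
    std::uint64_t counts[256] = {0};
    for (std::uint8_t b : data) ++counts[b];

    std::vector<Node> pool;
    using Entry = std::pair<std::uint64_t, int>;   // (frequency, index into pool)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (int s = 0; s < 256; ++s) {
        if (counts[s] == 0) continue;
        pool.push_back({counts[s], s, -1, -1});
        heap.push({counts[s], static_cast<int>(pool.size()) - 1});
    }
    if (heap.empty()) return;

    // Repeatedly merge the two least frequent nodes, Huffman-style.
    while (heap.size() > 1) {
        Entry a = heap.top(); heap.pop();
        Entry b = heap.top(); heap.pop();
        pool.push_back({a.first + b.first, -1, a.second, b.second});
        heap.push({a.first + b.first, static_cast<int>(pool.size()) - 1});
    }

    // The depth of each leaf is the length (in bits) of its Huffman code.
    // (A degenerate one-symbol input gets a 0-bit code here; a real encoder
    // would special-case that.)
    std::vector<std::pair<int, int>> stack{{heap.top().second, 0}};  // (node, depth)
    while (!stack.empty()) {
        std::pair<int, int> top = stack.back(); stack.pop_back();
        const Node& n = pool[top.first];
        if (n.symbol >= 0)
            std::printf("byte %3d: %llu occurrences -> %d-bit code\n",
                        n.symbol,
                        static_cast<unsigned long long>(n.freq),
                        top.second);
        else {
            stack.push_back({n.left, top.second + 1});
            stack.push_back({n.right, top.second + 1});
        }
    }
}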

Your method is more suited to something like audio or video encoding. Some systems work by having a method to guess what the next sample in the stream is, and then encoding the corrections (usually using Huffman encoding or similar). I guess it would be similarly possible to construct something like that for compressing English text, since as I understand it English has an entropy of about 1.5 bits per character (so you should be able to get about 5:1 compression in theory).
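To make the predict-and-store-corrections idea concrete, a small C++ sketch with a deliberately dumb predictor (each sample is predicted to equal the previous one); the function names are made up. The residuals of a smooth signal cluster near zero, which is exactly the kind of skewed distribution Huffman coding then exploits.

#include <cstdint>
#include <vector>

std::vector<int> toResiduals(const std::vector<std::uint8_t>& samples) {
    std::vector<int> residuals;
    int prediction = 0;                      // assume the stream starts at 0
    for (std::uint8_t s : samples) {
        residuals.push_back(static_cast<int>(s) - prediction);  // store the correction
        prediction = s;                      // next prediction = current sample
    }
    return residuals;
}

std::vector<std::uint8_t> fromResiduals(const std::vector<int>& residuals) {
    std::vector<std::uint8_t> samples;
    int prediction = 0;
    for (int r : residuals) {
        int s = prediction + r;              // apply the correction to the prediction
        samples.push_back(static_cast<std::uint8_t>(s));
        prediction = s;
    }
    return samples;                          // exact round trip with toResiduals()
}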
