hash code for vector of integers

Started by
12 comments, last by choffstein 18 years, 2 months ago
I have a vector of integers that I am trying to store in a std::map. After the first element, all of the elements are sorted (though, there are many repeats). An example vector is: 47,6,6,6,7,7,8,8,9,9 Currently, I am using a stringstream, reading in all of the ints, and exporting a string. However, this function runs A LOT. My profiler is fscked, so I can't tell if it is causing a bottleneck, but I was wondering if anyone can come up with any other hash code ideas for a vector of ints. Hash codes should be equal when one vector equals another. I was thinking of passing around char * instead of std::string, but then I realized that the char * should become invalid the second the string stream goes out of scope, so I don't think that is an option (unless the ss.str().c_str() would return a copy...) Thanks.
Advertisement
Quote:Original post by visage
Hash codes should be equal when one vector equals another.


This is impossible to guarantee without some kind of special knowledge about the data. Hash codes are usually fixed in length so obviously if your data can have more combinations than the hash code then the possibility of collisions must exist. I can't really see why you couldn't just put the vector into the map (although this might hit your performance because it would need to be copied once for every insertion operation). If you're just looking for a way to shrink the data some kind of run length encoding looks like it would be appropriate here.
You need to use std::multimap if you plan on having a key being used more than once. Also, do you have a value you are storing in the map along with the key? If not maybe std::set is better for your needs.

I'm not very clear on exactly what you're doing to comment on the speed question.
I have a class called "Configuration" which holds the vector. I cant just insert the vector, because Configuration has other data in it (parent configurations). However, configuration equality is based on the vector's data.

So I am passing in a hash based on the vector and a pointer to the configuration as the data.

Obviously I can't use a set on the pointer to the configuration (without some sort of comparator...)

Quote:
This is impossible to guarantee without some kind of special knowledge about the data.

I don't know why you say this. This is the exact specification that Java has (when obj.equals(obj2), obj.hashCode() = obj2.hashCode) -- I am just bringing it to my implementation.

I don't want to pass in the vector because this is a subclass of a configuration type. Other configuration sub classes don't use vectors, but are included in the same map. Thus, I have to figure out hashing for each type.

Though, I suppose the other configurations could easily be changed to vectors...

Thanks so far.
What is more important hash computation speed or hash distribution?

A simple scheme is XORing all elements.

a nice text comparing different hashes
Quote:Original post by visage
Quote:
This is impossible to guarantee without some kind of special knowledge about the data.

I don't know why you say this. This is the exact specification that Java has (when obj.equals(obj2), obj.hashCode() = obj2.hashCode) -- I am just bringing it to my implementation.


Good point - sorry, not reading that carefully enough (I was thinking of the reverse implication which of course isn't true).
So let me get this right. You're currently calculating a string from the vector and using that as the hash?

If so, that's quite a shocker for performance really. You've got memory allocation/deallocation of the strings. Resizing of strings as you build the hash. Presumably long strings to compare...

Why not calcluate the CRC of the integers in the vector, and use the result as an integer hash?
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Well, I recognize that iMalc -- hence why I asked ;)
In truth, the hash distribution is more important. It must be absolutely unique for each vector that is not equal (which is almost impossible to ask without doing the method I am using).

Ill try the CRC method. Thanks.
CRC hash didn't work so well. I had to force my vector of ints into a vector of chars, then pass that along to the crc, which didn't do so great a job.

Maybe a fast function with a multimap?
Why are you using hashes together with std::map or std::multimap? Lexicographic comparison always is faster than hashes...

This topic is closed to new replies.

Advertisement