Hash Tables - What are they for?

Started by
9 comments, last by jpetrie 16 years ago
I recently read about Hash Tables in a Data Structures book I got, but I'm having trouble seeing any real purpose beyond data encryption. I asked a friend of mine, and he told me that a hash table is just a quick lookup table independent of actual pigeonholing hashing functions, but I can't figure out what the difference (in that case) would be between that description and an array with an enum. Is there a difference?
Quote:
but I'm having trouble seeing any real purpose beyond data encryption

Hashing has a place in cryptography, but when used in the hash table data structure alone, devoid of any other context, there is no encryption involved.

Quote:
I can't figure out what the difference (in that case) would be between that description and an array with an enum. Is there a difference?

Yes. The key point you're missing is that the hash function transforms the key type (your enum, in this case) into an integer identifying which hash bucket the value associated with the key will be stored in (the type of the value is immaterial for this discussion).

Indexing into an array isn't the same -- what happens if the range of your index is large (millions) but you expect only a few sparse entries at a time? What if your key type is a non-integer (say, a string)?
I want to store a bunch of names with IDs. I don't know the names or IDs until they are given to me, but I still want a fast way to check whether I already know a name and ID, and a fast way to insert a new name and ID.

With arrays, search is O(n) and insertion is O(1); or, if I keep the array sorted and binary search it, search is O(lg n) but insertion is O(n), since elements must be shifted.
With a hash table, both insert and search are O(1) on average [not counting collisions].

This could be used for many things, be they player IDs mapped to player data, or data strings in a dictionary compressor.
But as noted by the above poster, hash tables work best for sparse datasets.
Say you want to map names to phone numbers. You can't hard-code everyone's name into an enum, so you use a hash table. And if you have 6 million items in an array, searching and sorting are really slow; on a hash table, search can be a constant-time operation.
Quote:Original post by jpetrie
Indexing into an array isn't the same -- what happens if the range of your index is large (millions) but you expect only a few sparse entries at a time? What if your key type is a non-integer (say, a string)?


Couldn't you just use an enum with your array access, or something like that? Unless this is the form of the actual hash table (the book wasn't really clear on that) I would think this would suffice:

Pointer -> Index -> Data
Pointer -> Index -> Data
...
Hash-table practice problem:

You have a census database containing 300 million people. Each person has a unique SSN (social security number). SSN are, for all practical reasons, random, unique 512-bit (64-byte) numbers. Any SSN is valid, but only 300 million are assigned.

When a new person is born, they are assigned a new SSN. Your task is to generate a new SSN that is not already in use. How will you do that? How long will your algorithm take? How much memory will it use?
Quote:
Couldn't you just use an enum with your array access, or something like that?

You really think you can represent your strings with enums?

What about sparse data sets? Even if the key type is an integer, and thus could be easily represented as an enum, what if there are ten thousand entries? What if you only expect to use fifty or sixty during a typical run of the program? You're wasting nine thousand nine hundred and forty array slots.
Quote:Original post by Antheus
Hash-table practice problem:

You have a census database containing 300 million people. Each person has a unique SSN (social security number). SSN are, for all practical reasons, random, unique 512-bit (64-byte) numbers. Any SSN is valid, but only 300 million are assigned.

When a new person is born, they are assigned a new SSN. Your task is to generate a new SSN that is not already in use. How will you do that? How long will your algorithm take? How much memory will it use?


Does it need to be random, or can we give out sequential SSNs? That is, just increment by one for each person. It seems rather, um, insecure, though. It would run in constant time and use constant memory (64 bytes, to be precise).

That would work fine until we get through all 2^512 numbers, but the social security system will have gone bankrupt long before then. [wink]
Quote:Original post by Ezbez


Does it need to be random, or can we give out sequential SSNs? That is, just increment by one for each person. It seems rather, um, insecure, though. It would run in constant time and use constant memory (64 bytes, to be precise).


No, it's random. If you need a reason, then assume that several governments changed, various Dewey-decimal-based systems didn't account for twins, records were lost, and so on.

As such, given that this is a legacy system, your numbers are random.

BTW: real SSNs and personal ID numbers, and many similar systems, are often problematic for this very reason. Even if the system supposedly guarantees each number is unique, you're very likely to find two people with the same ID. Credit card and insurance companies often have problems with this (poor design of applications and databases, but it happens in the real world, in Big Money).

When dealing with databases, you'll find that some people distrust auto-increment and prefer to roll their own. The result is databases with completely messed-up ID schemes where, for integration purposes, you have no choice but to find empty spots.

So yes, it's quite a real-world example. Clean, sequential, auto-incrementing IDs are great, but often impossible. And yes, as mentioned, incrementing by one has huge security and privacy implications, since it makes it possible to guess a lot about context.

Incrementally generated SSNs, for example, would be illegal in contexts where age discrimination is disallowed, since you could guess down to the hour when a person was born.
Quote:Original post by Antheus
Hash-table practice problem:

You have a census database containing 300 million people. Each person has a unique SSN (social security number). SSN are, for all practical reasons, random, unique 512-bit (64-byte) numbers. Any SSN is valid, but only 300 million are assigned.

When a new person is born, they are assigned a new SSN. Your task is to generate a new SSN that is not already in use. How will you do that? How long will your algorithm take? How much memory will it use?


I'd probably just make linked lists of every element in an array. That is, each element would be a pointer to an SSN, and each element would be pointed to by an identifier (I know I'm not using the term properly).

"John Doe" -> a[1] -> 195484222
"Jane Doe" -> a[2] -> 60958542
...

I think access would be O(n); I'm not really sure about searches.

Quote:Original post by jpetrie
You really think you can represent your strings with enums?

What about sparse data sets? Even if the key type is an integer, and thus could be easily represented as an enum, what if there are ten thousand entries? What if you only expect to use fifty or sixty during a typical run of the program? You're wasting nine thousand nine hundred and forty array slots.


Sorry if I wasn't clear. I was referring to using an enum to represent ints as strings, though that's a localized case that wouldn't work in all situations.

Maybe I'm missing something here, but every diagram I can find of a hash table involves huge arrays with only a few keys, which I'm sure we'll both agree defeats the purpose.

