Disc Based Hash Table

Started by
4 comments, last by Laval B 12 years, 11 months ago
Hello everyone

I'm working on an archive file format. Files are stored as consecutive[font="arial, helvetica, clean, sans-serif"] [/font]"records" like in a zip archive but with some different attributes. This is all pretty straightforward. Now i want to add a ToC to the archive for fast lookup of a particular file. Of course, this ToC will be placed at the end of the archive. I was thinking about implementing this ToC as a hash table. I would like to be able to lookup into the hash table directly from the file (at least when accessing the archive in read only mode). I'm aware that adding a file will require me to load the table into memory (or copy it into a temporary file) since it will be overwritten by the new file. It would be great not to have to load the entire ToC into memory.

My question is how would you manage collisions ? I was thinking about linear probing. Each table entry would have an offset to the next entry of the "chain" so if some entry is hashed to the same value as an existing one, the chain would be searched to find the next available spot.

This looks great at first, but what if an entry is hashed to another chain ? I could put it in that chain and just look for it normally i guess.

I doubt this is relevant to the discussion but i'm using C++. I also have a couple of good low collision hash functions for strings and integers. And yes, i know i could use zlib with minizip or other existing solutions but rolling our own format allows us to keep external dependencies to a minimum and to have some interesting features (block compressed streaming and special attributes to name a few).

I would be interested to any feedback and ideas.

Tank you for you replies.

Laval Bolduc
We think in generalities, but we live in details.
- Alfred North Whitehead
Advertisement
When building an archive, you emit an error if there's a collision. Rename the offending file.

When building an archive, you emit an error if there's a collision. Rename the offending file.




I would rather not, it would be annoying when using naming conventions for files to have to rename one or two.
We think in generalities, but we live in details.
- Alfred North Whitehead
It sounds like you're thinking that the hash table must be represented on disk the same way it is in memory. This does not have to be the case.
I always prefer the "separate chaining" techinque, and you can represent that on disk by having a count for the number of items in the chain, preceeding those items. Loading is still fast, usage is still fast, disk-wise you actually have a solution that works.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
[quote name='Antheus' timestamp='1304800549' post='4807792']When building an archive, you emit an error if there's a collision. Rename the offending file.
I would rather not, it would be annoying when using naming conventions for files to have to rename one or two.[/quote]Instead of renaming the offending file, change your hash seed/salt, then hash all the names again. Repeat until there's no collisions. Write the salt/seed as part of the ToC.
When looking up a file-name, use the ToC's seed to hash the name / add the ToC's salt to the name before hashing it, then use the resulting hash when searching the ToC.

You can use this technique across your entire game so that when you ship, all filenames are replaced with perfect hashes.

BTW - Is this for use in a game, or a generic archiver? If it's for a game, you only really need read-only mode and create-from-scratch mode. Supporting editable archives is an unneeded pain.
@iMalc :

This sounds like an interesting idea, i never thought about it.

@Hodgman :


I don't know what seed and salt means but i'll read about it. There was an article about hash functions where the word was mention but i didn't pay much attention to it. And yes, it's mostly for a game which makes the read mode very important. It may be possible to write some stuff (like like user preferences and user name for loging) but writting should be rare. I admit that thinking about write mode gives me a headache.

Thank you for your answers, it got me on the right track i think.
We think in generalities, but we live in details.
- Alfred North Whitehead

This topic is closed to new replies.

Advertisement