Jump to content
  • Advertisement
Sign in to follow this  
Laval B

Disc Based Hash Table

This topic is 2599 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hello everyone

I'm working on an archive file format. Files are stored as consecutive[font="arial, helvetica, clean, sans-serif"] [/font]"records" like in a zip archive but with some different attributes. This is all pretty straightforward. Now i want to add a ToC to the archive for fast lookup of a particular file. Of course, this ToC will be placed at the end of the archive. I was thinking about implementing this ToC as a hash table. I would like to be able to lookup into the hash table directly from the file (at least when accessing the archive in read only mode). I'm aware that adding a file will require me to load the table into memory (or copy it into a temporary file) since it will be overwritten by the new file. It would be great not to have to load the entire ToC into memory.

My question is how would you manage collisions ? I was thinking about linear probing. Each table entry would have an offset to the next entry of the "chain" so if some entry is hashed to the same value as an existing one, the chain would be searched to find the next available spot.

This looks great at first, but what if an entry is hashed to another chain ? I could put it in that chain and just look for it normally i guess.

I doubt this is relevant to the discussion but i'm using C++. I also have a couple of good low collision hash functions for strings and integers. And yes, i know i could use zlib with minizip or other existing solutions but rolling our own format allows us to keep external dependencies to a minimum and to have some interesting features (block compressed streaming and special attributes to name a few).

I would be interested to any feedback and ideas.

Tank you for you replies.

Laval Bolduc

Share this post


Link to post
Share on other sites
Advertisement
When building an archive, you emit an error if there's a collision. Rename the offending file.

Share this post


Link to post
Share on other sites

When building an archive, you emit an error if there's a collision. Rename the offending file.




I would rather not, it would be annoying when using naming conventions for files to have to rename one or two.

Share this post


Link to post
Share on other sites
It sounds like you're thinking that the hash table must be represented on disk the same way it is in memory. This does not have to be the case.
I always prefer the "separate chaining" techinque, and you can represent that on disk by having a count for the number of items in the chain, preceeding those items. Loading is still fast, usage is still fast, disk-wise you actually have a solution that works.

Share this post


Link to post
Share on other sites
[quote name='Antheus' timestamp='1304800549' post='4807792']When building an archive, you emit an error if there's a collision. Rename the offending file.
I would rather not, it would be annoying when using naming conventions for files to have to rename one or two.[/quote]Instead of renaming the offending file, change your hash seed/salt, then hash all the names again. Repeat until there's no collisions. Write the salt/seed as part of the ToC.
When looking up a file-name, use the ToC's seed to hash the name / add the ToC's salt to the name before hashing it, then use the resulting hash when searching the ToC.

You can use this technique across your entire game so that when you ship, all filenames are replaced with perfect hashes.

BTW - Is this for use in a game, or a generic archiver? If it's for a game, you only really need read-only mode and create-from-scratch mode. Supporting editable archives is an unneeded pain.

Share this post


Link to post
Share on other sites
@iMalc :

This sounds like an interesting idea, i never thought about it.

@Hodgman :


I don't know what seed and salt means but i'll read about it. There was an article about hash functions where the word was mention but i didn't pay much attention to it. And yes, it's mostly for a game which makes the read mode very important. It may be possible to write some stuff (like like user preferences and user name for loging) but writting should be rare. I admit that thinking about write mode gives me a headache.

Thank you for your answers, it got me on the right track i think.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

Participate in the game development conversation and more when you create an account on GameDev.net!

Sign me up!