[C++] Problems designing a proper hash table w/ chaining

deus.ex.nova

569

Author

December 09, 2012 01:41 AM

Hey Guys,

I'm trying to design a my own hash table data structure using chaining, but my main issue is designing flexible components that can serve my hash table class. Before anyone asks why I'm trying to make my own hash table, this is part of an assignment for one of my classes, and I figured that a custom hash table would help in future game projects (I heard hash tables were commonly used for managing game assets).

My hash table contains a sorted linked list class that I also created for a previous assignment, but I'm trying to figure out how I would deal with a look up in the hash table. My first idea was to implement a find function in my list class, but I can't figure out how to design it so that it is flexible enough for my hash table to use (i.e. if my hash table decided to do a look up with a string rather instead of a number, and vice versa). Please feel free to critique my class design. This is all a learning process so I would love to hear from more experienced programmers.

*EDIT* Just added a "Front()" function to provide access to the first node in the list.

Here is my list class. My get function should be returning a ptr to the data object, right?:



/************************************************************************

* SortedList class template.

* Will manage and sort inherent C++ types.

* *NOTE* Overloaded operator<() must be available in custom

* class objects in order for sorting to work properly.

************************************************************************/

template <typename T, typename Node = ListNode<T>>

class SortedList

{

public:

/** Class constructor.

  */

SortedList();



/** Class destructor.

  * Deallocates heap memory generated by the class in addition to

  * destroying the class instance.

  */

~SortedList();



/** List manipulator.

  *  Inserts an item to the list in sorted order.

  *  @param item An item of the same type as the list.

  */

void Insert(const T &newItem);



/** Iterator

  * Returns a ptr to the beginning of the list.

  */

Node* Front();



/** List manipulator.

  * Empties the entire list i.e. pops off all items.

  */

void Clear();



/** Utility function.

  * Returns true if the list is empty, else returns false.

  */

bool IsEmpty() const;



/** Utility function.

  * Returns the number of elements in the list.

  */

int Size() const;



private:

unsigned int m_count; ///< Keeps track of how many items in the list.

Node<T> *m_pFront;   ///< Points to the front of the list.



/** List manipulator.

  * Takes a an item off the front of the list, and returns it.

  * @throw (char*) Throws a char* exception if called against an empty list.

  */

T Pop();

};



/************************************************************************

* Method Definitions

************************************************************************/

template <typename T, typename Node>

SortedList<T, Node>::SortedList()  : m_count(0), m_pFront(nullptr) {}



template <typename T, typename ListNode>

SortedList<T, Node>::~SortedList()

{

Clear();

}



template <typename T, typename Node>

void SortedList<T, Node>::Insert(const T &newItem)

{

// Make new node the front if it has the 'lowest' value, or list is empty

if (m_pFront == nullptr || newItem < *(m_pFront->Data()))

  m_pFront = new Node(newItem, m_pFront);

else

{

  Node *pLinkingNode = m_pFront; // newNode will be pointed to by this.

  // Keep iterating until we find the 'in-order' pos, until we hit the end.

  while (pLinkingNode->Next() != nullptr &&

   *(pLinkingNode->Next()->Data()) < newItem )

  {

   pLinkingNode = pLinkingNode->Next();

  }

  // Insert the node in its appropriate position

  Node *pNewNode = new Node(newItem, pLinkingNode->Next());

  pLinkingNode->SetNext(pNewNode);

}

m_count++;

}



template <typename T, typename Node>

inline Node* SortedList<T, Node>::Front() const

{

return m_pFront;

}



template <typename T, typename Node>

void SortedList<T, Node>::Clear()

{

while (m_count != 0)

  Pop();

}



template <typename T, typename Node>

inline bool SortedList<T, Node>::IsEmpty() const

{

if (m_count == 0)

  return true;

else

  return false;

}



template <typename T, typename Node>

inline int SortedList<T, Node>::Size() const

{

return m_count;

}



template <typename T, typename Node>

void SortedList<T, Node>::Print() const

{

if (m_count == 0)

{

  std::cout << "There are no items in the list.\n";

}

else

{

  Node* pCurrentNode = m_pFront;

  for (unsigned int i = 0; i < m_count; i++)

  {

   std::cout << *(pCurrentNode->Data()) << std::endl;

   pCurrentNode = pCurrentNode->Next();

  }

}

}



template <typename T, typename Node>

T SortedList<T, Node>::Pop()

{

// Proceed with pop unless the list is empty

if (m_count > 0)

{

  T tempItem = *(m_pFront->Data());  // Hold item to be returned

  // Preserve pointer to node after the front and delete the current front

  Node *pTempNode = m_pFront;

  m_pFront = m_pFront->Next();

  delete pTempNode;

  m_count--;

  return tempItem;

}

else

{

  throw std::runtime_error("Error: Cannot pop an item from an empty list!");

}

}

Here is my default node structure for my list (Is it overkill to encapsulate the data members of the node?):



template <typename T>

class ListNode

{

public:

/** Class constructor.

  * @param item  Object to store inside the node.

  * @param pNext Ptr to the following node.

  */

ListNode(const T &item, ListNode *pNext = nullptr)

  : m_data(&item), m_pNext(pNext) {}



/** Class destructor

  * Deallocates heap memory generated by the class in addition to

  * destroying the class instance.

  */

~ListNode() {}



/** Getter function.

  * Returns a ptr to the object stored in the node.

  */

T* Data() const { return &m_data; }



/** Getter function.

  * Returns a ptr to the next node.

  */

ListNode* Next() const { return m_pNext; }



/** Setter function.

  * Changes the next node pointed to.

  * @param pNext Ptr to new node.

  */

void SetNext(ListNode *pNext) { m_pNext = pNext; }



protected:

T m_data;   ///< Stores node's value.

ListNode *m_pNext; ///< Points to next node, if any.

};

Lastly, here is the current design of my hash table class. I haven't actually implemented anything here yet, since I'm still redesigning the list class:



template <typename Data, typename Key>

class ChainedHashTable

{

public:

ChainedHashTable( int size = 11 );

~ChainedHashTable();



/** Hashing func.

  * Returns the hash value according to a hash key.

  * @param key The seed value used to hash.

  */

int Hash( Key key ) const;



/** Insertion func.

  * Use this to add items to the hash table.

  * @param item Const reference to object being added to the table.

  */

void Insert( const Data &data, Key key );



/** Search func.

  * Returns the index value of the object in question.

  */

Data* LookUp( Key key ) const;



private:

std::vector<SortedList> *m_pTable;



/** Collision resolution func.

  * Inserts data into the hashed chain, in order.

  * @param index Value of the index where the collision occurred.

  */

void Resolve( int index ) const;

};

the_edd

2,109

December 09, 2012 03:18 AM

My hash table contains a sorted linked list class that I also created for a previous assignment,

Does it need to be sorted? If the load factor is low enough (or adaptive/user-modifyable), this is quite possibly a performance pessimization. It shouldn't affect correctness though, but it does mean any key that is put in to the table needs to support operator<, std::less, or whatever your sorted list uses. This requirement is abnormal for a hash table, at least in my experience.

but I'm trying to figure out how I would deal with a look up in the hash table.
[/quote]
Well, you need a generic function of some kind that returns a hash for a key. You might use a traits mechanism to allow custom hashing for arbitrary types. Sensible default traits could be provided for common key types (strings, integers, etc). If you don't know what a "traits mechanism" is, feel free to ask.

The other thing you need is an equality comparison for the keys. operator== should suffice for starters, as it can be overloaded for arbitrary types.

My first idea was to implement a find function in my list class, but I can't figure out how to design it so that it is flexible enough for my hash table to use (i.e. if my hash table decided to do a look up with a string rather instead of a number, and vice versa).
[/quote]
Well you only need to find by comparing keys. Again, using operator== is probably a good starting point. Although, presumably you'll need to keep a sorted list of something like std::pair<Key, Value> and using operator== on a pair won't do the right thing.

If your list wasn't sorted, you could expose an iterator interface, like std::list, so that you can use std::find_if() with an appropriate predicate. But because it is sorted, I can see the benefit of providing a find_if() method. It could take a predicate, like the generic std::find_if() function.

Here is my list class. My get function should be returning a ptr to the data object, right?:
[/quote]
Which get function? I don't see one.

I skim-read the code. Here are some comments.

template <typename T, typename Node = ListNode<T>> class SortedList { public: [...] Node* Front();
[/quote]

Do you want to expose the nodes? Or just the values they contain? By returning Node objects, the user can "accidentally" modify any next/prev pointers.

The only reason I can think of to provide my own node type would be for "intrusive links", but your SortedList doesn't seem to allow that (which is fine). Is the Node template parameter necessary?

int Size() const;[/quote]
Is int the best type to use here?

T Pop();[/quote]
I'm just going to post this slightly insidious link. It's probably a bit mean at this stage, but if you can follow it and realise why I've posted it after your Pop() method, you're doing well

I note that you haven't provided a copy constructor or assignment operator. If you don't want to do that at this stage, you should explicitly make your class uncopyable. The way you do this in C++98 (private, unimplemented methods) and C++11 (putting "= delete" at the end of the method declarations) is different though the former way will still work.

template <typename T, typename Node> inline bool SortedList<T, Node>::IsEmpty() const { if (m_count == 0) return true; else return false; }[/quote]

Personally, I find "return m_count == 0;" to be more idiomatic, but that's a minor thing.

template <typename T, typename Node> void SortedList<T, Node>::Print() const { if (m_count == 0) { std::cout << "There are no items in the list.\n"; } else { Node* pCurrentNode = m_pFront; for (unsigned int i = 0; i < m_count; i++) { std::cout << *(pCurrentNode->Data()) << std::endl; pCurrentNode = pCurrentNode->Next(); } } }[/quote]
Useful for debugging I suspect, but slightly odd (though maybe it was a requirement for the assignment?).

What if I want to comma-separate my output? What If I want to print it to a string or a file? If you provided an iteration interface, I could do this.

Here is my default node structure for my list (Is it overkill to encapsulate the data members of the node?):
template <typename T> class ListNode { public: /** Class constructor. * @param item Object to store inside the node. * @param pNext Ptr to the following node. */ ListNode(const T &item, ListNode *pNext = nullptr) : m_data(&item), m_pNext(pNext) {} /** Class destructor * Deallocates heap memory generated by the class in addition to * destroying the class instance. */ ~ListNode() {}[/quote]
Technically a destructor call doesn't release the memory associated with *this, so the comment isn't entirely accurate.

/** Getter function. * Returns a ptr to the object stored in the node. */ T* Data() const { return &m_data; }[/quote]
IMHO, this should return a reference. In situations where it's possible for there not to be a value, I can understand the choice to return a pointer. But that's not the case here.

protected: T m_data; ///< Stores node's value. ListNode *m_pNext; ///< Points to next node, if any.[/quote]
Prefer 'private' to 'protected', for a couple of reasons.

First, it's easier to change things from private to protected than the other way around. If something starts out protected and later you decide you really did mean private, then you might have other code depending on that member which you'll now have to modify. However, if you start out with private, switching to protected won't break any other code. But do you ever envision that someone will actually inherit your ListNode class, which is afterall the only reason we'd ever want to make something protected? If the answer is 'no', then private is definitely the way to go. If that assumption is wrong (and I doubt it is), we can easily change to protected later. However...

'protected' is really much closer to 'public' than it is to 'private', as it still allows access to parts of your class. The code that's actually accessing the protected stuff is probably a different audience, but that's the only difference. So you shoud still try provide a well-thought-out interface, just like you do with public stuff, where implementation details are hidden to a practical degree.

In this particular case, you can already access the data through Data() and get/set the next node, so there's really no need for protected at all.

I think ListNode should probably be un-copyable.

template <typename Data, typename Key> class ChainedHashTable[/quote]
It's more common to have the key appear first, though no biggie.

{ public: ChainedHashTable( int size = 11 );[/quote]
11?
Is int the best type for the size?
Do you know about explicit constructors?

int Hash( Key key ) const;[/quote]
Is int the best type for the return value?
Should key be passed by reference-to-const?

void Insert( const Data &data, Key key );[/quote]
Again, it's normal for the key to come first.
Should key be passed by reference-to-const?

Data* LookUp( Key key ) const;[/quote]
Should key be passed by reference-to-const?
Should the return type be const-qualified?

private: std::vector<SortedList> *m_pTable;[/quote]
Perhaps a typo, but do you really want a pointer to a vector, here?

/** Collision resolution func. * Inserts data into the hashed chain, in order. * @param index Value of the index where the collision occurred. */ void Resolve( int index ) const;[/quote]
I don't understand this function, or its description. Inserts what data into the hashed chain? Missing a parameter, perhaps?
Should the method be const?

http://www.mr-edd.co.uk
http://bitbucket.org/edd

the_edd

2,109

December 09, 2012 03:23 AM

Grrrr. Yet again, I hit "Quote" rather than "Edit". Sorry for the noise.

http://www.mr-edd.co.uk
http://bitbucket.org/edd

deus.ex.nova

569

Author

December 09, 2012 04:52 AM

Does it need to be sorted?

That was a requirement from my professor. I don't understand the necessity of the values being sorted either, since I will have to check every node until I find the item I'm looking for anyways. The list need to be ordered by their keys, I believe. Keys are only ever numbers and strings, right?

Well, you need a generic function of some kind that returns a hash for a key. You might use a traits mechanism to allow custom hashing for arbitrary types. Sensible default traits could be provided for common key types (strings, integers, etc).

The other thing you need is an equality comparison for the keys. operator== should suffice for starters, as it can be overloaded for arbitrary types.

Could you clarify what you mean by a "traits mechanism"? I'm unfamiliar with the concept. Perhaps a simple example would clear things up. I do indeed need an overloaded operator==(), only to compare the key held inside the data object though.

Which get function? I don't see one.

I meant in the ListNode class, line 21. Sorry, I wrote my question on the wrong block of code.

Do you want to expose the nodes? Or just the values they contain? By returning Node objects, the user can "accidentally" modify any next/prev pointers.

The only reason I can think of to provide my own node type would be for "intrusive links", but your SortedList doesn't seem to allow that (which is fine). Is the Node template parameter necessary?

I can't seem to decide, because I don't have enough experience to know if I would ever need to expose the node or not. When would exposing the node's data object be an issue in game programming? I definitely don't want the 'next' pointer to be handled by anything but the data structure it is serving (i.e. list).

Quote
?
1
int Size() const;
Is int the best type to use here?

Should I be using something less expensive than that? Short int? Or something with higher resolution? Since hash tables apparently can get pretty crazy in size, I thought int would an okay data type for size, but please explain why you think otherwise. My brain is a sponge right now.

Quote
?
1
T Pop();
I'm just going to post this slightly insidious link. It's probably a bit mean at this stage, but if you can follow it and realise why I've posted it after your Pop() method, you're doing well

I note that you haven't provided a copy constructor or assignment operator. If you don't want to do that at this stage, you should explicitly make your class uncopyable. The way you do this in C++98 (private, unimplemented methods) and C++11 (putting "= delete" at the end of the method declarations) is different though the former way will still work.

I will get back to you regarding the Pop() function after I finish reading the article.

A copy constructor/assignment operator for my list and hash table would be a good idea.

I seemed to have forgotten some of the basics. Thanks.

..."return m_count == 0"... lol duh

Useful for debugging I suspect, but slightly odd (though maybe it was a requirement for the assignment?).

What if I want to comma-separate my output? What If I want to print it to a string or a file? If you provided an iteration interface, I could do this.

Yes a print function was a requirement for a previous assignment, but I just made a poor design decision at the time and should have left the print implementation outside the class.

Technically a destructor call doesn't release the memory associated with *this, so the comment isn't entirely accurate.

I see. TIL++.

/** Getter function.
* Returns a ptr to the object stored in the node.
*/
T* Data() const { return &m_data; }
IMHO, this should return a reference. In situations where it's possible for there not to be a value, I can understand the choice to return a pointer. But that's not the case here.

Wouldn't I have to feed the function an external variable then? I can't think of any other way to avoid making the function more complicated. Is the extra parameter just a necessary evil?

Prefer 'private' to 'protected', for a couple of reasons.

First, it's easier to change things from private to protected than the other way around. If something starts out protected and later you decide you really did mean private, then you might have other code depending on that member which you'll now have to modify. However, if you start out with private, switching to protected won't break any other code. But do you ever envision that someone will actually inherit your ListNode class, which is afterall the only reason we'd ever want to make something protected? If the answer is 'no', then private is definitely the way to go. If that assumption is wrong (and I doubt it is), we can easily change to protected later. However...

'protected' is really much closer to 'public' than it is to 'private', as it still allows access to parts of your class. The code that's actually accessing the protected stuff is probably a different audience, but that's the only difference. So you shoud still try provide a well-thought-out interface, just like you do with public stuff, where implementation details are hidden to a practical degree.

In this particular case, you can already access the data through Data() and get/set the next node, so there's really no need for protected at all.

I think ListNode should probably be un-copyable.

Good point. That code was private initially, but I was being hopeful. Is making ListNode uncopyable just a 'correctness' thing like const, or is there some problem that can arise from it being copyable?

public:
ChainedHashTable( int size = 11 );
11?
Is int the best type for the size?
Do you know about explicit constructors?

I still don't understand why I should use something other than an integer, and by explicit constructors you mean things like the copy and assignment constructors? What does that have to do with using a default parameter? 11 was some arbitrary prime number I chose to start the hash table with.

Should key be passed by reference-to-const?

I suppose a ref to const wouldn't hurt, especially if the key is a string. Again, would the key be anything other than a number or string?

Data* LookUp( Key key ) const;
Should key be passed by reference-to-const?
Should the return type be const-qualified?

I took the const after the function declaration to be a promise to the programmer that the the function will not modify the members of the class...So the fact that I'm returning a non-const pointer to the Data, makes it a non-const function?

private:
std::vector *m_pTable;
Perhaps a typo, but do you really want a pointer to a vector, here?

I was unsure what is generally done, so I've been flip-flopping on whether or not to make the vector object reside inside the class, or pointed to by the class. I guess the general case is to just instantiate the object inside the class. Then again, I guess it's a moot point to point to the vector, since the vector is itself already pointing to a dynamic array.

/** Collision resolution func.
* Inserts data into the hashed chain, in order.
* @param index Value of the index where the collision occurred.
*/
void Resolve( int index ) const;
I don't understand this function, or its description. Inserts what data into the hashed chain? Missing a parameter, perhaps?
Should the method be const?

It's supposed to be my collision resolution function. My plan was to have the function take in the index value where the collision occurred, and then insert the new object being allocated into the list existing at that index.

+50 to you sir for dissecting my code and critiquing the design. I understand that I have much to learn about proper and conventional coding, so I'm really glad that you devoted some of your time to help me out.

deus.ex.nova

569

Author

December 09, 2012 04:54 AM

Grrrr. Yet again, I hit "Quote" rather than "Edit". Sorry for the noise.

Which part did you edit? You may have done it while I was writing a response.
*EDIT* Since you helped me out so much, I gave you a +1 on your random post.

the_edd

2,109

December 09, 2012 06:13 AM

Keys are only ever numbers and strings, right?

Those are the most common types, but they're not the only ones. Could be anything, really. You might want to use a 2D coordinate as a key, for example.

Could you clarify what you mean by a "traits mechanism"? I'm unfamiliar with the concept. Perhaps a simple example would clear things up. I do indeed need an overloaded operator==(), only to compare the key held inside the data object though.
[/quote]
Sure. You have a template class like this:

template<typename Key> struct HashTraits; // only declared, no body for the primary template.

It has no definition, but you can specialize it for types that need to work with the hash table:

template<> struct HashTraits<std::string> { static std::size_t Hash(const std::string &s) { /* ... */ } static bool Equal(const std::string &s1, const std::string &s2) { /* ... */ } }; template<> struct HashTraits<int> { static std::size_t Hash(int i) { /* ... */ } static bool Equal(int i1, int i2) { /* ... */ } }; // and so on...
EDIT: changed use of uintptr_t to std::size_t.

Then, if the hash table needs to hash an object it does so by calling "HashTraits<Key>::Hash(key)". A similar thing is done for equality comparisons when searching buckets for matching keys. If there's no HashTraits specialization for the Key type available you get a compiler error, indicating that you need to write one. There are some other options in the design space, here. You might have two separate traits templates, one for the hash and one for equality. Or you might forgo the equality part altogether and just rely on the existence of an operator==.

Now, rather than using traits, you could also simply have the requirement that the user must ensure that there's a Hash() function that takes a Key and returns a hash value, rather like you have the operator< requirement for your SortedList. However, some care must be taken as e.g. Hash(const std::string &) might get called even if the Key type is "const char *", due to C++ overload resolution. This may not be the desired behaviour. Using the traits stuff avoids this problem because there has to be a template specialization for the exact type.

So there's a tradeoff between simplicity and correctness.

Do you want to expose the nodes? Or just the values they contain? By returning Node objects, the user can "accidentally" modify any next/prev pointers.

The only reason I can think of to provide my own node type would be for "intrusive links", but your SortedList doesn't seem to allow that (which is fine). Is the Node template parameter necessary?

I can't seem to decide, because I don't have enough experience to know if I would ever need to expose the node or not. When would exposing the node's data object be an issue in game programming? I definitely don't want the 'next' pointer to be handled by anything but the data structure it is serving (i.e. list).
[/quote]
At times, it can be useful to expose the node type (for example, when you've got an intrusive list, as I mentioned).

If you don't have an explicit use-case in mind for the Node, I would hide it for the time being, if possible. As with the protected/private stuff, it's easier to expose stuff at a later date than it is to hide it, given that people may have started to depend on it.

int Size() const;
Is int the best type to use here?

Should I be using something less expensive than that? Short int? Or something with higher resolution?
[/quote]
No, int should be plenty fast. But why not unsigned int? Or std::size_t?

Since hash tables apparently can get pretty crazy in size, I thought int would an okay data type for size, but please explain why you think otherwise. My brain is a sponge right now.
[/quote]
Indeed, any container can get large, depending on what you put in it So you want to choose a type that can index the largest possible array you can fit in memory. In other words, you really want an integral type that's the same size as a pointer. On pretty much any modern system in the world, std::size_t will fit the bill.

/** Getter function.
* Returns a ptr to the object stored in the node.
*/
T* Data() const { return &m_data; }
IMHO, this should return a reference. In situations where it's possible for there not to be a value, I can understand the choice to return a pointer. But that's not the case here.

Wouldn't I have to feed the function an external variable then? I can't think of any other way to avoid making the function more complicated. Is the extra parameter just a necessary evil?
[/quote]
I Think we're getting our wires crossed. All I'm really suggesting is that the function be changed to "const T &Data() const { return m_data; }" or "T &Data() { return m_data; }". Actually, now that I look at it again, I'm surprised that the original function even compiles (as there should be a 'const' before the 'T *'). Is this Data() function used anywhere, at present?

Is making ListNode uncopyable just a 'correctness' thing like const, or is there some problem that can arise from it being copyable?[/quote]
Yes
What does it mean to copy a node? If the answer is "that's nonsense", then it would be best to statically disallow it. It's better to be notified of attempted nonsense at compile time than at run time.

by explicit constructors you mean things like the copy and assignment constructors?
[/quote]
No, there's a C++ keyword, "explicit".

At present, I can do weird stuff like this:
void foo(const ChainedHashTable<int, int> &tab); // suppose this function exists // ... foo(42); // compiler allows this, as ChainedHashTable's constructor is not explicit!

This is the same mechanism that allows you to pass a "const char *" to a function that takes an std::string. For your constructor, the automatic conversion is probably not wanted!

11 was some arbitrary prime number I chose to start the hash table with.
[/quote]
I see. Presumably the number of buckets is going to change during the life of a hash table, either automatically, or at the user's request(?).

Data* LookUp( Key key ) const;
Should the return type be const-qualified?

I took the const after the function declaration to be a promise to the programmer that the the function will not modify the members of the class...So the fact that I'm returning a non-const pointer to the Data, makes it a non-const function?
[/quote]
Well, let me put is this way. You hand me a hash table, which you say is const. Should I be able to change its content (without resorting to nasty hacks)?

The idiomatic answer for C++ is no, I should not be able to change its content. Sure, the contents of the hash table aren't direct members of the structure, but that sounds like you're just trying to weasel your way out of keeping things const-correct ;)

What often happens is that two overloads are provided for things like this:
const Data* LookUp( Key key ) const; Data* LookUp( Key key );

private:
std::vector *m_pTable;
Perhaps a typo, but do you really want a pointer to a vector, here?

I guess the general case is to just instantiate the object inside the class.
[/quote]
Yeah, using a pointer is just going to require more book-keeping. Note that if you implement a copy constructor and assignment operator for SortedList, you get them for free if the vector is held by value, rather than by pointer.

EDIT: actually, I now realise a custom assignment operator will still be required if you want to provide the strong exception safety guarantee.

void Resolve( int index ) const;

It's supposed to be my collision resolution function. My plan was to have the function take in the index value where the collision occurred, and then insert the new object being allocated into the list existing at that index.
[/quote]
Up until this point you've got something approaching a very general hash table. So it's a bit weird to have something this specific here, IMHO. Is there any way you can move this specific function outside of the hash table?

I understand that I have much to learn about proper and conventional coding,[/quote]
That never ends

so I'm really glad that you devoted some of your time to help me out. [/quote]
No problem, keep plugging away!

http://www.mr-edd.co.uk
http://bitbucket.org/edd

deus.ex.nova

569

Author

December 09, 2012 10:17 PM

It has no definition, but you can specialize it for types that need to work with the hash table:

?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
template<>
struct HashTraits
{
static std::size_t Hash(const std::string &s) { /* ... */ }
static bool Equal(const std::string &s1, const std::string &s2) { /* ... */ }
};

template<>
struct HashTraits
{
static std::size_t Hash(int i) { /* ... */ }
static bool Equal(int i1, int i2) { /* ... */ }
};

// and so on...
EDIT: changed use of uintptr_t to std::size_t.

Then, if the hash table needs to hash an object it does so by calling "HashTraits::Hash(key)". A similar thing is done for equality comparisons when searching buckets for matching keys. If there's no HashTraits specialization for the Key type available you get a compiler error, indicating that you need to write one. There are some other options in the design space, here. You might have two separate traits templates, one for the hash and one for equality. Or you might forgo the equality part altogether and just rely on the existence of an operator==.

Now, rather than using traits, you could also simply have the requirement that the user must ensure that there's a Hash() function that takes a Key and returns a hash value, rather like you have the operator< requirement for your SortedList. However, some care must be taken as e.g. Hash(const std::string &) might get called even if the Key type is "const char *", due to C++ overload resolution. This may not be the desired behaviour. Using the traits stuff avoids this problem because there has to be a template specialization for the exact type.

So there's a tradeoff between simplicity and correctness.

Thanks for the explanation on traits, it seems like a very useful strategy. Also I'm assuming you're saying that the more correct way is to require that there is an existing hash function for their key type? Would using a function pointer parameter with the hash function help solve issues with custom types, or is using traits the more common approach?

Quote
e?dd, on 08 December 2012 - 07:18 PM, said:
/** Getter function.
* Returns a ptr to the object stored in the node.
*/
T* Data() const { return &m_data; }
IMHO, this should return a reference. In situations where it's possible for there not to be a value, I can understand the choice to return a pointer. But that's not the case here.
Wouldn't I have to feed the function an external variable then? I can't think of any other way to avoid making the function more complicated. Is the extra parameter just a necessary evil?
I Think we're getting our wires crossed. All I'm really suggesting is that the function be changed to "const T &Data() const { return m_data; }" or "T &Data() { return m_data; }". Actually, now that I look at it again, I'm surprised that the original function even compiles (as there should be a 'const' before the 'T *'). Is this Data() function used anywhere, at present?

I see. I didn't realize that doing something like T& Data() { return m_data; } was legal. From what I recalled about returning by reference, I thought that you had to do something like:
T& Data( T &data )
{
data = m_data;
return data;
}
I must have missed something...
Also, I haven't actually started compiling yet (bad idea?). I've just been designing the data structures I need for my assignment, so no, the Dat() function is not being used anywhere yet.

Quote
by explicit constructors you mean things like the copy and assignment constructors?
No, there's a C++ keyword, "explicit".

At present, I can do weird stuff like this:
?
1
2
3
4
5
void foo(const ChainedHashTable &tab); // suppose this function exists

// ...

foo(42); // compiler allows this, as ChainedHashTable's constructor is not explicit!

This is the same mechanism that allows you to pass a "const char *" to a function that takes an std::string. For your constructor, the automatic conversion is probably not wanted!

The compiler allows that? So this is basically a concern for any custom type, huh? Does the explicit keyword only apply to the constructor, including the copy constructor and assignment operator?

Presumably the number of buckets is going to change during the life of a hash table, either automatically, or at the user's request(?).

Honestly now that I think about it, resizing the hash table after data has been added is probably a bad idea. It screws up the distribution of the data that the hashing algorithm is trying to avoid. So maybe I should just not include that option.

Well, let me put is this way. You hand me a hash table, which you say is const. Should I be able to change its content (without resorting to nasty hacks)?

The idiomatic answer for C++ is no, I should not be able to change its content. Sure, the contents of the hash table aren't direct members of the structure, but that sounds like you're just trying to weasel your way out of keeping things const-correct ;)

What often happens is that two overloads are provided for things like this:
?
1
2
const Data* LookUp( Key key ) const;
Data* LookUp( Key key );

Okay, so why would anyone include an overloaded function that grants access to the data if there is a similar function that is trying to protect the data? It seems counter-intuitive to me.

Up until this point you've got something approaching a very general hash table. So it's a bit weird to have something this specific here, IMHO. Is there any way you can move this specific function outside of the hash table?

Like I mentioned before in this post, my first inclination is to use function pointers, but your traits concept seems like a good solution as too.

Thanks again for replying to my questions. These changes will be very helpful. Using std::size_t instead of unsigned int or regular int for some reason was mind-blowing.

The same goes for traits. +50 more to you!

*EDIT* What's with the quotes in my post? Unlike yours, the code boxes never copy over right. I've been using the "selective quote" feature.

rip-off

11,000

December 09, 2012 11:07 PM

Thanks for the explanation on traits, it seems like a very useful strategy. Also I'm assuming you're saying that the more correct way is to require that there is an existing hash function for their key type? Would using a function pointer parameter with the hash function help solve issues with custom types, or is using traits the more common approach?

Traits would be the idiomatic C++ solution to this. The reason is that function pointers incur additional runtime cost.

I see. I didn't realize that doing something like T& Data() { return m_data; } was legal. From what I recalled about returning by reference, I
thought that you had to do something like:
T& Data( T &data ) { data = m_data; return data; }
I must have missed something...
[/quote]
You can return a reference, provided the object being referenced outlives the function. For automatically allocated data (local variables), this is not the case. You can however, return references to global data, heap data or references that you have been passed. Member data also falls into this category, it becomes the caller's problem to ensure that the reference remains valid:
struct Example { Example(const std::string name) : m_name(name) { } const std::string &name() const { return m_name; } private: std::string m_name; }; void bad() { // Uh no, we're taking a reference to the data member.. const std::string &illegal = Example("Jim").name(); // ... but the temporary "Example" instance has just expired! // And we're dead... std::cout << illegal << std::endl; }

Also, I haven't actually started compiling yet (bad idea?). I've just been designing the data structures I need for my assignment, so no, the Dat() function is not being used anywhere yet.
[/quote]
I'd recommend compiling more often. The linked list code could be designed and tested in isolation, before trying to incorporate it into the more complex hash table. This way, you can be reasonably confident that any bugs in your hash table are in that code, because you've hopefully eliminated the errors in the linked list.

The compiler allows that? So this is basically a concern for any custom type, huh?
[/quote]
Yes. As edd indicated, this is what allows you to write:
void foo(const std::string &s) { /* ... */ } // ... foo("Hello, world");
Arguably, we should have an "implicit" keyword instead, with the potentially surprising behaviour not being the default. That said, I've never personally bothered with explicit functions myself, but I've not used C++ as a primary language for some time.

Does the explicit keyword only apply to the constructor, including the copy constructor and assignment operator?
[/quote]
You only need to think about it for single argument constructors. And as std::string shows, it can be quite convenient sometimes. Copy constructors are fine - explicit copy constructors don't sound like a good idea to me, I have not come across them before. You cannot have an explicit assignment operator.

Honestly now that I think about it, resizing the hash table after data has been added is probably a bad idea. It screws up the distribution of the data that the hashing algorithm is trying to avoid. So maybe I should just not include that option.
[/quote]
If you do not resize, further insertions are going to cause more and more collisions, which could cause the hash table to degenerate towards it's worst case (a linked list). This is precisely the opposite behaviour of what client code is written to expect - it will be written to expect almost constant time access.

Okay, so why would anyone include an overloaded function that grants access to the data if there is a similar function that is trying to protect the data? It seems counter-intuitive to me.
[/quote]
Is this counter intutive:
void inspect(const std::vector<int> &v) { if(!v.empty()) { std::cout << "The first element is: " << v[0] << std::endl; } } void modify(std::vector<int> &v) { if(!v.empty()) { v[0] = 42; } }
std::vector::operator[] is overloaded for const and non const access, allowing both client functions to use it naturally.

blog | twitter

the_edd

2,109

December 09, 2012 11:42 PM

Thanks for the explanation on traits, it seems like a very useful strategy. Also I'm assuming you're saying that the more correct way is to require that there is an existing hash function for their key type?

Of the stuff I mentioned, I prefer traits to having the user ensure an overloaded function is in scope, as it avoids accidental use of overloaded functions and eliminates the possibility of ODR violations (a somewhat technical subject to which I won't subject you unless provoked

).

Would using a function pointer parameter with the hash function help solve issues with custom types, or is using traits the more common approach?
[/quote]
As ripoff said, function pointers tend to be slower as in general they aren't very transparent to a compiler.

However, there's another design option, which is to use function objects and add a third template parameter to your hash table. We first provide define a Hash<T> template class with some useful specializations:

template<typename T> struct Hash; template<> struct Hash<std::string> { std::size_t operator() (const std::string &key) const { /* ... */ } }; template<> struct Hash<int> { std::size_t operator() (int key) const { /* ... */ } }; // and so on.

This is rather similar to the traits stuff so far, except that we need to create an instance of Hash<Key> and 'call' it to get a hash value, rather than calling a static function. Such an instance will be kept inside the hash table:

template<typename Key, typename Value, typename HashFunc = Hash<Key> > class HashTable { public: HashTable(const HashFunc &hash = HashFunc()) : buckets(/* ... */), hash(hash) { } //... private: std::vector< ... > buckets; HashFunc hash; };

Internally, whenever the HashTable needs to get the hash of a particular key, it 'calls' its hash object. This is preferable to the use of function pointers, as a compiler can better optimize calls to Hash<>::operator()s via inlining etc.

Note that we've given the HashFunc template parameter a 'sensible default'. This means the HashTable should work out the box for common key types. However, the user could provide their own type/object for hashing.

This is the kind of approach that's used in the standard library for containers. For example, std::map<K,V> uses std::less<K> to perform comparisons on keys by default, but you can plug in your own comparison function object if required. It's quite complicated (which is why I didn't mention it until now), but it does offer both flexibility and performance.

http://www.mr-edd.co.uk
http://bitbucket.org/edd

deus.ex.nova

569

Author

December 10, 2012 03:25 AM

You can return a reference, provided the object being referenced outlives the function. For automatically allocated data (local variables), this is not the case. You can however, return references to global data, heap data or references that you have been passed. Member data also falls into this category, it becomes the caller's problem to ensure that the reference remains valid:

Just to make sure I understand what you're saying. Could I turn your bad() function to a good() one by doing this:



void good()

{

	 // Creating an Example instance so that the reference remains valid in this scope.

	 Example jim = Example("Jim");

	 const std::string &legal = jim.name();

	 // Works because the "Example" instance still exists here!



	// And we're ALIVE...hooray!

	std::cout << legal << std::endl;

}

If you do not resize, further insertions are going to cause more and more collisions, which could cause the hash table to degenerate towards it's worst case (a linked list). This is precisely the opposite behaviour of what client code is written to expect - it will be written to expect almost constant time access.

What then would be the best way to deal with a lopsided distribution of data in the hash table, due to the resize? Rehashing the data while it was being copied over to a larger table COULD be a solution, but sounds like a painfully expensive operation. If there is a decent solution, is it best to leave the user/programmer using the class to manually resize the table, and/or implement an auto-resize mechanism like in std::vector? Lastly, game engines don't tend to use chaining do they, since it doesn't seem cache-friendly? 1-2-3-GO!

Is this counter intutive:
void inspect(const std::vector &v) { if(!v.empty()) { std::cout << "The first element is: " << v[0] << std::endl; } } void modify(std::vector &v) { if(!v.empty()) { v[0] = 42; } }
std::vector::operator[] is overloaded for const and non const access, allowing both client functions to use it naturally.

I think I see what you're saying. At first it still seems like one non-const solution would work for both situations, but I'm assuming this is again a "correctness"/proper style thing. So the compiler will always for for the const function, unless the user wishes to change a value.

However, there's another design option, which is to use function objects and add a third template parameter to your hash table. We first provide define a Hash template class with some useful specializations:

template struct Hash; template<> struct Hash { std::size_t operator() (const std::string &key) const { /* ... */ } }; template<> struct Hash { std::size_t operator() (int key) const { /* ... */ } }; // and so on.

This is rather similar to the traits stuff so far, except that we need to create an instance of Hash and 'call' it to get a hash value, rather than calling a static function. Such an instance will be kept inside the hash table:

template > class HashTable { public: HashTable(const HashFunc &hash = HashFunc()) : buckets(/* ... */), hash(hash) { } //... private: std::vector< ... > buckets; HashFunc hash; };

Internally, whenever the HashTable needs to get the hash of a particular key, it 'calls' its hash object. This is preferable to the use of function pointers, as a compiler can better optimize calls to Hash<>::operator()s via inlining etc.

Note that we've given the HashFunc template parameter a 'sensible default'. This means the HashTable should work out the box for common key types. However, the user could provide their own type/object for hashing.

This is the kind of approach that's used in the standard library for containers. For example, std::map uses std::less to perform comparisons on keys by default, but you can plug in your own comparison function object if required. It's quite complicated (which is why I didn't mention it until now), but it does offer both flexibility and performance.

Very interesting. I might just have to implement this for my hash table then. Thanks! Geez...where can I learn more of this stuff!?

Thank you both for extensively helping me with this, and allowing me to pick at your brains. It's been a great learning experience. I will go ahead and add traits for my hash table, and the additional template parameter that edd mentioned.

*EDIT* Sorry, I referred to edd's last suggestion as traits, but with a second look I'm no longer sure. Is there a particular name for this kind of design?

[C++] Problems designing a proper hash table w/ chaining

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

[C++] Problems designing a proper hash table w/ chaining

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines