[C++] Unrolled linked list help

Started by InvalidPointer. 8 comments, last by InvalidPointer 15 years, 4 months ago.
I've been making an unrolled linked-list class, not unlike the one described over at http://blogs.msdn.com/devdev/archive/2005/08/22/454887.aspx:
#include <cstdlib>

#define _USE_ASMLIB_MEMCPY_FUNCTIONS
//#define _NO_INTRINSIC_MEMCPY	// this really should only be disabled for support reasons

#if defined( _USE_ASMLIB_MEMCPY_FUNCTIONS )	// use Agner Fog's optimized memcpy() function for a small speed improvement
											// (The arrays can be made 16-byte aligned if absolutely necessary)
	#include "../MiscLibs/AsmLib/asmlib.h"
	#define MEMCPY A_memcpy
#elif !defined( _NO_INTRINSIC_MEMCPY )
	#pragma intrinsic(memcpy)
	#define MEMCPY memcpy
#endif

#if !defined( GMForceInline )
	#if defined ( _MSC_VER )
		#define GMForceInline __forceinline
	#else
		#define GMForceInline inline	// portable fallback for non-MSVC compilers
	#endif
#endif

#if !defined XORSwap
#define XORSwap( a, b ) (((a) == (b)) || (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b))))
#endif

template<class T>
class BlockArray
{
public:
	explicit BlockArray(const size_t &_blockWidth, const size_t _numBlocks) : blockWidth( _blockWidth ), numElements( _blockWidth * _numBlocks ), numBlocks( _numBlocks ), numFreeElements( _blockWidth * _numBlocks )	// numFreeElements initialized too (assuming every slot starts out free)
	{
		realArray = new T*[_numBlocks];
		for(size_t iter=0; iter<numBlocks; iter++)
		{
			realArray[iter] = new T[blockWidth];
		}
	}

	~BlockArray()
	{
		delete[] realArray;
		realArray = NULL;
		numFreeElements = numElements = 0;
	}

	GMForceInline size_t GetCurrentArrayWidth() const	// returns the current number of allocated elements in the array
	{
		return numElements;
	}

	GMForceInline size_t GetNumAllocatedBlocks() const	// returns the number of 'blocks' or 'pages' in the array
	{
		return numBlocks;
	}

	GMForceInline size_t GetBlockWidth() const	// returns the 'block width' or 'page size' of the array
	{
		return blockWidth;
	}

	GMForceInline T& operator[](const size_t &index)
	{
		//if(numElements < index) // basic bounds-checking
		//	return NULL;
		locationAssist = ldiv((long)index, (long)blockWidth);	// ldiv, since locationAssist is an ldiv_t
		return realArray[locationAssist.quot][locationAssist.rem];
	}

	GMForceInline bool AppendNewBlock()
	{
		// this is what I'm working on
	}
protected:

private:
	BlockArray(BlockArray &other);
	T** realArray;
	const size_t blockWidth;
	size_t numElements;
	size_t numBlocks;
	size_t numFreeElements;
	mutable ldiv_t locationAssist;	// this needs to be mutable, as it's how we locate stuff, regardless of const-ness
	// TODO: Maybe make some kind of hasher or something for a fast compare
};
#undef MEMCPY
As you may have guessed, I'd like to find an efficient way of doing the appending. My current idea is to allocate a new pointer array of the expanded size and memcpy() or memmove() the pointers to the existing chunks into the newly created one. The problem is that I can't come up with reasonable syntax for doing so without a for-loop headache. It really feels like there's something I'm missing here, so I'm asking the C++ gurus for help.
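For context, the straightforward for-loop version I'm trying to get away from would look something like this (untested sketch; it would replace the stub inside the class above):

	GMForceInline bool AppendNewBlock()
	{
		// Grow the pointer table by one and hang a fresh block off the end.
		T** newTable = new T*[numBlocks + 1];
		for(size_t iter=0; iter<numBlocks; iter++)	// this is the loop I'd like to get rid of
		{
			newTable[iter] = realArray[iter];		// shallow copy of the block pointers only
		}
		newTable[numBlocks] = new T[blockWidth];	// the freshly appended block
		delete[] realArray;							// old table only; the existing blocks carry over
		realArray = newTable;
		numBlocks++;
		numElements += blockWidth;
		numFreeElements += blockWidth;
		return true;
	}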
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.
You're not implementing the same thing as on the page you linked to. That site describes a linked-list of arrays, but you're implementing an array of pointers to arrays. Essentially a std::deque, iirc.

You also leak memory. Compare how many times you call new[] in the constructor versus how many times you call delete[] in the destructor.
There's no point setting values to zero in the destructor. The object is gone after the destructor has run. Given that you're incorporating specialised memcpy functions for speed, I'd think you'd be interested in removing unnecessary operations.

How about making blockWidth a template parameter instead of a constructor argument? Note that there isn't really any usefulness in marking the constructor explicit when it takes multiple arguments.
Also, if you're going to stick with an array of pointers to arrays, then you may as well make that pointer array dynamically resizeable, rather than having to specify numBlocks beforehand. When that array needs to grow, grow it by doubling its size; or better yet, use a vector of pointers to arrays so it does that for you. Then you could have no arguments in your constructor!
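Roughly something like this, for example (just a rough sketch of those two suggestions, keeping raw new[]'d blocks rather than anything fancier):

#include <cstddef>
#include <vector>

template<class T, size_t BlockWidth>	// block width as a template parameter
class BlockArray
{
public:
	BlockArray() {}	// no constructor arguments needed at all

	~BlockArray()
	{
		for(size_t i = 0; i < blocks.size(); ++i)
			delete[] blocks[i];	// free each block; the vector cleans itself up
	}

	void AppendNewBlock()
	{
		blocks.push_back(new T[BlockWidth]);	// the vector grows the pointer table for you
	}

	T& operator[](size_t index)
	{
		return blocks[index / BlockWidth][index % BlockWidth];
	}

private:
	std::vector<T*> blocks;	// dynamically resizable table of block pointers
};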

I'm cringing at the sight of an XORSwap btw. Much better to use std::swap.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Quote:Original post by iMalc
You're not implementing the same thing as on the page you linked to. That site describes a linked-list of arrays, but you're implementing an array of pointers to arrays. Essentially a std::deque, iirc. [...]

Not quite, unless I *really* screwed up in my implementation :S I actually use another array to keep track of all the 'nodes.' I want to say it's a bit like a handle manager or 'lookup table' for the different blocks, but I fear that doesn't really do it justice. Hopefully it might be a bit clearer w/ some context.

Regarding the zeroing destructor: Wow. How did I miss that? >_< My only guess is that I was tired or something, as I really have no idea what the logic would have been behind that. Maybe like 'we need to reset the array and make everything default' or something like that.

Regarding the delete[]: Hm. I was under the impression that it would get handled correctly. Did you mean call delete[][] or something? The accessor that utilizes a similar syntax works without error AFAIK, so that seems logical.

Templates: The idea had been running through my head beforehand, and it does make sense. I'll probably go ahead and do it. Although I guess you might want to see my earlier point regarding that 'vector of pointers' thing. It's basically what I have going right now, just a little more lightweight.
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.
Quote:Original post by iMalc
I'm cringing at the sight of an XORSwap btw. Much better to use std::swap.


Me too. And I just looked at the assembly produced by std::swap and the xor method.

The xor method had to move (copy) x and y from memory: three moves in all, plus three xors, two of which had one operand in memory and were thus slower.

The std::swap method loads x and y from memory and writes y and x back: four move instructions. std::swap turned out more efficient because x and y were not already loaded from memory, and I bet the compiler would be smart enough to use the xor method if they were.

The whole point of the standard library is to do things like this better than you would because you don't know any better. (Well, and also to standardize things that are difficult to make portable.) The fact is, your compiler produces the fastest possible code when you let it, so you should.

I tested this with GCC, but I'd doubt MSVC would do any worse.

As for memcpy, I showed in another thread that the standard one is best. (Although, if there's a reason you're not using it, I would like to know.)

Also: So, is this the container class of your implementation of what you saw or something? I didn't get that, but neat.

And you wouldn't need locationAssist to be mutable if you defined it in operator[] as static. That way, your code couldn't be safely called from two threads (until thread_local storage becomes standard), but unless you're doing multi-threading, you needn't mind.
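For example, just a sketch of that idea, using the members you already have:

	// A static local replaces the mutable member; the shared static is exactly
	// what makes it non-thread-safe, as noted above.
	GMForceInline T& operator[](const size_t &index)
	{
		static ldiv_t locationAssist;
		locationAssist = ldiv((long)index, (long)blockWidth);
		return realArray[locationAssist.quot][locationAssist.rem];
	}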

You define blockWidth as const... I'd have figured that'd give a compile error, but if it doesn't, fine.


Going back to what you ACTUALLY came here for,
You want an efficient way to append? Well...I guess I don't understand the problem. (*this)[++numBlocks] = new T[blockWidth]; wouldn't work? Or are you looking for a dynamic number of dynamically allocated blocks? If you are, I would suggest having an std::vector<T[blockWidth]> or something.

[Edited by - Splinter of Chaos on December 6, 2008 11:57:44 PM]
Quote:Original post by InvalidPointer
Regarding the delete[]: Hm. I was under the impression that it would get handled correctly. Did you mean call delete[][] or something? The accessor that utilizes a similar syntax works without error AFAIK, so that seems logical.

The destructor would need to look like this:
	~BlockArray()
	{
		for(size_t iter=0; iter<numBlocks; iter++)
		{
			delete[] realArray[iter];
		}
		delete[] realArray;
	}


By changing this into a vector (length n/k) of circular buffers (length k), you can create a container with O(1) random access, iteration, push/pop from either end, plus amortised O(n/k+k) insertion & removal from the middle!
To insert an item somewhere in the middle, you adjust the circular-buffer pointers on some of the internal arrays and copy only one item from each circular buffer into its neighbour to shift items over. You'd want to keep the size of the circular buffers at approximately sqrt(n), but only ever change it by a factor of two, and only when n/k or k drifts far enough away from sqrt(n).
It's probably about as close as you can come to the holy grail of O(1) time for all the above operations. The iterator invalidation would be funky though.

Come to think about it, perhaps the std::deque does exactly this already...
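A rough sketch of that layout, showing only random access and push_back (middle insertion would rotate the per-block heads and move one element per block, and keeping BlockSize near sqrt(n) is left out for brevity; the class name is made up):

#include <cstddef>
#include <vector>

template<class T, size_t BlockSize>
class RingBlockDeque
{
public:
	RingBlockDeque() : count(0) {}

	size_t size() const { return count; }

	T& operator[](size_t i)	// O(1) random access
	{
		size_t block = i / BlockSize;
		size_t slot  = (heads[block] + (i % BlockSize)) % BlockSize;
		return blocks[block][slot];
	}

	void push_back(const T &value)
	{
		if(count % BlockSize == 0)	// last block is full (or no blocks yet)
		{
			blocks.push_back(std::vector<T>(BlockSize));
			heads.push_back(0);
		}
		(*this)[count] = value;		// reuse the index math above
		++count;
	}

private:
	std::vector< std::vector<T> > blocks;	// each inner vector is one circular buffer
	std::vector<size_t> heads;				// head offset of each circular buffer
	size_t count;							// total number of stored elements
};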
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Quote:Original post by Splinter of Chaos
Me too. And I just looked at the assembly produced by std::swap and the xor method. [...] You want an efficient way to append? Well...I guess I don't understand the problem. (*this)[++numBlocks] = new T[blockWidth]; wouldn't work? Or are you looking for a dynamic number of dynamically allocated blocks? If you are, I would suggest having an std::vector<T[blockWidth]> or something.

I think the XOR swap is another one of those times where I'm not exactly sure of my logic in including it. It's never actually used in code, but again thanks for the usage tip.

Memcpy: Agner made a fairly small utility library that includes some faster implementations of standard C library functions, available at his site. I have it set so that the class can be compiled to use them, as per the set of #defines. I'm not sure if that's what you're asking, though. The speedup is actually fairly decent over MSVC for unaligned copies, taking about .51 cycles/byte less (0.63 vs. 0.12) than Microsoft's version. YMMV depending on CPU architecture and compiler, although it's pretty much guaranteed to be one of the fastest options for any given set of the above. I think Intel's the only group that managed to beat the unaligned clocks, and even then by only .01 cycles/byte.

Vector stuff: It's dynamic. I'd like to tack a new set of items onto the end of the list by modifying the 'main' table. At the same time, I'm rolling my own as I really have neither need nor want for the iterator functionality and stuff like that. I'd prefer to just skip the overhead outright, if possible. Hence I'm making my own, trimmed-down version.

Implementation: It isn't a perfect 'here's-the-standard, here's-the-implementation' style, but I tried to expand upon the concepts that were mentioned and take them a step further. Here the CPU can actually try to snap up all the pointers in one go/line and theoretically avoid further cache fun. It may work, it may not, but you learn something in the process. Cache-oblivious structures and algorithms also strike me as intriguing, and I'll probably add a library for this sort of thing to my engine once this gets done.

BTW, multithreading *is* a must for me. Although TLS is non-standard, I have a set of macros for doing exactly that sort of thing, using a couple of tricks I learned from looking at what Agner did for data alignment. Simple, yet quite clever.

EDIT:
Quote:Original post by iMalc
The destructor would need to look like this: *** Source Snippet Removed ***

By changing this into a vector (length n/k) of circular buffers (length k), you can create a container with O(1) random access, iteration, push/pop from either end, plus amortised O(n/k+k) insertion & removal from the middle! [...] Come to think about it, perhaps the std::deque does exactly this already...

Ring buffers are always cool! :) As mentioned, iterators are a little too high-level for this particular thing, as all I really want is storage, but it's certainly worth looking into, especially if we have a holy-grail contender on our hands. Also: Why bother with std::vector or any of that if deque is so damn superior?
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.
Quote:Original post by InvalidPointer
I think the XOR swap is another one of those times where I'm not exactly sure of my logic in including it. It's never actually used in code, but again thanks for the usage tip.

Yeah, I think the xor trick is from a time when std::swap wasn't around or compilers didn't optimize it. It seems, nowadays, that calling a simple function like swap or memcpy hints to the compiler what to do rather than actually calling it.

Quote:Memcpy: Agner made a fairly small utility library that includes some faster implementations of standard C library functions, available at his site. I have it set so that the class can be compiled to use them, as per the set of #defines. I'm not sure if that's what you're asking, though. The speedup is actually fairly decent over MSVC for unaligned copies, taking about .51 cycles/byte less (0.63 vs. 0.12) than Microsoft's version. YMMV depending on CPU architecture and compiler, although it's pretty much guaranteed to be one of the fastest options for any given set of the above.

Hmm, this does intrigue me. I think I'll check the library out. Thanks!

Quote:Vector stuff: It's dynamic. I'd like to tack a new set of items onto the end of the list by modifying the 'main' table. At the same time, I'm rolling my own as I really have neither need nor want for the iterator functionality and stuff like that. I'd prefer to just skip the overhead outright, if possible. Hence I'm making my own, trimmed-down version.

You don't get the iterators if you don't use them. Compilers only instantiate a templated class when you actually use it (e.g. name vector<T>::iterator or create an instance), and member functions are likewise only instantiated if you call them. You don't have any iterator support unless you ask for it. Besides, iterators are really lightweight. The code might look complicated (look up std::advance), but the code produced isn't (x += 5).

But, you don't have to use the STL if you don't want to. You could duplicate the functionality by duplicating the code. What they do is allocate a new array of the old size plus however much extra space they want (usually more than they immediately need, so they have to do this less often), copy everything across, update the pointer to the array, and then delete[] the original. It's really not a bad way to do it. It's best if you can reserve enough space up front, and note that any pointers into the old array become invalid, but when you want it dynamic, that's how it goes.
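In terms of your BlockArray, that growth strategy might look something like this sketch (blockCapacity is a made-up extra member tracking how many T* slots the table currently holds; it would start out equal to numBlocks):

	GMForceInline bool AppendNewBlock()
	{
		if(numBlocks == blockCapacity)	// pointer table is full: double it
		{
			size_t newCapacity = blockCapacity ? blockCapacity * 2 : 1;
			T** newTable = new T*[newCapacity];
			memcpy( newTable, realArray, sizeof(T*) * numBlocks );	// <cstring>, or your MEMCPY macro
			delete[] realArray;			// frees only the old table, not the blocks
			realArray = newTable;
			blockCapacity = newCapacity;
		}
		realArray[numBlocks++] = new T[blockWidth];	// hang the new block off the end
		numElements += blockWidth;
		return true;
	}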

Quote:Ring buffers are always cool! :) As mentioned, iterators are a little too high-level for this particular thing, as all I really want is storage, but it's certainly worth looking into, especially if we have a holy-grail contender on our hands.

Just to reiterate (pun...I'll go with intended), you won't have to deal with them if you don't want to, and they aren't as much of a hassle as you let on.

[Edited by - Splinter of Chaos on December 7, 2008 7:17:52 PM]
Quote:Original post by Splinter of Chaos
Just to reiterate (pun...I'll go with intended), you won't have to deal with them if you don't want to, and they aren't as much of a hassle as you let on.

Zing! I LOL'd.

Anyway, I was wondering *how* to go about doing that copy. The actual process was exactly what I had in mind (see the OP), but I was wondering if there was a better way to do the copy than some for loopage. The copy op itself is fairly straightforward, but can I just do MEMCPY(newArray, realArray, sizeof(T*) * ++numElements)?

That also strikes me as being thread-unsafe, but that's a whole different, erm, topic.
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.
I don't see anything wrong with that, but as the makers of the STL discovered, you'll reduce the number of copies required if you allocate more space than you presently need.

But, you'll need to do a little more thinkin' since you're allocating per block, not per element. Yeah, even the code I showed was wrong; I should have thought it through...

BTW: I think the term block is being misused, but I'm not sure on that. Wouldn't a block be the 2D array, not the row?

And one thing I didn't think of until now: why not store ALL the data in a one dimensional array that acts like a two dimensional array? Instead of making a "realArray" of T*s, then initializing those with their own blocks, make ONE block of size number of rows * number of columns? Since you're going for speed, that would create one less level of indirection.
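Something like this hypothetical sketch, where the second pointer hop is replaced by index math (FlatBlockArray is a made-up name):

template<class T>
class FlatBlockArray
{
public:
	FlatBlockArray(size_t blockWidth, size_t numBlocks)
		: width(blockWidth), data(new T[blockWidth * numBlocks]) {}

	~FlatBlockArray() { delete[] data; }

	T& operator[](size_t index) { return data[index]; }	// flat access, no second dereference

	T& at(size_t block, size_t offset) { return data[block * width + offset]; }	// "2D" access

	// The catch: growing this means reallocating and copying every element,
	// not just a small table of block pointers.

private:
	size_t width;	// elements per logical block (row)
	T* data;		// one contiguous allocation for everything
};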
Quote:Original post by Splinter of Chaos
[...] Why not store ALL the data in a one dimensional array that acts like a two dimensional array? Instead of making a "realArray" of T*s, then initializing those with their own blocks, make ONE block of size number of rows * number of columns? Since you're going for speed, that would create one less level of indirection.

Because then I lose a lot of the niceties of being able to allocate 'more space than I need' in a very elegant way. The 'blocks' correspond to the number of elements you actually allocate per allocation operation. In addition, the elements within a block are guaranteed to be physically next to each other in memory, which is something of a tradeoff: allocation is a bit faster, but you burn a few extra computation cycles in order to get full random access to elements in the array. I also came up with the idea to "pull out" entire rows of data so you can do lots of operations on a small data subset while only incurring the overhead associated with indexing into a regular array.
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.

