A quick way to determine if 8 small integers are all equal to each other

Started by
15 comments, last by Wyrframe 7 years, 6 months ago

Given eight integers, what is the fastest way to check if they are all the same or not?

Maybe, some clever bit manipulation?

(This is used in a voxel engine - integers are material IDs: if a cell has different material IDs at corners, then it must intersect a surface.)

I've implemented this simple solution, but it only works with 8-bit values, and 255 materials (0 == empty air) won't be enough:


/// returns true if all eight bytes are equal to each other
template< typename T >
bool AreAllTheSame8( const T (&_array)[8] )
{
    const T v0 = _array[ 0 ];
    const T v1 = _array[ 1 ];
    const T v2 = _array[ 2 ];
    const T v3 = _array[ 3 ];
    const T v4 = _array[ 4 ];
    const T v5 = _array[ 5 ];
    const T v6 = _array[ 6 ];
    const T v7 = _array[ 7 ];

    static_assert(sizeof(T) == sizeof(uint8_t), "this version only works with 8-bit values");
    uint8_t    counts[256];
    mxZERO_OUT(counts);

    counts[ v0 ]++;
    counts[ v1 ]++;
    counts[ v2 ]++;
    counts[ v3 ]++;
    counts[ v4 ]++;
    counts[ v5 ]++;
    counts[ v6 ]++;
    counts[ v7 ]++;
    return counts[ v0 ] == 8;
}

In reality, the zero materials are not 'counted' (I can explain why if you're curious), and the above solution results in this ugly and inefficient code:


/// returns true if all eight materials are equal to each other
bool IsCellHomogeneous( const MaterialID (&_voxels)[8] )
{
    const MaterialID v0 = _voxels[ 0 ];
    const MaterialID v1 = _voxels[ 1 ];
    const MaterialID v2 = _voxels[ 2 ];
    const MaterialID v3 = _voxels[ 3 ];
    const MaterialID v4 = _voxels[ 4 ];
    const MaterialID v5 = _voxels[ 5 ];
    const MaterialID v6 = _voxels[ 6 ];
    const MaterialID v7 = _voxels[ 7 ];

    U8    counts[256];
    mxZERO_OUT(counts);

    UINT numSolidMaterials = 0;
    UINT lastSolidMaterial = 0;

    // zero material IDs correspond to empty space/air
    if( v0 ) {
        counts[ v0 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v0;
    }
    if( v1 ) {
        counts[ v1 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v1;
    }
    if( v2 ) {
        counts[ v2 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v2;
    }
    if( v3 ) {
        counts[ v3 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v3;
    }
    if( v4 ) {
        counts[ v4 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v4;
    }
    if( v5 ) {
        counts[ v5 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v5;
    }
    if( v6 ) {
        counts[ v6 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v6;
    }
    if( v7 ) {
        counts[ v7 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v7;
    }
    return counts[ lastSolidMaterial ] == numSolidMaterials;
}

Wrote this in N++ without testing, so I'm not sure it works correctly.


bool IsCellHomogeneous( const MaterialID (&_voxels)[8] ) {
	size_t idx = 0;
	
	// skip all 0
	for(; idx < 8 && _voxels[idx] == 0; ++idx); // intentionally empty
	
	if(idx == 8)
		return true; // everything is air
	
	const MaterialID expectedVal = _voxels[idx];
	
	for(; idx < 8 && (_voxels[idx] == 0 || _voxels[idx] == expectedVal); ++idx); // intentionally empty
	
	if(idx == 8)
		return true; // everything is same or 0
	
	return false;
}
Got SSE2 (or better SSE3) support?

If all values are the same, then each must also be the same as any one of them picked at random. Thus, comparing an SSE register against another that has all fields initialized to, e.g., the first value must result in an all-ones mask (which you can then extract).

Now, the _mm_set1 family of intrinsics is usually quite poor, so that would annihilate all the benefits, but luckily you don't need it.

You must load the SSE register with the to-be-compared values anyway, and then you can just do a shuffle (SSE2 needs the high/low versions; SSE3 can do it directly) to scatter the value over another register. That should be pretty fast. Then compare, and check whether movemask gives an all-ones pattern.

This is good for 8 values 16 bits in size each, for example.
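A sketch of that idea with SSE2 intrinsics, assuming 16-bit material IDs (the function name and the unaligned load are my own choices, not from the post):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

bool AreAllTheSame8_SSE2( const uint16_t (&a)[8] )
{
    // Load all eight 16-bit values (unaligned load, in case the
    // array isn't 16-byte aligned).
    const __m128i v = _mm_loadu_si128( reinterpret_cast<const __m128i*>( &a ) );

    // Broadcast lane 0 to all eight lanes. SSE2 needs two shuffles:
    // shufflelo fills the low four 16-bit lanes with lane 0, then a
    // 32-bit shuffle copies that low half across the whole register.
    const __m128i lo  = _mm_shufflelo_epi16( v, 0x00 );
    const __m128i all = _mm_shuffle_epi32( lo, 0x00 );

    // Lane-wise compare; movemask packs the byte-wise result into 16 bits,
    // which is all ones exactly when every lane matched.
    const __m128i eq = _mm_cmpeq_epi16( v, all );
    return _mm_movemask_epi8( eq ) == 0xFFFF;
}
```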

Create a table with 256 entries, use 32-bit values in x86, 64-bit in x64; 128-bit with SSE.
The entries of the table would be:


#define MAKE_MASK( x ) (((x) << 24u) | ((x) << 16u) | ((x) << 8u) | (x))

const uint32_t table[256] =
{
    0,
    MAKE_MASK( 1 ),
    MAKE_MASK( 2 ),
//...
    MAKE_MASK( 255 ),
}; 

Then just do a 32/64/128-bit compare:


uint8_t idx = ((uint8_t*)_array)[0];
if( table[idx] == ((uint32_t*)_array)[0] && table[idx] == ((uint32_t*)_array)[1] )
{
   //All 8 values are equal
}

Given eight integers, what is the fastest way to check if they are all the same or not?

Finding "the fastest" is nearly always a waste of time. The out-of-order CPU makes most timings meaningless, and cache efficiency matters more than almost any other factor.

This operation is not going to be even a blip on the profiler. It will be the accidental triple-deep iteration through your entire object hierarchy that causes concern, not a comparison of eight items.

Code is untested because it is late at night on a weekend, but should be clear enough. Fastest for the programmer is going to be either this:


template< typename T >
bool AreAllTheSame8( const T (&_array)[8] ) {
 for(int i=1; i<8; i++) {
   if(_array[0] != _array[i]) return false;
 }
 return true;
}

Or this:


template< typename T >
bool AreAllTheSame8( const T (&_array)[8] ) {
return _array[0] == _array[1]
    && _array[0] == _array[2]
    && _array[0] == _array[3]
    && _array[0] == _array[4]
    && _array[0] == _array[5]
    && _array[0] == _array[6]
    && _array[0] == _array[7];
}

It is possible that there is some minor rearranging that will yield a slight improvement. Pairwise (01,23,45,67,02,46,04) or similar. But really this type of traveling along an array is the absolute best case for the cache, and the optimizer knows how to handle it.

No trickery is necessary in 2016. Anything the optimizer can do it will find on its own, and the CPU will be able to zip through that in less than a nanosecond unless there happens to be a cache miss at the beginning.

For some "small integers" (like unsigned 8-bit integers), you can read them all in as a single large integer and then do

  return my_i64 == (((my_i64) & 0xffull) * 0x0101010101010101ull);
That's a special case if you ignore the templated type of the function. For 8-bit values and even 16-bit values there are some bitwise hacks, but they don't really buy anything on a modern processor, and can even cause pipeline stalls if you aren't careful. Introducing data dependencies slows things down a tiny bit.
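For what it's worth, the 8-bit version of that trick can be written without aliasing or alignment worries by loading the bytes through memcpy, which compilers fold into a single 64-bit load anyway (a sketch; the function name is mine):

```cpp
#include <cstdint>
#include <cstring>

bool AreAllTheSame8_U8( const uint8_t (&a)[8] )
{
    uint64_t v;
    std::memcpy( &v, a, sizeof v );  // strict-aliasing-safe 64-bit load

    // Replicate the low byte across all eight byte lanes and compare:
    // the two sides match only if every byte equals the first one.
    return v == ( ( v & 0xffull ) * 0x0101010101010101ull );
}
```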

Again, this is something so small it isn't a performance concern. Given the way current processors work it is quite literally two cycles through the decoder -- each CPU can decode up to 4 instructions per cycle -- and whatever cost to get the data from cache.

So, I take it you want to avoid branch mispredictions messing things up. As long as you only do an assignment in the condition block, the compiler (assuming a release build) will be able to do it without branches.

I just tried the following with GCC (you would remove the noinline in production code obviously)


template< typename T >
bool __attribute__ ((noinline)) AreAllTheSame8( const T (&array)[8] ) {
    bool same = true;
    if (array[0]!=array[1]) same = false;
    if (array[0]!=array[2]) same = false;
    if (array[0]!=array[3]) same = false;
    if (array[0]!=array[4]) same = false;
    if (array[0]!=array[5]) same = false;
    if (array[0]!=array[6]) same = false;
    if (array[0]!=array[7]) same = false;
    return same;
}

And it produced the following function with gcc -O3 -S for an int array


__Z14AreAllTheSame8IjEbRA8_KT_:         ## @_Z14AreAllTheSame8IjEbRA8_KT_
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp4:
	.cfi_def_cfa_offset 16
Ltmp5:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp6:
	.cfi_def_cfa_register %rbp
	movl	(%rdi), %eax
	cmpl	4(%rdi), %eax
	sete	%sil
	cmpl	8(%rdi), %eax
	sete	%dl
	cmpl	12(%rdi), %eax
	sete	%cl
	andb	%dl, %cl
	cmpl	16(%rdi), %eax
	sete	%dl
	andb	%cl, %dl
	cmpl	20(%rdi), %eax
	sete	%cl
	andb	%dl, %cl
	cmpl	24(%rdi), %eax
	sete	%dl
	andb	%cl, %dl
	cmpl	28(%rdi), %eax
	sete	%al
	andb	%dl, %al
	andb	%sil, %al
	popq	%rbp
	retq
	.cfi_endproc

For 8-bit and 16-bit integers it generated an additional movzbl or movzwl per array entry, respectively, which you could try to optimize manually by building a wider comparison value (and why not for 32-bit values too, if compiling for a 64-bit platform). Using the idea from Álvaro's post, we get this, which reduces the function to only a handful of instructions:


template<>
bool __attribute__ ((noinline)) AreAllTheSame8( const uint8_t (&array)[8] ) {
    uint64_t v = *reinterpret_cast<const uint64_t*>(&array);
    return v == ((v & 0xffull) * 0x0101010101010101ull);
}

__Z14AreAllTheSame8IhEbRA8_KT_:         ## @_Z14AreAllTheSame8IhEbRA8_KT_
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp0:
	.cfi_def_cfa_offset 16
Ltmp1:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp2:
	.cfi_def_cfa_register %rbp
	movq	(%rdi), %rax
	movzbl	%al, %ecx
	movabsq	$72340172838076673, %rdx ## imm = 0x101010101010101
	imulq	%rcx, %rdx
	cmpq	%rdx, %rax
	sete	%al
	popq	%rbp
	retq
	.cfi_endproc

As you can see, no branches, no need for silly trickery. Just avoid the short-circuiting conditional && and || and the compiler will probably do fine.

EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.

As you can see, no branches, no need for silly trickery. Just avoid the short-circuiting conditional && and || and the compiler will probably do fine.

Does the shortcircuiting matter here actually? I imagine the compiler should be able to optimize out cases where it notices that operations can be done in either order and it not affect the outcome.

EDIT: did some quick tests myself and... well, GCC does indeed fail to optimize that out. Whoops. Also pretty disappointed overall this time; GCC really didn't try to do anything clever regardless of what I threw at it. I guess there isn't much else to be done about it (assuming the code is already cache-friendly). Then again, I recall reading that CPUs now try to keep track of where recent branches tend to go? On code that runs in a tight loop, that should already help reduce branch mispredictions after a few iterations.


EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.


Yes, memory alignment is an issue, and getting reinterpret_cast or unions in the mix is almost certainly going to invoke undefined behavior. But if you often need to read 8 8-bit integers in your voxel engine, perhaps you should consider packing the 8 values in a 64-bit integer anyway, so you can best use your hardware.
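A minimal sketch of that packing idea, assuming 8-bit material IDs (the struct and member names are illustrative, not from the thread): store the eight corner IDs of a cell directly in one uint64_t, so the homogeneity test needs no loads or casts at all.

```cpp
#include <cstdint>

struct PackedCell
{
    uint64_t corners;  // byte i holds the material ID of corner i

    bool IsHomogeneous() const
    {
        // Same multiply trick: replicate the low byte across all
        // eight byte lanes and compare against the original.
        return corners == ( ( corners & 0xffull ) * 0x0101010101010101ull );
    }

    uint8_t Corner( unsigned i ) const
    {
        return uint8_t( corners >> ( 8u * i ) );
    }
};
```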


That's a special case if you ignore the templated type of the function. For 8-bit values and even 16-bit values there are some bitwise hacks, but they don't really buy anything on a modern processor, and can even cause pipeline stalls if you aren't careful. Introducing data dependencies slows things down a tiny bit. Again, this is something so small it isn't a performance concern. Given the way current processors work it is quite literally two cycles through the decoder -- each CPU can decode up to 4 instructions per cycle -- and whatever cost to get the data from cache.


I happen to have quite a bit of experience in chess programming, and I can assure you that 64-bit trickery is very fast. Chess is a perfect example of a situation where using 64-bit integers to make good use of your processor is a good idea (see bitboards). This is also the case for checkers, Reversi and Four in a Row. If you are writing an engine for a board game, you can expect this is what your processor is going to be doing all day long, so performance does matter, and being able to use parallelization at the level of the ALU is huge.

The OP explicitly wanted to know if there is "some clever bit manipulation". So yes, there is. Whether it's worth doing in his program or not is impossible for me to determine.

