A quick way to determine if 8 small integers are all equal to each other



Given eight integers, what is the fastest way to check if they are all the same or not?

Maybe, some clever bit manipulation?

(This is used in a voxel engine - integers are material IDs: if a cell has different material IDs at corners, then it must intersect a surface.)

I've implemented this simple solution, but it only works with 8-bit values, and 255 materials (0 == empty air) won't be enough:

/// returns true if all eight bytes are equal to each other
template< typename T >
bool AreAllTheSame8( const T (&_array)[8] )
{
    static_assert(sizeof(T) == sizeof(uint8_t), "the histogram below only handles byte-sized IDs");

    const T v0 = _array[ 0 ];
    const T v1 = _array[ 1 ];
    const T v2 = _array[ 2 ];
    const T v3 = _array[ 3 ];
    const T v4 = _array[ 4 ];
    const T v5 = _array[ 5 ];
    const T v6 = _array[ 6 ];
    const T v7 = _array[ 7 ];

    uint8_t counts[256];
    mxZERO_OUT(counts);

    counts[ v0 ]++;
    counts[ v1 ]++;
    counts[ v2 ]++;
    counts[ v3 ]++;
    counts[ v4 ]++;
    counts[ v5 ]++;
    counts[ v6 ]++;
    counts[ v7 ]++;
    return counts[ v0 ] == 8;
}


In reality, the zero materials are not 'counted' (I can explain why if you're curious), and the above solution results in this ugly and inefficient code:

/// returns true if all eight materials are equal to each other
bool IsCellHomogeneous( const MaterialID (&_voxels)[8] )
{
    const MaterialID v0 = _voxels[ 0 ];
    const MaterialID v1 = _voxels[ 1 ];
    const MaterialID v2 = _voxels[ 2 ];
    const MaterialID v3 = _voxels[ 3 ];
    const MaterialID v4 = _voxels[ 4 ];
    const MaterialID v5 = _voxels[ 5 ];
    const MaterialID v6 = _voxels[ 6 ];
    const MaterialID v7 = _voxels[ 7 ];

    U8 counts[256];
    mxZERO_OUT(counts);

    UINT numSolidMaterials = 0;
    UINT lastSolidMaterial = 0;

    // zero material IDs correspond to empty space/air
    if( v0 ) {
        counts[ v0 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v0;
    }
    if( v1 ) {
        counts[ v1 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v1;
    }
    if( v2 ) {
        counts[ v2 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v2;
    }
    if( v3 ) {
        counts[ v3 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v3;
    }
    if( v4 ) {
        counts[ v4 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v4;
    }
    if( v5 ) {
        counts[ v5 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v5;
    }
    if( v6 ) {
        counts[ v6 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v6;
    }
    if( v7 ) {
        counts[ v7 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v7;
    }
    return counts[ lastSolidMaterial ] == numSolidMaterials;
}



Wrote this in N++ without testing, so I'm not sure if it works correctly.

bool IsCellHomogeneous( const MaterialID (&_voxels)[8] ) {
    size_t idx = 0;

    // skip all 0
    for(; idx < 8 && _voxels[idx] == 0; ++idx); // intentionally empty

    if(idx == 8)
        return true; // everything is air

    const MaterialID expectedVal = _voxels[idx];

    for(; idx < 8 && (_voxels[idx] == 0 || _voxels[idx] == expectedVal); ++idx); // intentionally empty

    if(idx == 8)
        return true; // everything is same or 0

    return false;
}

Edited by Zaoshi Kaba

That's a special case if you ignore the templated type of the function. For 8-bit values and even 16-bit values there are some bitwise hacks, but they don't really buy anything on a modern processor, and can even cause pipeline stalls if you aren't careful. Introducing data dependencies slows things down a tiny bit.

Again, this is something so small it isn't a performance concern. Given the way current processors work it is quite literally two cycles through the decoder -- each core can decode up to 4 instructions per cycle -- plus whatever cost to get the data from cache.


So, I take it you want to avoid branch mispredictions messing things up. As long as you only do an assignment in the conditional block, the compiler (assuming a release build) will be able to do it without branches.

I just tried the following with GCC (you would remove the noinline in production code obviously)

template< typename T >
bool __attribute__ ((noinline)) AreAllTheSame8( const T (&array)[8] ) {
    bool same = true;
    if (array[0]!=array[1]) same = false;
    if (array[0]!=array[2]) same = false;
    if (array[0]!=array[3]) same = false;
    if (array[0]!=array[4]) same = false;
    if (array[0]!=array[5]) same = false;
    if (array[0]!=array[6]) same = false;
    if (array[0]!=array[7]) same = false;
    return same;
}


And it produced the following function with gcc -O3 -S for an int array

__Z14AreAllTheSame8IjEbRA8_KT_:         ## @_Z14AreAllTheSame8IjEbRA8_KT_
.cfi_startproc
## BB#0:
pushq	%rbp
Ltmp4:
.cfi_def_cfa_offset 16
Ltmp5:
.cfi_offset %rbp, -16
movq	%rsp, %rbp
Ltmp6:
.cfi_def_cfa_register %rbp
movl	(%rdi), %eax
cmpl	4(%rdi), %eax
sete	%sil
cmpl	8(%rdi), %eax
sete	%dl
cmpl	12(%rdi), %eax
sete	%cl
andb	%dl, %cl
cmpl	16(%rdi), %eax
sete	%dl
andb	%cl, %dl
cmpl	20(%rdi), %eax
sete	%cl
andb	%dl, %cl
cmpl	24(%rdi), %eax
sete	%dl
andb	%cl, %dl
cmpl	28(%rdi), %eax
sete	%al
andb	%dl, %al
andb	%sil, %al
popq	%rbp
retq
.cfi_endproc


For 8bit and 16bit integers it generated an additional movzbl or movzwl per array entry respectively, which you could manually try to optimize by building a wider comparison value (well, why not for 32bit values too if compiling for a 64bit platform). Using the idea from Álvaro's post we get this, which reduces the function down to only a handful of instructions:

template<>
bool __attribute__ ((noinline)) AreAllTheSame8( const uint8_t (&array)[8] ) {
    uint64_t v = *reinterpret_cast<const uint64_t*>(&array);
    return v == ((v & 0xffull) * 0x0101010101010101ull);
}

__Z14AreAllTheSame8IhEbRA8_KT_:         ## @_Z14AreAllTheSame8IhEbRA8_KT_
.cfi_startproc
## BB#0:
pushq	%rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq	%rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
movq	(%rdi), %rax
movzbl	%al, %ecx
movabsq	$72340172838076673, %rdx ## imm = 0x101010101010101
imulq	%rcx, %rdx
cmpq	%rdx, %rax
sete	%al
popq	%rbp
retq
.cfi_endproc


As you can see, no branches, no need for silly trickery. Just avoid the short-circuiting conditional && and || and the compiler will probably do fine.

EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.

Edited by mzbear

Share on other sites
As you can see, no branches, no need for silly trickery. Just avoid the short-circuiting conditional && and || and the compiler will probably do fine.

Does the short-circuiting actually matter here? I imagine the compiler should be able to optimize out cases where it notices that operations can be done in either order without affecting the outcome.

EDIT: did some quick tests myself and... well, GCC does indeed fail to optimize that out. Whoops. Also pretty disappointed overall this time, GCC really didn't try to do anything clever regardless of what I threw at it. I guess there isn't much else to be done about it (assuming the data is already cache-friendly). Then again, I recall reading that modern branch predictors keep track of where recent branches tend to go, so on code that runs in a tight loop that should already reduce mispredictions after a few iterations.

Edited by Sik_the_hedgehog

Share on other sites

EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.

Yes, memory alignment is an issue, and getting reinterpret_cast or unions in the mix is almost certainly going to invoke undefined behavior. But if you often need to read 8 8-bit integers in your voxel engine, perhaps you should consider packing the 8 values in a 64-bit integer anyway, so you can best use your hardware.

That's a special case if you ignore the templated type of the function. For 8-bit values and even 16-bit values there are some bitwise hacks, but they don't really buy anything on a modern processor, and can even cause pipeline stalls if you aren't careful. Introducing data dependencies slows things down a tiny bit. Again, this is something so small it isn't a performance concern. Given the way current processors work it is quite literally two cycles through the decoder -- each CPU can decode up to 4 instructions per cycle -- and whatever cost to get the data from cache.

I happen to have quite a bit of experience in chess programming, and I can assure you that 64-bit trickery is very fast. Chess is a perfect example of a situation where using 64-bit integers to make good use of your processor is a good idea (see bitboards). This is also the case for checkers, Reversi and Four in a Row. If you are writing an engine for a board game, you can expect this is what your processor is going to be doing all day long, so performance does matter, and being able to use parallelization at the level of the ALU is huge.

The OP explicitly wanted to know if there is "some clever bit manipulation". So yes, there is. Whether it's worth doing in his program or not is impossible for me to determine.
