# A quick way to determine if 8 small integers are all equal to each other


## Recommended Posts

Given eight integers, what is the fastest way to check if they are all the same or not?

Maybe, some clever bit manipulation?

(This is used in a voxel engine - integers are material IDs: if a cell has different material IDs at corners, then it must intersect a surface.)
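For reference, the straightforward baseline is just to compare every corner against the first one. This is a sketch; `AreAllTheSame8_Naive` and the `MaterialID` alias are made-up names, assuming material IDs are small unsigned integers:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// Assumed: MaterialID is some small unsigned integer type.
using MaterialID = std::uint16_t;

// Baseline: true if every corner equals the first one.
bool AreAllTheSame8_Naive( const MaterialID (&a)[8] )
{
    return std::all_of( std::begin(a) + 1, std::end(a),
                        [&a]( MaterialID v ){ return v == a[0]; } );
}
```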

I've implemented this simple solution, but it only works with 8-bit values, and 255 materials (0 == empty air) won't be enough:

/// returns true if all eight bytes are equal to each other
template< typename T >
bool AreAllTheSame8( const T (&_array)[8] )
{
    const T v0 = _array[ 0 ];
    const T v1 = _array[ 1 ];
    const T v2 = _array[ 2 ];
    const T v3 = _array[ 3 ];
    const T v4 = _array[ 4 ];
    const T v5 = _array[ 5 ];
    const T v6 = _array[ 6 ];
    const T v7 = _array[ 7 ];

    static_assert(sizeof(T) == sizeof(uint8_t));
    uint8_t counts[256];
    mxZERO_OUT(counts);

    counts[ v0 ]++;
    counts[ v1 ]++;
    counts[ v2 ]++;
    counts[ v3 ]++;
    counts[ v4 ]++;
    counts[ v5 ]++;
    counts[ v6 ]++;
    counts[ v7 ]++;
    return counts[ v0 ] == 8;
}


In reality, the zero materials are not 'counted' (I can explain why if you're curious), and the above solution results in this ugly and inefficient code:

/// returns true if all eight materials are equal to each other
bool IsCellHomogeneous( const MaterialID (&_voxels)[8] )
{
    const MaterialID v0 = _voxels[ 0 ];
    const MaterialID v1 = _voxels[ 1 ];
    const MaterialID v2 = _voxels[ 2 ];
    const MaterialID v3 = _voxels[ 3 ];
    const MaterialID v4 = _voxels[ 4 ];
    const MaterialID v5 = _voxels[ 5 ];
    const MaterialID v6 = _voxels[ 6 ];
    const MaterialID v7 = _voxels[ 7 ];

    U8 counts[256];
    mxZERO_OUT(counts);

    UINT numSolidMaterials = 0;
    UINT lastSolidMaterial = 0;

    // zero material IDs correspond to empty space/air
    if( v0 ) {
        counts[ v0 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v0;
    }
    if( v1 ) {
        counts[ v1 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v1;
    }
    if( v2 ) {
        counts[ v2 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v2;
    }
    if( v3 ) {
        counts[ v3 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v3;
    }
    if( v4 ) {
        counts[ v4 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v4;
    }
    if( v5 ) {
        counts[ v5 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v5;
    }
    if( v6 ) {
        counts[ v6 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v6;
    }
    if( v7 ) {
        counts[ v7 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v7;
    }
    return counts[ lastSolidMaterial ] == numSolidMaterials;
}


##### Share on other sites

Wrote this in N++ without testing, so I'm not sure if it works correctly.

bool IsCellHomogeneous( const MaterialID (&_voxels)[8] ) {
    size_t idx = 0;

    // skip all leading zeros (air)
    for(; idx < 8 && _voxels[idx] == 0; ++idx); // intentionally empty

    if(idx == 8)
        return true; // everything is air

    const MaterialID expectedVal = _voxels[idx];

    for(; idx < 8 && (_voxels[idx] == 0 || _voxels[idx] == expectedVal); ++idx); // intentionally empty

    if(idx == 8)
        return true; // everything is the same or 0

    return false;
}

Edited by Zaoshi Kaba

##### Share on other sites
That's a special case if you ignore the templated type of the function. For 8-bit values and even 16-bit values there are some bitwise hacks, but they don't really buy anything on a modern processor, and can even cause pipeline stalls if you aren't careful. Introducing data dependencies slows things down a tiny bit.

Again, this is something so small it isn't a performance concern. Given the way current processors work, it is quite literally two cycles through the decoder -- most current CPUs can decode up to 4 instructions per cycle -- plus whatever cost to get the data from cache.

##### Share on other sites

So, I take it you want to avoid branch mispredictions messing up things. As long as you only do an assignment in the condition block, the compiler (assuming a release build) will be able to do it without branches.

I just tried the following with GCC (you would remove the noinline in production code obviously)

template< typename T >
bool __attribute__ ((noinline)) AreAllTheSame8( const T (&array)[8] ) {
    bool same = true;
    if (array[0]!=array[1]) same = false;
    if (array[0]!=array[2]) same = false;
    if (array[0]!=array[3]) same = false;
    if (array[0]!=array[4]) same = false;
    if (array[0]!=array[5]) same = false;
    if (array[0]!=array[6]) same = false;
    if (array[0]!=array[7]) same = false;
    return same;
}


And it produced the following function with gcc -O3 -S for an int array

__Z14AreAllTheSame8IjEbRA8_KT_:         ## @_Z14AreAllTheSame8IjEbRA8_KT_
.cfi_startproc
## BB#0:
pushq	%rbp
Ltmp4:
.cfi_def_cfa_offset 16
Ltmp5:
.cfi_offset %rbp, -16
movq	%rsp, %rbp
Ltmp6:
.cfi_def_cfa_register %rbp
movl	(%rdi), %eax
cmpl	4(%rdi), %eax
sete	%sil
cmpl	8(%rdi), %eax
sete	%dl
cmpl	12(%rdi), %eax
sete	%cl
andb	%dl, %cl
cmpl	16(%rdi), %eax
sete	%dl
andb	%cl, %dl
cmpl	20(%rdi), %eax
sete	%cl
andb	%dl, %cl
cmpl	24(%rdi), %eax
sete	%dl
andb	%cl, %dl
cmpl	28(%rdi), %eax
sete	%al
andb	%dl, %al
andb	%sil, %al
popq	%rbp
retq
.cfi_endproc


For 8-bit and 16-bit integers it generated an additional movzbl or movzwl per array entry, respectively, which you could try to optimize manually by building a wider comparison value (and why not for 32-bit values too, when compiling for a 64-bit platform). Using the idea from Álvaro's post, we get the following, which reduces the function to only a handful of instructions:

template<>
bool __attribute__ ((noinline)) AreAllTheSame8( const uint8_t (&array)[8] ) {
    uint64_t v = *reinterpret_cast<const uint64_t*>(&array);
    return v == ((v & 0xffull) * 0x0101010101010101ull);
}

__Z14AreAllTheSame8IhEbRA8_KT_:         ## @_Z14AreAllTheSame8IhEbRA8_KT_
.cfi_startproc
## BB#0:
pushq	%rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq	%rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
movq	(%rdi), %rax
movzbl	%al, %ecx
movabsq	$72340172838076673, %rdx ## imm = 0x101010101010101
imulq	%rcx, %rdx
cmpq	%rdx, %rax
sete	%al
popq	%rbp
retq
.cfi_endproc


As you can see, no branches, no need for silly trickery. Just avoid the short-circuiting conditional && and || and the compiler will probably do fine.

EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.
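One way to sidestep both the alignment and the aliasing concern is to load the bytes through std::memcpy, which mainstream compilers fold into a single 64-bit load. A sketch; `AreAllTheSame8_Safe` is a made-up name:

```cpp
#include <cstdint>
#include <cstring>

// Same multiply trick as above, but the 64-bit value is loaded with
// std::memcpy instead of reinterpret_cast. This is well-defined C++
// regardless of the array's alignment, and compilers typically emit
// a single movq for the memcpy.
bool AreAllTheSame8_Safe( const uint8_t (&array)[8] )
{
    uint64_t v;
    std::memcpy( &v, array, sizeof v );
    // all 8 bytes equal  <=>  v is the low byte replicated 8 times
    return v == ( (v & 0xffull) * 0x0101010101010101ull );
}
```

Note the test itself is endian-agnostic: whichever end of the array the low byte comes from, replication only succeeds when every byte matches it.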

Edited by mzbear

##### Share on other sites
As you can see, no branches, no need for silly trickery. Just avoid the short-circuiting conditional && and || and the compiler will probably do fine.

Does the shortcircuiting matter here actually? I imagine the compiler should be able to optimize out cases where it notices that operations can be done in either order and it not affect the outcome.

EDIT: did some quick tests myself and... well, GCC does indeed fail to optimize that out. Whoops. Also pretty disappointed overall this time; GCC really didn't try to do anything clever regardless of what I threw at it. I guess there isn't much else to be done about it (assuming the access pattern is already cache-friendly). Then again, I recall reading that CPUs now keep track of where recent branches tend to go; on code running in a tight loop, that should already reduce branch mispredictions after a few iterations.

Edited by Sik_the_hedgehog

##### Share on other sites

EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.

Yes, memory alignment is an issue, and getting reinterpret_cast or unions in the mix is almost certainly going to invoke undefined behavior. But if you often need to read 8 8-bit integers in your voxel engine, perhaps you should consider packing the 8 values in a 64-bit integer anyway, so you can best use your hardware.
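One possible shape for that packed representation, assuming 8-bit material IDs; `PackedCell` and its members are hypothetical names, not anyone's actual engine code:

```cpp
#include <cstdint>

// Hypothetical packed cell: eight 8-bit material IDs in one uint64_t,
// corner i occupying bits [8*i, 8*i+8).
struct PackedCell
{
    uint64_t bits;

    uint8_t Corner( unsigned i ) const
    {
        return uint8_t( bits >> (8 * i) );
    }

    void SetCorner( unsigned i, uint8_t id )
    {
        bits &= ~( 0xffull << (8 * i) );       // clear the old byte
        bits |= uint64_t(id) << (8 * i);       // write the new one
    }

    // the multiply trick, with no casting needed at all
    bool IsHomogeneous() const
    {
        return bits == ( (bits & 0xffull) * 0x0101010101010101ull );
    }
};
```

With this layout the homogeneity test never touches memory beyond the one register-sized load.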

That's a special case if you ignore the templated type of the function. For 8-bit values and even 16-bit values there are some bitwise hacks, but they don't really buy anything on a modern processor, and can even cause pipeline stalls if you aren't careful. Introducing data dependencies slows things down a tiny bit. Again, this is something so small it isn't a performance concern. Given the way current processors work it is quite literally two cycles through the decoder -- each CPU can decode up to 4 instructions per cycle -- and whatever cost to get the data from cache.

I happen to have quite a bit of experience in chess programming, and I can assure you that 64-bit trickery is very fast. Chess is a perfect example of a situation where using 64-bit integers to make good use of your processor is a good idea (see bitboards). This is also the case for checkers, Reversi and Four in a Row. If you are writing an engine for a board game, you can expect this is what your processor is going to be doing all day long, so performance does matter, and being able to use parallelization at the level of the ALU is huge.

The OP explicitly wanted to know if there is "some clever bit manipulation". So yes, there is. Whether it's worth doing in his program or not is impossible for me to determine.

##### Share on other sites

Given eight integers, what is the fastest way to check if they are all the same or not? Maybe, some clever bit manipulation?

Following appeals to me.

template< typename T >
bool EightNumbersEqual( T a, T b, T c, T d, T e, T f, T g, T h )
{
    return (a|b|c|d|e|f|g|h) == a ? true : false;
}

You can expand it to any number of checked values, and I believe it will work for any integer type, since a number is determined by its bit pattern.

##### Share on other sites

return (a|b|c|d|e|f|g|h)==a ? true : false;

Nope. That will lead to all kinds of fun bugs.  Any value can be any value as long as it sets no more bits than were set by a.

If a == 0x01, then values can be either 1 or 0.

If a == 0xff, then any value is allowed in all the other values.

(Extend it to whatever length or size is needed, maybe they're 8-bit chars, maybe they're 512 bit AVX registers, maybe they're some custom bignum class.)

Yep, so true!
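A minimal sketch of the failure mode described above (`OrTrick` is a made-up name for the posted one-liner, restricted to 8-bit values):

```cpp
#include <cstdint>

// The posted OR "equality test": compares the OR of all values to a.
// This only checks that no value sets a bit outside a's bits, which
// is NOT the same as all values being equal.
bool OrTrick( uint8_t a, uint8_t b, uint8_t c, uint8_t d,
              uint8_t e, uint8_t f, uint8_t g, uint8_t h )
{
    return ( a | b | c | d | e | f | g | h ) == a;
}
```

For example, `OrTrick(0xff, 0x0f, 0, 0, 0, 0, 0, 0)` reports true even though the eight values are wildly different, exactly as frob describes.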

##### Share on other sites

EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.

Assuming the array is originally a char*, there is no violation of the strict aliasing rule, as char* is the only data type that can alias anything. Though you are right that we'd have to worry about the compiler blowing up the performance; it may be faster/safer to just keep the value as a uint64_t at all times and use "& 0xFF" to get the table index, due to language shenanigans.

##### Share on other sites

Wow, thanks for so many high-quality answers!!

I'll go with Álvaro's (brilliant) trick for now!

(I needed a fast function, because it's executed for each cell out of 33^3 (in each voxel chunk). If voxels at the cell's corners are different, then a material transition is taking place and the cell must contain a surface vertex.)

##### Share on other sites

template< typename T >
bool EightNumbersEqual( T a, T b, T c, T d, T e, T f, T g, T h )
{
    return (a^b | a^c | a^d | a^e | a^f | a^g | a^h) == 0 ? true : false;
}

Guys, I strongly believe this one works, but my question is: would this be any faster or slower than the single integer multiplication?
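For what it's worth, the XOR fold really is a correct equality test: `a ^ x` is zero exactly when `x == a`, so the OR of the seven XORs is zero exactly when every value equals `a`. A sketch with a made-up name (`XorTrick`); only a profiler can say whether it beats the multiply version, since both compile to a short dependency chain of ALU ops:

```cpp
#include <cstdint>

// OR together a^x for every other value x; the result is 0
// if and only if all eight values are equal.
bool XorTrick( uint32_t a, uint32_t b, uint32_t c, uint32_t d,
               uint32_t e, uint32_t f, uint32_t g, uint32_t h )
{
    return ( (a^b) | (a^c) | (a^d) | (a^e) | (a^f) | (a^g) | (a^h) ) == 0;
}
```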

##### Share on other sites

I needed a fast function, because it's executed for each cell out of 33^3 (in each voxel chunk). If voxels at the cell's corners are different, then a material transition is taking place and the cell must contain a surface vertex.

Profiler or it didn't happen.

##### Share on other sites

My favourite quick-prototyping tool for deciding whether or not a code rewrite has the intended effect on what the compiler produces:

Frob's loop code: https://godbolt.org/g/HEFMYz

Frob's manually unrolled code: https://godbolt.org/g/MZrQRH

JonnyCode's obfuscated XOR trickery: https://godbolt.org/g/3UAbN5

I'll admit that what JonnyCode's written is what you want if you need a constant-time algorithm (which matters in some contexts, to prevent a certain class of cryptographic timing attacks)... but as you can see, the transformation from "do what I mean" code to the less readable "do what I say" code has no noticeable effect on total instruction count or (admittedly eyeballed) clock cycles. In the worst-case scenarios, there are just as many register/ALU round trips, and just as many fetches made.

But as Khatharr says, as will every other developer worth his salt: you don't know where the actual time is being spent in your algorithms until you profile it. Even when you guess correctly, it is only a guess. Analysis of evidence is more valuable than analysis of theory.
