Sign in to follow this  

A quick way to determine if 8 small integers are all equal to each other

This topic is 418 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Given eight integers, what is the fastest way to check if they are all the same or not?

Maybe, some clever bit manipulation?

(This is used in a voxel engine - integers are material IDs: if a cell has different material IDs at corners, then it must intersect a surface.)

 

I've implemented this simple solution, but it only works with 8-bit values, and 255 materials (0 == empty air) won't be enough:

/// returns true if all eight bytes are equal to each other
template< typename T >
bool AreAllTheSame8( const T (&_array)[8] )
{
    const T v0 = _array[ 0 ];
    const T v1 = _array[ 1 ];
    const T v2 = _array[ 2 ];
    const T v3 = _array[ 3 ];
    const T v4 = _array[ 4 ];
    const T v5 = _array[ 5 ];
    const T v6 = _array[ 6 ];
    const T v7 = _array[ 7 ];

    static_assert(sizeof(T) == sizeof(uint8_t));
    uint8_t    counts[256];
    mxZERO_OUT(counts);

    counts[ v0 ]++;
    counts[ v1 ]++;
    counts[ v2 ]++;
    counts[ v3 ]++;
    counts[ v4 ]++;
    counts[ v5 ]++;
    counts[ v6 ]++;
    counts[ v7 ]++;
    return counts[ v0 ] == 8;
}

In reality, the zero materials are not 'counted' (i can explain why if you're curious), and the above solution results in this ugly and inefficient code:

/// returns true if all eight materials are equal to each other
bool IsCellHomogeneous( const MaterialID (&_voxels)[8] )
{
    const MaterialID v0 = _voxels[ 0 ];
    const MaterialID v1 = _voxels[ 1 ];
    const MaterialID v2 = _voxels[ 2 ];
    const MaterialID v3 = _voxels[ 3 ];
    const MaterialID v4 = _voxels[ 4 ];
    const MaterialID v5 = _voxels[ 5 ];
    const MaterialID v6 = _voxels[ 6 ];
    const MaterialID v7 = _voxels[ 7 ];

    U8    counts[256];
    mxZERO_OUT(counts);

    UINT numSolidMaterials = 0;
    UINT lastSolidMaterial = 0;

    // zero material IDs correspond to empty space/air
    if( v0 ) {
        counts[ v0 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v0;
    }
    if( v1 ) {
        counts[ v1 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v1;
    }
    if( v2 ) {
        counts[ v2 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v2;
    }
    if( v3 ) {
        counts[ v3 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v3;
    }
    if( v4 ) {
        counts[ v4 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v4;
    }
    if( v5 ) {
        counts[ v5 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v5;
    }
    if( v6 ) {
        counts[ v6 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v6;
    }
    if( v7 ) {
        counts[ v7 ]++;
        numSolidMaterials++;
        lastSolidMaterial = v7;
    }
    return counts[ lastSolidMaterial ] == numSolidMaterials;
}

Share this post


Link to post
Share on other sites

Wrote in N++ without testing so I'm not sure if it works correctly.

bool IsCellHomogeneous( const MaterialID (&_voxels)[8] ) {
	size_t idx = 0;
	
	// skip all 0
	for(; idx < 8 && _voxels[idx] == 0; ++idx); // intentionally empty
	
	if(idx == 8)
		return true; // everything is air
	
	const MaterialID expectedVal = _voxels[idx];
	
	for(; idx < 8 && (_voxels[idx] == 0 || _voxels[idx] == expectedVal); ++idx); // intentionally empty
	
	if(idx == 8)
		return true; // everything is same or 0
	
	return false;
}
Edited by Zaoshi Kaba

Share this post


Link to post
Share on other sites
That's a special case if you ignore the templated type of the function. For 8-bit values and even 16-bit values there are some bitwise hacks, but they don't really buy anything on a modern processor, and can even cause pipeline stalls if you aren't careful. Introducing data dependencies slows things down a tiny bit.

Again, this is something so small it isn't a performance concern. Given the way current processors work it is quite literally two cycles through the decoder -- each CPU can decode up to 4 instructions per cycle -- and whatever cost to get the data from cache.

Share this post


Link to post
Share on other sites

So, I take it you want to avoid branch mispredictions messing up things. As long as you only do an assignment in the condition block, the compiler (assuming a release build) will be able to do it without branches.

I just tried the following with GCC (you would remove the noinline in production code obviously)
 

template< typename T >
bool __attribute__ ((noinline)) AreAllTheSame8( const T (&array)[8] ) {
    bool same = true;
    if (array[0]!=array[1]) same = false;
    if (array[0]!=array[2]) same = false;
    if (array[0]!=array[3]) same = false;
    if (array[0]!=array[4]) same = false;
    if (array[0]!=array[5]) same = false;
    if (array[0]!=array[6]) same = false;
    if (array[0]!=array[7]) same = false;
    return same;
}

And it produced the following function with gcc -O3 -S for an int array
 

__Z14AreAllTheSame8IjEbRA8_KT_:         ## @_Z14AreAllTheSame8IjEbRA8_KT_
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp4:
	.cfi_def_cfa_offset 16
Ltmp5:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp6:
	.cfi_def_cfa_register %rbp
	movl	(%rdi), %eax
	cmpl	4(%rdi), %eax
	sete	%sil
	cmpl	8(%rdi), %eax
	sete	%dl
	cmpl	12(%rdi), %eax
	sete	%cl
	andb	%dl, %cl
	cmpl	16(%rdi), %eax
	sete	%dl
	andb	%cl, %dl
	cmpl	20(%rdi), %eax
	sete	%cl
	andb	%dl, %cl
	cmpl	24(%rdi), %eax
	sete	%dl
	andb	%cl, %dl
	cmpl	28(%rdi), %eax
	sete	%al
	andb	%dl, %al
	andb	%sil, %al
	popq	%rbp
	retq
	.cfi_endproc

For 8bit and 16bit integers it generated an additional movzwl or movzbl per array entry respectively, which you could manually try to optimize by building a wider comparison value (well, why not for 32bit values too if compiling for 64bit platform). Using the idea from Álvaro's post we get this which reduces the function down to only a handful of instructions:
 

template<>
bool __attribute__ ((noinline)) AreAllTheSame8( const uint8_t (&array)[8] ) {
    uint64_t v = *reinterpret_cast<const uint64_t*>(&array);
    return v == ((v & 0xffull) * 0x0101010101010101ull);
}
__Z14AreAllTheSame8IhEbRA8_KT_:         ## @_Z14AreAllTheSame8IhEbRA8_KT_
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp0:
	.cfi_def_cfa_offset 16
Ltmp1:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp2:
	.cfi_def_cfa_register %rbp
	movq	(%rdi), %rax
	movzbl	%al, %ecx
	movabsq	$72340172838076673, %rdx ## imm = 0x101010101010101
	imulq	%rcx, %rdx
	cmpq	%rdx, %rax
	sete	%al
	popq	%rbp
	retq
	.cfi_endproc

As you can see, no branches, no need for silly trickery. Just avoid the short-circuiting conditional && and || and the compiler will probably do fine.

 

EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.

Edited by mzbear

Share this post


Link to post
Share on other sites
As you can see, no branches, no need for silly trickery. Just avoid the short-circuiting conditional && and || and the compiler will probably do fine.

Does the shortcircuiting matter here actually? I imagine the compiler should be able to optimize out cases where it notices that operations can be done in either order and it not affect the outcome.

 

EDIT: did some quick tests myself and... well, GCC does indeed fail to optimize that out. Whoops. Also pretty disappointed overall this time, GCC really didn't try to do anything clever regardless of what I threw at it. I guess there isn't much else to be done about it (assuming the code is already cache coherent). Then again I recall reading that CPUs now try to keep track of where recent branches tend to go? On code that's run on a tight loop that should already help reduce branch mispredictions after a few iterations.

Edited by Sik_the_hedgehog

Share this post


Link to post
Share on other sites

EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.


Yes, memory alignment is an issue, and getting reinterpret_cast or unions in the mix is almost certainly going to invoke undefined behavior. But if you often need to read 8 8-bit integers in your voxel engine, perhaps you should consider packing the 8 values in a 64-bit integer anyway, so you can best use your hardware.

 

That's a special case if you ignore the templated type of the function. For 8-bit values and even 16-bit values there are some bitwise hacks, but they don't really buy anything on a modern processor, and can even cause pipeline stalls if you aren't careful. Introducing data dependencies slows things down a tiny bit. Again, this is something so small it isn't a performance concern. Given the way current processors work it is quite literally two cycles through the decoder -- each CPU can decode up to 4 instructions per cycle -- and whatever cost to get the data from cache.


I happen to have quite a bit of experience in chess programming, and I can assure you that 64-bit trickery is very fast. Chess is a perfect example of a situation where using 64-bit integers to make good use of your processor is a good idea (see bitboards). This is also the case for checkers, Reversi and Four in a Row. If you are writing an engine for a board game, you can expect this is what your processor is going to be doing all day long, so performance does matter, and being able to use parallelization at the level of the ALU is huge.

The OP explicitly wanted to know if there is "some clever bit manipulation". So yes, there is. Whether it's worth doing in his program or not is impossible for me to determine.

Share this post


Link to post
Share on other sites

Given eight integers, what is the fastest way to check if they are all the same or not? Maybe, some clever bit manipulation?

Following appeals to me.

 

bool EightNumbersEqual(a,b,c,d,e,f,g,h)

{

    return (a|b|c|d|e|f|g|h)==a ? true : false;

}

 

You can expand it to unknown amount of checked numbers, and I believe it will work for any numerical type since of the bits->number determinism.

Share this post


Link to post
Share on other sites

return (a|b|c|d|e|f|g|h)==a ? true : false;
 

 

Nope. That will lead to all kinds of fun bugs.  Any value can be any value as long as it sets no more bits than were set by a.

 

If a == 0x01, then values can be either 1 or 0.

 

If a == 0xff, then any value is allowed in all the other values.

 

(Extend it to whatever length or size is needed, maybe they're 8-bit chars, maybe they're 512 bit AVX registers, maybe they're some custom bignum class.)

Share this post


Link to post
Share on other sites

EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.

Assuming the array is originally a char*, there is no violation of the strict aliasing rule, as char* is the only datatype that can alias to anything. Though you are right we would have to concern about the compiler not blowing up the performance part; it may be faster/safer to just keep it always as uint64_t and use "& 0xFF" to get the table index due to language shenanigans.

Share this post


Link to post
Share on other sites

Wow, thanks for so many high-quality answers!!

 

I'll go with Álvaro's (brilliant) trick for now!

 

(I needed a fast function, because it's executed for each cell out of 33^3 (in each voxel chunk). If voxels at the cell's corners are different, then a material transition is taking place and the cell must contain a surface vertex.)

Share this post


Link to post
Share on other sites

bool EightNumbersEqual(a,b,c,d,e,f,g,h)

{

    return (a^b|a^c|a^d|a^e|a^f|a^g|a^h)==0 ? true : false;

}

 

Guys, I strongly believe this one works, but my question is... would this be any faster/slower than a single integer number multiplication?

Share this post


Link to post
Share on other sites

I needed a fast function, because it's executed for each cell out of 33^3 (in each voxel chunk). If voxels at the cell's corners are different, then a material transition is taking place and the cell must contain a surface vertex.

 

Profiler or it didn't happen.

Share this post


Link to post
Share on other sites

My favourite quick-prototyping tool for deciding whether or not a code rewrite has the intended effect on what the compiler produces:

 

Frob's loop code: https://godbolt.org/g/HEFMYz

 

Frob's manually unrolled code: https://godbolt.org/g/MZrQRH

 

JonnyCode's obfuscated XOR trickery: https://godbolt.org/g/3UAbN5

 

I'll admit that what JonnyCode's written is what you want, if you want a constant-time algorithm (which is something you want in some algorithms, to prevent a certain class of cryptographic attack)... but as you can see, the transformation from "do what I mean" code to the less-readable "do what I say" code does not have a noticeable effect on total instruction count or (admittedly eyeballed) clock cycles. In the worst case scenarios, there's just as many register/ALU round trips, and there's just as many fetches made.

 

But as Khatharr says, as will every other developer worth his salt: you don't know where the actual time is being spent in your algorithms until you profile it. Even when you guess correctly, it is only a guess. Analysis of evidence is more valuable than analysis of theory.

Share this post


Link to post
Share on other sites

This topic is 418 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this