Sign in to follow this  
Mantear

Bit-counting trick - new to me

Recommended Posts

Greetings, I was looking for the best way to count the number of bits set in a 32-bit number, and found this:
unsigned int BitCount(unsigned int v)
{
   v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
   v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
   return ((v + (v >> 4) & 0x0F0F0F0F) * 0x01010101) >> 24; // count
}
I had never seen this before and was wondering if it was a commonly known method or not. I think it's rather impressive. I also came across another method called MIT HAKMEM, but the modulo 63 seems to kill its performance.

Share this post


Link to post
Share on other sites
I've seen something like that before, though it had more steps. Very similar though. I like Kernighan's method personally, easy to understand and doesn't check useless bits:


unsigned int val, cnt;
for (cnt = 0; val; cnt++)
val &= val - 1;


Unfortunately they all pretty much suck since it's usually one long dependency chain, which is the justification behind the population count instruction. [grin]

Share this post


Link to post
Share on other sites
Quote:
Original post by Mantear
I also came across another method called MIT HAKMEM, but the modulo 63 seems to kill its performance.


That's not the name of the method; it's the name of a series of articles, one of which presented a method, among other things.

Share this post


Link to post
Share on other sites
For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a lookup table 256-byte lookup table that you index 4 times. I agree the method you posted is cool though.

Share this post


Link to post
Share on other sites
Quote:
Original post by cache_hit
For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a lookup table 256-byte lookup table that you index 4 times. I agree the method you posted is cool though.


What about a 64k lookup table that you index into twice?

Also, some architectures have a popcount instruction, which is likely much faster.

Share this post


Link to post
Share on other sites
Quote:
Original post by cache_hit
For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a lookup table 256-byte lookup table that you index 4 times. I agree the method you posted is cool though.
That would depend on just how fast the processor is at doing memory lookups, or didn't we learn anything from the "Can this be optimized?" thread? Besides 16-bit LUTs would be faster on quite a few systems.

Aside from processors with population count instructions you can also get a lot of performance from the parallel shift-and-add trick on large bit vectors. That is by unrolling using SIMD instructions you can get a high degree of instruction-level parallelism as well as conveniently process large words within a single instruction, not to mention that the later iterations for dealing with the larger bit chunks need only be done once for several input words.


Besides which I'm disappointed in you people, I thought I had found the perfect thread to sneak in that earth-shattering link but no one noticed. What is the Internet coming to if you can't get people to click on random links?

edit: Beaten to the punch by RDragon1. How did I not notice your post?

Share this post


Link to post
Share on other sites
Quote:
Original post by cache_hit
For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a lookup table 256-byte lookup table that you index 4 times. I agree the method you posted is cool though.


The article already posted above (http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) says the code from the OP really is the fastest, faster than the lookup table.

I still don't get how it works though, or what numbers you have to put in there to make it work with 64-bit or 128-bit integers.

Maybe those numbers are some sort of bit filters? Because 0x55555555 is 01010101010101010101010101010101, 0x33333333 is 00110011001100110011001100110011, 0x0F0F0F0F is 00001111000011110000111100001111, etc...

Share this post


Link to post
Share on other sites
Quote:
Original post by RDragon1
Quote:
Original post by cache_hit
For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a lookup table 256-byte lookup table that you index 4 times. I agree the method you posted is cool though.


What about a 64k lookup table that you index into twice?

Also, some architectures have a popcount instruction, which is likely much faster.


Alright alright you win. Just that 64k is orders of magnitude larger than 256 bytes, often the tradeoff is not worth it. You could also make a 32MB lookup table if you really wanted to :)


The link showing the different methods is interesting. I'd have to benchmark it to be convinced that the method is faster than a lookup table. I can see it possibly being faster when the lookup table is small relative to the number of bits in the integer being counted (for example a 256-byte lookup table on a 32-bit integer), but not when the lookup table is relatively near in size to the number of bits in the integer.

Share this post


Link to post
Share on other sites
Out of curiosity have any of you guys found any interesting uses for the population count operation? I've only ever used it to deal with sparse array bitmaps myself.

Collectors of bit-hacks seem to spend an inordinate amount of effort on it though so I suppose it must have some other uses..

Share this post


Link to post
Share on other sites
Quote:
Original post by implicit
Collectors of bit-hacks seem to spend an inordinate amount of effort on it though so I suppose it must have some other uses..


It could be, but I don't think that's required :)

Share this post


Link to post
Share on other sites
Quote:
Original post by LodeI still don't get how it works though, or what numbers you have to put in there to make it work with 64-bit or 128-bit integers.


Here's my try at explaining it:


Here's the C code in an easier to read format:
Taken from http://bits.stephan-brumme.com/countBits.html

01: unsigned int countBits(unsigned int x)
02: {
03: // count bits of each 2-bit chunk
04: x = x - ((x >> 1) & 0x55555555);
05: // count bits of each 4-bit chunk
06: x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
07: // count bits of each 8-bit chunk
08: x = x + (x >> 4);
09: // mask out junk
10: x &= 0xF0F0F0F;
11: // add all four 8-bit chunks
12: return (x * 0x01010101) >> 24;
13: }

I'll go through it step by step with an 8-bit example

Bit labels: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
Number: 1 1 0 0 1 0 0 1


Step 1: Count the number of bits in every 2-bit chunk.
This means we want to count how many bits are set in [1,0], [3,2], [5,4], and [7,6].
The result will be put into the same bit slots that we are counting.

Mask used: 0 1 0 1 0 1 0 1
So for every two bits, the mask is 01

Notice how the substraction gives the correct value for every possible bit value.

Bits 1, 0: 01
- 00 (01 shifted to the right one then ANDED with 01)
-----
01

Bits 3, 2: 10
- 01 (10 shifted to the right one then ANDED with 01)
-----
01

Bits 5, 4: 00
- 00 (00 shifted to the right one then ANDED with 01)
-----
00

Bits 7, 6: 11
- 01 (11 shifted to the right one then ANDED with 01)
-----
10

Value of x after Step 1: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
1 0 0 0 0 1 0 1

Every two bits now contains the number of bits that were set in the corresponding two bits of the original value.


Step 2: Add each 2-bit count to its neighbor.

This means we want to add [1,0] to [3,2] and [7,6] to [5,4].

Mask used: 0 0 1 1 0 0 1 1
The mask is used to clear bits so that the final result can reside in the four adjacent bits.
So [3,2] + [1,0] will reside in [3,2,1,0] and [7,6] + [5,4] will be in [7,6,5,4].

Bits 3,2,1,0: 0001 (0101 ANDED with 0011)
+ 0001 (0101 shifted right 2, then ANDED with 0011)
-----
0010


Bits 7,6,5,4: 0000 (1000 ANDED with 0011)
+ 0010 (1000 shifted right 2, then ANDED with 0011)
-----
0010

Value of x after Step 2: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 0 1 0 0 0 1 0

Every four bits now contains the number of bits that were set in the corresponding 4 bits of the original value.

Step 3: Add each 4-bit count to its neighbor.

This means we want to add [3,2,1,0] to [7,6,5,4].

No mask needed.
The result will be placed in the corresponding 8 bits.

Bits 7,6,5,4,3,2,1,0: 00100010
+ 00000010 (00100010 shifted right by 4)
----------
00100100

Value of x after step 3: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 0 1 0 0 1 0 0

The lower four bits of each byte now contains the number of bits set in the corresponding byte of the original value.

Step 4: Mask out top 4 bits of every byte, which is junk data.

Mask used: 1111000

Value of x after step 4: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 0 0 0 0 1 0 0

Step 5: Add lower 4 bits of every byte together.

For a single byte, this step is not really necessary. The final result is already available, which is 4.

So for step 5 and 6, we will switch over to 32-bit and a made up intermediate result shown below.
For any junk of data larger than a byte, then we need to do this.
What we have at the end of step 4 is the number of bits set in each byte, which are located in the lower 4 bits of each byte.
We want to add together all these numbers. Multiplying by 0x01010101 accomplishes this because multiplication is addition with a few shifts included.
The result will end up being in the most significant byte.
Here is what it looks like, X denotes a 0 added because of a shift:

00000100 00000011 00000110 00000001 (intermediate result, current x value)
x 00000001 00000001 00000001 00000001 (0x01010101 to be multiplied to intermediate result)
------------------------------------
|00000100|00000011 00000110 00000001
0|00000000|00000000 00000000 0000000X
00|00000000|00000000 00000000 000000XX
000|00000000|00000000 00000000 00000XXX
0000|00000000|00000000 00000000 0000XXXX
00000|00000000|00000000 00000000 000XXXXX
000000|00000000|00000000 00000000 00XXXXXX
0000000|00000000|00000000 00000000 0XXXXXXX
00000100|00000011|00000110 00000001 XXXXXXXX
| |
0 00000000|00000000|00000000 0000000X XXXXXXXX
00 00000000|00000000|00000000 000000XX XXXXXXXX
000 00000000|00000000|00000000 00000XXX XXXXXXXX
0000 00000000|00000000|00000000 0000XXXX XXXXXXXX
00000 00000000|00000000|00000000 000XXXXX XXXXXXXX
000000 00000000|00000000|00000000 00XXXXXX XXXXXXXX
0000000 00000000|00000000|00000000 0XXXXXXX XXXXXXXX
00000100 00000011|00000110|00000001 XXXXXXXX XXXXXXXX
| |
0 00000000 00000000|00000000|0000000X XXXXXXXX XXXXXXXX
00 00000000 00000000|00000000|000000XX XXXXXXXX XXXXXXXX
000 00000000 00000000|00000000|00000XXX XXXXXXXX XXXXXXXX
0000 00000000 00000000|00000000|0000XXXX XXXXXXXX XXXXXXXX
00000 00000000 00000000|00000000|000XXXXX XXXXXXXX XXXXXXXX
000000 00000000 00000000|00000000|00XXXXXX XXXXXXXX XXXXXXXX
0000000 00000000 00000000|00000000|0XXXXXXX XXXXXXXX XXXXXXXX
00000100 00000011 00000110|00000001|XXXXXXXX XXXXXXXX XXXXXXXX
--------------------------------------------------------------
Final result: | |
[ All overflow, ignored ]|00001110|00000000 00000000 00000000

Value of x after step 5: 00001110 00000000 00000000 00000000

Step 6: Shift x right by 24 bits to put count into the lowest byte.

Final result: 00000000 00000000 00000000 00001110 (equal to 14)

Now the result can be read from x like any other integer.





Share this post


Link to post
Share on other sites
Quote:
Out of curiosity have any of you guys found any interesting uses for the population count operation? I've only ever used it to deal with sparse array bitmaps myself.

In the past, I've needed it for chess bitboards, processor affinity masks and memory allocator bitmaps. If the census transform turns out to be useful for the current stereo matching task, then that'd a good (very time-critical) example, too.

Share this post


Link to post
Share on other sites
Quote:
Original post by Jan Wassenberg
Quote:
Out of curiosity have any of you guys found any interesting uses for the population count operation? I've only ever used it to deal with sparse array bitmaps myself.

In the past, I've needed it for chess bitboards, processor affinity masks and memory allocator bitmaps. If the census transform turns out to be useful for the current stereo matching task, then that'd a good (very time-critical) example, too.


From Wikipedia: "Cray supercomputers early on featured a population count machine instruction, rumoured to have been specifically requested by the U.S. government National Security Agency for cryptanalysis applications."

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this