# Bit-counting trick - new to me

## Recommended Posts

Greetings, I was looking for the best way to count the number of bits set in a 32-bit number, and found this:

```c
unsigned int BitCount(unsigned int v)
{
    v = v - ((v >> 1) & 0x55555555);                      // reuse input as temporary
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);       // temp
    return ((v + (v >> 4) & 0x0F0F0F0F) * 0x01010101) >> 24; // count
}
```

I had never seen this before and was wondering if it was a commonly known method or not. I think it's rather impressive. I also came across another method called MIT HAKMEM, but the modulo 63 seems to kill its performance.
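For reference, the HAKMEM bit-count is usually written like this (a sketch of the 32-bit version; the constants are octal):

```c
/* HAKMEM-style bit count. Each step sums bits within 3-bit octal
   digits; the final % 63 folds the digit sums together, and that
   modulo is the expensive part on hardware without fast division. */
unsigned int hakmem_count(unsigned int v)
{
    unsigned int tmp = v - ((v >> 1) & 033333333333)
                         - ((v >> 2) & 011111111111);
    return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
```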

##### Share on other sites
Did you have a question about how the method works or did you just want to show us a cool hack?

If you find this sort of thing entertaining then I can warmly recommend Hacker's Delight.

##### Share on other sites
It's fairly well known. See http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive for some other options.

##### Share on other sites
I've seen something like that before, though it had more steps. Very similar though. I like Kernighan's method personally; it's easy to understand and doesn't check useless bits:

```c
unsigned int val, cnt;
for (cnt = 0; val; cnt++)
    val &= val - 1;
```

Unfortunately they all pretty much suck since it's usually one long dependency chain, which is the justification behind the population count instruction. [grin]
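On that note, if you're on GCC or Clang you can ask the compiler for the instruction directly, with Kernighan's loop as a portable fallback; a sketch:

```c
/* Sketch: GCC/Clang's __builtin_popcount compiles to a single POPCNT
   instruction on targets that support it, avoiding the dependency
   chain entirely; otherwise fall back to Kernighan's loop. */
unsigned int popcount(unsigned int v)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned int)__builtin_popcount(v);
#else
    unsigned int cnt;
    for (cnt = 0; v; cnt++)
        v &= v - 1;   /* clears the lowest set bit each iteration */
    return cnt;
#endif
}
```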

##### Share on other sites
Quote:
 Original post by Mantear
 I also came across another method called MIT HAKMEM, but the modulo 63 seems to kill its performance.

That's not the name of the method; it's the name of a series of articles, one of which presented a method, among other things.

##### Share on other sites
For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a 256-byte lookup table that you index four times. I agree the method you posted is cool though.
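A minimal sketch of that table approach might look like this (`init_table` and `count_lut` are names I made up; the table could also be generated at compile time):

```c
#include <stdint.h>

/* Each table entry holds the popcount of its index. */
static unsigned char bit_table[256];

static void init_table(void)
{
    for (int i = 1; i < 256; i++)
        bit_table[i] = bit_table[i >> 1] + (unsigned char)(i & 1);
}

/* One lookup per byte of the 32-bit input. */
static unsigned int count_lut(uint32_t v)
{
    return bit_table[v & 0xFF]
         + bit_table[(v >> 8) & 0xFF]
         + bit_table[(v >> 16) & 0xFF]
         + bit_table[v >> 24];
}
```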

##### Share on other sites
Quote:
 Original post by cache_hit
 For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a 256-byte lookup table that you index four times. I agree the method you posted is cool though.

What about a 64k lookup table that you index into twice?

Also, some architectures have a popcount instruction, which is likely much faster.

##### Share on other sites
Quote:
 Original post by cache_hit
 For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a 256-byte lookup table that you index four times. I agree the method you posted is cool though.
That would depend on just how fast the processor is at doing memory lookups; or didn't we learn anything from the "Can this be optimized?" thread? Besides, 16-bit LUTs would be faster on quite a few systems.

Aside from processors with population count instructions, you can also get a lot of performance from the parallel shift-and-add trick on large bit vectors. That is, by unrolling with SIMD instructions you get a high degree of instruction-level parallelism, and you conveniently process large words within a single instruction. On top of that, the later iterations, which deal with the larger bit chunks, only need to be done once for several input words.
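A scalar sketch of that deferral idea (plain C rather than SIMD; `popcount_vector` is a name I made up): each word is reduced to per-byte counts, those accumulate across a batch, and the final horizontal reduction runs once per batch.

```c
#include <stdint.h>
#include <stddef.h>

/* Count set bits across a vector of 64-bit words. Each word is reduced
   to per-byte counts (at most 8 per byte), which accumulate across up
   to 31 words (31 * 8 = 248 < 256, so bytes can't overflow) before the
   final reduction collapses the whole batch at once. */
static uint64_t popcount_vector(const uint64_t *p, size_t n)
{
    uint64_t total = 0;
    while (n) {
        size_t batch = n < 31 ? n : 31;
        uint64_t acc = 0;
        for (size_t i = 0; i < batch; i++) {
            uint64_t x = p[i];
            x = x - ((x >> 1) & 0x5555555555555555ULL);
            x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
            x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
            acc += x;  /* per-byte counts accumulate across the batch */
        }
        /* widen bytes to 16-bit lanes, then sum the four lanes */
        acc = (acc & 0x00FF00FF00FF00FFULL) + ((acc >> 8) & 0x00FF00FF00FF00FFULL);
        total += (acc * 0x0001000100010001ULL) >> 48;
        p += batch;
        n -= batch;
    }
    return total;
}
```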

Besides which I'm disappointed in you people, I thought I had found the perfect thread to sneak in that earth-shattering link but no one noticed. What is the Internet coming to if you can't get people to click on random links?

edit: Beaten to the punch by RDragon1. How did I not notice your post?

##### Share on other sites
Quote:
 Original post by cache_hit
 For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a 256-byte lookup table that you index four times. I agree the method you posted is cool though.

The article already posted above (http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) says the code from the OP really is the fastest, faster than the lookup table.

I still don't get how it works though, or what numbers you have to put in there to make it work with 64-bit or 128-bit integers.

Maybe those numbers are some sort of bit filters? Because 0x55555555 is 01010101010101010101010101010101, 0x33333333 is 00110011001100110011001100110011, 0x0F0F0F0F is 00001111000011110000111100001111, etc...

##### Share on other sites
Quote:
Original post by RDragon1
Quote:
 Original post by cache_hit
 For what it's worth, the fastest way of computing the number of 1-bits in an integer is with a 256-byte lookup table that you index four times. I agree the method you posted is cool though.

What about a 64k lookup table that you index into twice?

Also, some architectures have a popcount instruction, which is likely much faster.

Alright, alright, you win. It's just that 64k is orders of magnitude larger than 256 bytes, so often the tradeoff is not worth it. You could also make a 32MB lookup table if you really wanted to :)

The link showing the different methods is interesting. I'd have to benchmark it to be convinced that the method is faster than a lookup table. I can see it possibly being faster when the lookup table is small relative to the number of bits in the integer being counted (for example a 256-byte lookup table on a 32-bit integer), but not when the lookup table is relatively near in size to the number of bits in the integer.
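For reference, the two-lookup variant under discussion would look something like this (`table16`, `init_table16`, and `count16` are made-up names; whether the 64 KiB table beats the 256-byte one depends heavily on how much of it stays in cache):

```c
#include <stdint.h>

/* 64 KiB table: one entry per 16-bit value. */
static unsigned char table16[1 << 16];

static void init_table16(void)
{
    for (unsigned int i = 1; i < (1u << 16); i++)
        table16[i] = table16[i >> 1] + (unsigned char)(i & 1);
}

/* Two lookups cover a 32-bit input. */
static unsigned int count16(uint32_t v)
{
    return table16[v & 0xFFFF] + table16[v >> 16];
}
```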

##### Share on other sites
Quote:
 Original post by Lode
 I still don't get how it works though, or what numbers you have to put in there to make it work with 64-bit or 128-bit integers.

Wikipedia explains it in detail: http://en.wikipedia.org/wiki/Hamming_weight
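For the 64-bit case specifically, the same alternating masks extend naturally and the final shift becomes 56, so the accumulated count in the top byte lands in the low byte; a sketch:

```c
#include <stdint.h>

/* 64-bit version of the parallel bit count: same masks, widened to
   64 bits, with the final shift changed from 24 to 56. */
unsigned int bitcount64(uint64_t v)
{
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned int)((v * 0x0101010101010101ULL) >> 56);
}
```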

##### Share on other sites
Out of curiosity have any of you guys found any interesting uses for the population count operation? I've only ever used it to deal with sparse array bitmaps myself.

Collectors of bit-hacks seem to spend an inordinate amount of effort on it though so I suppose it must have some other uses..

##### Share on other sites
Quote:
 Original post by implicit
 Collectors of bit-hacks seem to spend an inordinate amount of effort on it though so I suppose it must have some other uses..

It could be, but I don't think that's required :)

##### Share on other sites
Quote:
 Original post by implicit
 Out of curiosity have any of you guys found any interesting uses for the population count operation?

parity_bit(x) = bitcount(x) & 1
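In code, that one-liner might look like this (using Kernighan's loop as the bit counter to keep the sketch self-contained):

```c
/* parity_bit returns 1 if x has an odd number of set bits, 0 otherwise;
   any popcount implementation works here. */
unsigned int parity_bit(unsigned int x)
{
    unsigned int cnt;
    for (cnt = 0; x; cnt++)
        x &= x - 1;   /* clears the lowest set bit */
    return cnt & 1;
}
```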

##### Share on other sites
Quote:
 Original post by Lode
 I still don't get how it works though, or what numbers you have to put in there to make it work with 64-bit or 128-bit integers.

Here's my try at explaining it.

Here's the C code in an easier-to-read format (taken from http://bits.stephan-brumme.com/countBits.html):

```c
unsigned int countBits(unsigned int x)
{
    // count bits of each 2-bit chunk
    x = x - ((x >> 1) & 0x55555555);
    // count bits of each 4-bit chunk
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    // count bits of each 8-bit chunk
    x = x + (x >> 4);
    // mask out junk
    x &= 0xF0F0F0F;
    // add all four 8-bit chunks
    return (x * 0x01010101) >> 24;
}
```

I'll go through it step by step with an 8-bit example.

    Bit labels: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
    Number:       1   1   0   0   1   0   0   1

Step 1: Count the number of bits in every 2-bit chunk. This means we want to count how many bits are set in [1,0], [3,2], [5,4], and [7,6]. The result will be put into the same bit slots that we are counting. The mask used is 01010101, so for every two bits the mask is 01. Notice how the subtraction gives the correct count for every possible 2-bit value:

    Bits 1,0:    01
               - 00   (01 shifted right one, then ANDed with 01)
               ----
                 01
    Bits 3,2:    10
               - 01   (10 shifted right one, then ANDed with 01)
               ----
                 01
    Bits 5,4:    00
               - 00   (00 shifted right one, then ANDed with 01)
               ----
                 00
    Bits 7,6:    11
               - 01   (11 shifted right one, then ANDed with 01)
               ----
                 10

    Value of x after Step 1: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
                               1   0   0   0   0   1   0   1

Every two bits now contain the number of bits that were set in the corresponding two bits of the original value.

Step 2: Add each 2-bit count to its neighbor. This means we want to add [1,0] to [3,2] and [7,6] to [5,4]. The mask used is 00110011; it clears bits so that each sum can reside in the four adjacent bits. So [3,2] + [1,0] will reside in [3,2,1,0] and [7,6] + [5,4] will be in [7,6,5,4].

    Bits 3,2,1,0:    0001   (0101 ANDed with 0011)
                   + 0001   (0101 shifted right 2, then ANDed with 0011)
                   ------
                     0010
    Bits 7,6,5,4:    0000   (1000 ANDed with 0011)
                   + 0010   (1000 shifted right 2, then ANDed with 0011)
                   ------
                     0010

    Value of x after Step 2: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
                               0   0   1   0   0   0   1   0

Every four bits now contain the number of bits that were set in the corresponding four bits of the original value.

Step 3: Add each 4-bit count to its neighbor. This means we want to add [3,2,1,0] to [7,6,5,4]. No mask is needed; the result will be placed in the corresponding 8 bits.

    Bits 7..0:    00100010
                + 00000010   (00100010 shifted right by 4)
                ----------
                  00100100

    Value of x after Step 3: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
                               0   0   1   0   0   1   0   0

The lower four bits of each byte now contain the number of bits set in the corresponding byte of the original value.

Step 4: Mask out the top 4 bits of every byte, which are junk data. The mask used is 00001111 (0x0F for each byte).

    Value of x after Step 4: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
                               0   0   0   0   0   1   0   0

Step 5: Add the lower 4 bits of every byte together. For a single byte, this step is not really necessary; the final result, 4, is already available. So for steps 5 and 6 we will switch over to 32 bits and a made-up intermediate result shown below. For any chunk of data larger than a byte, we need this step. What we have at the end of step 4 is the number of bits set in each byte, located in the lower 4 bits of each byte, and we want to add all of these numbers together.

Multiplying by 0x01010101 accomplishes this, because multiplication is just addition of shifted copies, and the byte sums end up accumulating in the most significant byte. Take the made-up intermediate value 00000100 00000011 00000110 00000001 (byte counts 4, 3, 6, and 1):

    x * 0x01010101 = x + (x << 8) + (x << 16) + (x << 24)

      00000100 00000011 00000110 00000001   (x)
    + 00000011 00000110 00000001 00000000   (x << 8,  bits shifted past bit 31 discarded)
    + 00000110 00000001 00000000 00000000   (x << 16, bits shifted past bit 31 discarded)
    + 00000001 00000000 00000000 00000000   (x << 24, bits shifted past bit 31 discarded)
    -------------------------------------
      00001110 ........ ........ ........

The most significant byte now holds 4 + 3 + 6 + 1 = 14; the lower bytes hold partial sums we don't care about.

    Value of x after Step 5: 00001110 00000000 00000000 00000000

Step 6: Shift x right by 24 bits to put the count into the lowest byte.

    Final result: 00000000 00000000 00000000 00001110 (equal to 14)

Now the result can be read from x like any other integer.
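To sanity-check the walkthrough, here's a quick sketch (my own test, not from the article) comparing the parallel version against a naive one-bit-at-a-time loop:

```c
/* The parallel bit count from above. */
unsigned int countBits(unsigned int x)
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    return (x * 0x01010101) >> 24;
}

/* Naive reference: test one bit at a time. */
unsigned int countBitsNaive(unsigned int x)
{
    unsigned int c = 0;
    for (; x; x >>= 1)
        c += x & 1;
    return c;
}
```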

##### Share on other sites
Quote:
 Out of curiosity have any of you guys found any interesting uses for the population count operation? I've only ever used it to deal with sparse array bitmaps myself.

In the past, I've needed it for chess bitboards, processor affinity masks and memory allocator bitmaps. If the census transform turns out to be useful for the current stereo matching task, then that'd be a good (very time-critical) example, too.

##### Share on other sites
Quote:
Original post by Jan Wassenberg
Quote:
 Out of curiosity have any of you guys found any interesting uses for the population count operation? I've only ever used it to deal with sparse array bitmaps myself.

In the past, I've needed it for chess bitboards, processor affinity masks and memory allocator bitmaps. If the census transform turns out to be useful for the current stereo matching task, then that'd be a good (very time-critical) example, too.

From Wikipedia: "Cray supercomputers early on featured a population count machine instruction, rumoured to have been specifically requested by the U.S. government National Security Agency for cryptanalysis applications."

##### Share on other sites
It's useful when you're calculating the results of a binary correlation between two bit vectors. Very common in communication systems.
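For instance, with two 32-bit words each encoding 32 binary (+1/-1) symbols, the correlation boils down to an XOR plus a popcount; a sketch (`bit_correlation` is a made-up name):

```c
/* XOR marks the disagreeing bit positions; the popcount of the XOR is
   the Hamming distance d, so the +1/-1 correlation sum over 32 symbols
   is (agreements - disagreements) = 32 - 2*d. */
int bit_correlation(unsigned int a, unsigned int b)
{
    unsigned int d = a ^ b;
    unsigned int cnt;
    for (cnt = 0; d; cnt++)   /* Kernighan popcount */
        d &= d - 1;
    return 32 - 2 * (int)cnt;
}
```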
